All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] vhost: improve dirty pages logging performance
@ 2018-04-30 15:59 Maxime Coquelin
  2018-05-03 11:56 ` Tiwei Bie
  0 siblings, 1 reply; 9+ messages in thread
From: Maxime Coquelin @ 2018-04-30 15:59 UTC (permalink / raw)
  To: dev, jianfeng.tan, tiwei.bie, mst; +Cc: stable, Maxime Coquelin

This patch caches all dirty pages logging until the used ring index
is updated. These dirty pages won't be accessed by the guest as
long as the host doesn't give them back to it by updating the
index.

The goal of this optimization is to fix a performance regression
introduced when the vhost library started to use atomic operations
to set bits in the shared dirty log map. While the fix was valid
as previous implementation wasn't safe against concurent accesses,
contention was induced.

With this patch, during migration, we have:
1. Less atomic operations as only a single atomic OR operation
per 32 pages.
2. Less atomic operations as during a burst, the same page will
be marked dirty only once.
3. Less write memory barriers.

Fixes: 897f13a1f726 ("vhost: make page logging atomic")

Cc: stable@dpdk.org

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

Hi,

This series was tested with migrating a guest while running PVP
benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.

With this patch we recover the packet drops regressions seen since
the use of atomic operations to log dirty pages.

Some numbers:

A) PVP Live migration using testpmd (Single queue pair)
-------------------------------------------------------

Without patch:
=========================Stream Rate: 1Mpps==============
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      134     18966        16   11628963.0
 1       1Mpps      125     18790        16    8436300.0
 2       1Mpps      122     19171        15   13881342.0
 3       1Mpps      132     18913        15   12079492.0
<------------------------Summary------------------------>
   Max   1Mpps      134     19171        16     13881342
   Min   1Mpps      122     18790        15      8436300
  Mean   1Mpps      128     18960        15     11506524
Median   1Mpps      128     18939        15     11854227
 Stdev       0     5.68    158.81      0.58   2266371.52

=========================================================

With patch:
=========================Stream Rate: 1Mpps==============
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      119     13521        15     478174.0
 1       1Mpps      116     14084        14     452018.0
 2       1Mpps      122     13908        14     464486.0
 3       1Mpps      129     13787        16     478234.0
<------------------------Summary------------------------>
   Max   1Mpps      129     14084        16       478234
   Min   1Mpps      116     13521        14       452018
  Mean   1Mpps      121     13825        14       468228
Median   1Mpps      120     13847        14       471330
 Stdev       0     5.57    236.52      0.96     12593.78
=========================================================


B) OVS-DPDK migration with 2 queue pairs
----------------------------------------

Without patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      146     20270       116   15937394.0
 1       1Mpps      150     25617        16   11120370.0
 2       1Mpps      138     40983       115   24779971.0
 3       1Mpps      161     20435        17   15363519.0
<------------------------Summary------------------------>
   Max   1Mpps      161     40983       116     24779971
   Min   1Mpps      138     20270        16     11120370
  Mean   1Mpps      148     26826        66     16800313
Median   1Mpps      148     23026        66     15650456
 Stdev       0     9.57    9758.9     57.16   5737179.93

=========================================================

With patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      155     18915        17     481330.0
 1       1Mpps      156     21097        18     370556.0
 2       1Mpps      156     49610        15     369291.0
 3       1Mpps      144     31124        15     361914.0
<------------------------Summary------------------------>
   Max   1Mpps      156     49610        18       481330
   Min   1Mpps      144     18915        15       361914
  Mean   1Mpps      152     30186        16       395772
Median   1Mpps      155     26110        16       369923
 Stdev       0     5.85  13997.82       1.5     57165.33
=========================================================

C) OVS-DPDK migration with single queue pair
--------------------------------------------

Without patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      129     17411        15   11105414.0
 1       1Mpps      130     16544        15    8028438.0
 2       1Mpps      132     15202        15    7491584.0
 3       1Mpps      133     18100        15    8385047.0
<------------------------Summary------------------------>
   Max   1Mpps      133     18100        15     11105414
   Min   1Mpps      129     15202        15      7491584
  Mean   1Mpps      131     16814        15      8752620
Median   1Mpps      131     16977        15      8206742
 Stdev       0     1.83   1249.22       0.0   1610941.84
=========================================================

With patch:
=======================Stream Rate: 1Mpps================
No Stream_Rate Downtime Totaltime Ping_Loss trex_Loss
 0       1Mpps      117     12966        14     275873.0
 1       1Mpps      127     13398        14     460323.0
 2       1Mpps      119     13216        14     280831.0
 3       1Mpps      131     13544        15     465951.0
<------------------------Summary------------------------>
   Max   1Mpps      131     13544        15       465951
   Min   1Mpps      117     12966        14       275873
  Mean   1Mpps      123     13281        14       370744
Median   1Mpps      123     13307        14       370577
 Stdev       0     6.61     249.2       0.5     106729.6
=========================================================



 lib/librte_vhost/vhost.h      | 96 +++++++++++++++++++++++++++++++++++++++++++
 lib/librte_vhost/virtio_net.c | 29 ++++++++-----
 2 files changed, 115 insertions(+), 10 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 891978131..0d9129504 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -36,6 +36,8 @@
 
 #define BUF_VECTOR_MAX 256
 
+#define VHOST_LOG_CACHE_NR 32
+
 /**
  * Structure contains buffer address, length and descriptor index
  * from vring to do scatter RX.
@@ -69,6 +71,14 @@ struct batch_copy_elem {
 	uint64_t log_addr;
 };
 
+/*
+ * Structure that contains the info for batched dirty logging.
+ */
+struct log_cache_entry {
+	uint32_t offset;
+	uint32_t val;
+};
+
 /**
  * Structure contains variables relevant to RX/TX virtqueues.
  */
@@ -112,6 +122,9 @@ struct vhost_virtqueue {
 	struct batch_copy_elem	*batch_copy_elems;
 	uint16_t		batch_copy_nb_elems;
 
+	struct log_cache_entry log_cache[VHOST_LOG_CACHE_NR];
+	uint16_t log_cache_nb_elem;
+
 	rte_rwlock_t	iotlb_lock;
 	rte_rwlock_t	iotlb_pending_lock;
 	struct rte_mempool *iotlb_pool;
@@ -340,6 +353,89 @@ vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 	}
 }
 
+static __rte_always_inline void
+vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
+{
+	uint32_t *log_base;
+	int i;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base))
+		return;
+
+	log_base = (uint32_t *)(uintptr_t)dev->log_base;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+		__sync_fetch_and_or(log_base + elem->offset, elem->val);
+	}
+
+	vq->log_cache_nb_elem = 0;
+}
+
+static __rte_always_inline void
+vhost_log_cache_page(struct virtio_net *dev, struct vhost_virtqueue *vq,
+					 uint64_t page)
+{
+	uint32_t bit_nr = page % 32;
+	uint32_t offset = page / 32;
+	int i;
+
+	for (i = 0; i < vq->log_cache_nb_elem; i++) {
+		struct log_cache_entry *elem = vq->log_cache + i;
+
+		if (elem->offset == offset) {
+			elem->val |= (1U << bit_nr);
+			return;
+		}
+	}
+
+	if (unlikely(i >= VHOST_LOG_CACHE_NR)) {
+		/*
+		 * No more room for a new log cache entry,
+		 * so write the dirty log map directly.
+		 */
+		rte_smp_wmb();
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+
+		return;
+	}
+
+	vq->log_cache[i].offset = offset;
+	vq->log_cache[i].val = (1U << bit_nr);
+}
+
+static __rte_always_inline void
+vhost_log_cache_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+					  uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_cache_page(dev, vq, page);
+		page += 1;
+	}
+}
+
+static __rte_always_inline void
+vhost_log_cache_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		     uint64_t offset, uint64_t len)
+{
+	vhost_log_cache_write(dev, vq, vq->log_guest_addr + offset, len);
+}
+
 static __rte_always_inline void
 vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		     uint64_t offset, uint64_t len)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5fdd4172b..ac42cdbd0 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -78,7 +78,7 @@ do_flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	rte_memcpy(&vq->used->ring[to],
 			&vq->shadow_used_ring[from],
 			size * sizeof(struct vring_used_elem));
-	vhost_log_used_vring(dev, vq,
+	vhost_log_cache_used_vring(dev, vq,
 			offsetof(struct vring_used, ring[to]),
 			size * sizeof(struct vring_used_elem));
 }
@@ -106,6 +106,8 @@ flush_shadow_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq)
 
 	rte_smp_wmb();
 
+	vhost_log_cache_sync(dev, vq);
+
 	*(volatile uint16_t *)&vq->used->idx += vq->shadow_used_idx;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 		sizeof(vq->used->idx));
@@ -130,7 +132,7 @@ do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 
 	for (i = 0; i < count; i++) {
 		rte_memcpy(elem[i].dst, elem[i].src, elem[i].len);
-		vhost_log_write(dev, elem[i].log_addr, elem[i].len);
+		vhost_log_cache_write(dev, vq, elem[i].log_addr, elem[i].len);
 		PRINT_PACKET(dev, (uintptr_t)elem[i].dst, elem[i].len, 0);
 	}
 }
@@ -251,7 +253,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 		virtio_enqueue_offload(m,
 				(struct virtio_net_hdr *)(uintptr_t)desc_addr);
 		PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
-		vhost_log_write(dev, desc_gaddr, dev->vhost_hlen);
+		vhost_log_cache_write(dev, vq, desc_gaddr, dev->vhost_hlen);
 	} else {
 		struct virtio_net_hdr vnet_hdr;
 		uint64_t remain = dev->vhost_hlen;
@@ -274,7 +276,7 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 					(void *)(uintptr_t)src, len);
 
 			PRINT_PACKET(dev, (uintptr_t)dst, (uint32_t)len, 0);
-			vhost_log_write(dev, guest_addr, len);
+			vhost_log_cache_write(dev, vq, guest_addr, len);
 			remain -= len;
 			guest_addr += len;
 			dst += len;
@@ -355,7 +357,8 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 							desc_offset)),
 				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
 				cpy_len);
-			vhost_log_write(dev, desc_gaddr + desc_offset, cpy_len);
+			vhost_log_cache_write(dev, vq, desc_gaddr + desc_offset,
+					cpy_len);
 			PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
 				     cpy_len, 0);
 		} else {
@@ -444,7 +447,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		vq->used->ring[used_idx].id = desc_indexes[i];
 		vq->used->ring[used_idx].len = pkts[i]->pkt_len +
 					       dev->vhost_hlen;
-		vhost_log_used_vring(dev, vq,
+		vhost_log_cache_used_vring(dev, vq,
 			offsetof(struct vring_used, ring[used_idx]),
 			sizeof(vq->used->ring[used_idx]));
 	}
@@ -504,6 +507,8 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_smp_wmb();
 
+	vhost_log_cache_sync(dev, vq);
+
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx += count;
 	vhost_log_used_vring(dev, vq,
@@ -767,7 +772,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
 					PRINT_PACKET(dev, (uintptr_t)dst,
 							(uint32_t)len, 0);
-					vhost_log_write(dev, guest_addr, len);
+					vhost_log_cache_write(dev, vq,
+							guest_addr, len);
 
 					remain -= len;
 					guest_addr += len;
@@ -776,7 +782,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			} else {
 				PRINT_PACKET(dev, (uintptr_t)hdr_addr,
 						dev->vhost_hlen, 0);
-				vhost_log_write(dev, hdr_phys_addr,
+				vhost_log_cache_write(dev, vq, hdr_phys_addr,
 						dev->vhost_hlen);
 			}
 
@@ -790,7 +796,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
 							desc_offset)),
 				rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
 				cpy_len);
-			vhost_log_write(dev, desc_gaddr + desc_offset, cpy_len);
+			vhost_log_cache_write(dev, vq, desc_gaddr + desc_offset,
+					cpy_len);
 			PRINT_PACKET(dev, (uintptr_t)(desc_addr + desc_offset),
 				cpy_len, 0);
 		} else {
@@ -1320,7 +1327,7 @@ update_used_ring(struct virtio_net *dev, struct vhost_virtqueue *vq,
 {
 	vq->used->ring[used_idx].id  = desc_idx;
 	vq->used->ring[used_idx].len = 0;
-	vhost_log_used_vring(dev, vq,
+	vhost_log_cache_used_vring(dev, vq,
 			offsetof(struct vring_used, ring[used_idx]),
 			sizeof(vq->used->ring[used_idx]));
 }
@@ -1335,6 +1342,8 @@ update_used_idx(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	rte_smp_wmb();
 	rte_smp_rmb();
 
+	vhost_log_cache_sync(dev, vq);
+
 	vq->used->idx += count;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-04-30 15:59 [PATCH] vhost: improve dirty pages logging performance Maxime Coquelin
@ 2018-05-03 11:56 ` Tiwei Bie
  2018-05-04 15:48   ` Maxime Coquelin
  2018-05-15 13:50   ` Maxime Coquelin
  0 siblings, 2 replies; 9+ messages in thread
From: Tiwei Bie @ 2018-05-03 11:56 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable

On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> This patch caches all dirty pages logging until the used ring index
> is updated. These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.

Below sentence in above commit message isn't the reason why
we can cache the dirty page logging. Right?

"""
These dirty pages won't be accessed by the guest as
long as the host doesn't give them back to it by updating the
index.
"""

> 
> The goal of this optimization is to fix a performance regression
> introduced when the vhost library started to use atomic operations
> to set bits in the shared dirty log map. While the fix was valid
> as previous implementation wasn't safe against concurent accesses,
> contention was induced.
> 
> With this patch, during migration, we have:
> 1. Less atomic operations as only a single atomic OR operation
> per 32 pages.

Why not do it per 64 pages?

> 2. Less atomic operations as during a burst, the same page will
> be marked dirty only once.
> 3. Less write memory barriers.
> 
> Fixes: 897f13a1f726 ("vhost: make page logging atomic")
> 
> Cc: stable@dpdk.org
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> 
> Hi,
> 
> This series was tested with migrating a guest while running PVP
> benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.

If the throughput is higher (e.g. by adding more cores
and queues), will the live migration fail due to the
higher dirty page generating speed?

> 
> With this patch we recover the packet drops regressions seen since
> the use of atomic operations to log dirty pages.
[...]
>  
> +static __rte_always_inline void
> +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> +{
> +	uint32_t *log_base;
> +	int i;
> +
> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		   !dev->log_base))
> +		return;
> +
> +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> +
> +	/* To make sure guest memory updates are committed before logging */
> +	rte_smp_wmb();

It seems that __sync_fetch_and_or() can be considered a full
barrier [1]. So do we really need this rte_smp_wmb()?

Besides, based on the same doc [1], it seems that the __sync_
version is deprecated in favor of the __atomic_ one.

[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html

> +
> +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> +		struct log_cache_entry *elem = vq->log_cache + i;
> +
> +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> +	}
> +
> +	vq->log_cache_nb_elem = 0;
> +}
> +
[...]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-03 11:56 ` Tiwei Bie
@ 2018-05-04 15:48   ` Maxime Coquelin
  2018-05-04 18:54     ` Michael S. Tsirkin
  2018-05-07  3:49     ` Tiwei Bie
  2018-05-15 13:50   ` Maxime Coquelin
  1 sibling, 2 replies; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-04 15:48 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable

Hi Tiwei,

On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>> This patch caches all dirty pages logging until the used ring index
>> is updated. These dirty pages won't be accessed by the guest as
>> long as the host doesn't give them back to it by updating the
>> index.
> 
> Below sentence in above commit message isn't the reason why
> we can cache the dirty page logging. Right?
> 
> """
> These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.
> """
> 
>>
>> The goal of this optimization is to fix a performance regression
>> introduced when the vhost library started to use atomic operations
>> to set bits in the shared dirty log map. While the fix was valid
>> as previous implementation wasn't safe against concurent accesses,
>> contention was induced.
>>
>> With this patch, during migration, we have:
>> 1. Less atomic operations as only a single atomic OR operation
>> per 32 pages.
> 
> Why not do it per 64 pages?

I wasn't sure it would be supported on 32bits CPU, but [0] seems do
indicate that only 128bits atomic operations aren't supported on all
architectures.

I will change to do it per 64 pages.

[0]: 
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
>> 2. Less atomic operations as during a burst, the same page will
>> be marked dirty only once.
>> 3. Less write memory barriers.
>>
>> Fixes: 897f13a1f726 ("vhost: make page logging atomic")
>>
>> Cc: stable@dpdk.org
>>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> ---
>>
>> Hi,
>>
>> This series was tested with migrating a guest while running PVP
>> benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
> 
> If the throughput is higher (e.g. by adding more cores
> and queues), will the live migration fail due to the
> higher dirty page generating speed?

I haven't done the check, but I agree that the higher is the throughput,
the longer will be the migration duration and the higher the risk it
never converges.

In this case of scenario, postcopy live-migration way be a better fit,
as the hot pages will be copied only once. Postcopy support for
vhost-user have been added to QEMU in v2.12, and I have a prototype for
DPDK that I plan to propose for next release.

> 
>>
>> With this patch we recover the packet drops regressions seen since
>> the use of atomic operations to log dirty pages.
> [...]
>>   
>> +static __rte_always_inline void
>> +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
>> +{
>> +	uint32_t *log_base;
>> +	int i;
>> +
>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> +		   !dev->log_base))
>> +		return;
>> +
>> +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
>> +
>> +	/* To make sure guest memory updates are committed before logging */
>> +	rte_smp_wmb();
> 
> It seems that __sync_fetch_and_or() can be considered a full
> barrier [1]. So do we really need this rte_smp_wmb()?

That's a good point, thanks for the pointer.

> Besides, based on the same doc [1], it seems that the __sync_
> version is deprecated in favor of the __atomic_ one.

I will change to __atomic_. For the memory model, do you agree I should
use __ATOMIC_SEQ_CST?

> [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> 
>> +
>> +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
>> +		struct log_cache_entry *elem = vq->log_cache + i;
>> +
>> +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
>> +	}
>> +
>> +	vq->log_cache_nb_elem = 0;
>> +}
>> +
> [...]
> 

Thanks,
Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-04 15:48   ` Maxime Coquelin
@ 2018-05-04 18:54     ` Michael S. Tsirkin
  2018-05-07  3:49     ` Tiwei Bie
  1 sibling, 0 replies; 9+ messages in thread
From: Michael S. Tsirkin @ 2018-05-04 18:54 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: Tiwei Bie, dev, jianfeng.tan, stable

On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
> 
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> > 
> > Below sentence in above commit message isn't the reason why
> > we can cache the dirty page logging. Right?
> > 
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
> > """
> > 
> > > 
> > > The goal of this optimization is to fix a performance regression
> > > introduced when the vhost library started to use atomic operations
> > > to set bits in the shared dirty log map. While the fix was valid
> > > as previous implementation wasn't safe against concurent accesses,
> > > contention was induced.
> > > 
> > > With this patch, during migration, we have:
> > > 1. Less atomic operations as only a single atomic OR operation
> > > per 32 pages.
> > 
> > Why not do it per 64 pages?
> 
> I wasn't sure it would be supported on 32bits CPU, but [0] seems do
> indicate that only 128bits atomic operations aren't supported on all
> architectures.
> 
> I will change to do it per 64 pages.

""
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in function
if the specific data type size makes that possible; otherwise, an
external call is left to be resolved at run time. This external call is
the same format with the addition of a ‘size_t’ parameter inserted as
the first parameter indicating the size of the object being pointed to.
All objects must be the same size.
""

So I suspect it isn't well supported on all 32 bit CPUs, but
can't you do an atomic using long? That will do the right
thing depending on the CPU.

If not it's pretty easy to something along the lines of
#if BIT_PER_LONG == 64

> [0]: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins
> > > 2. Less atomic operations as during a burst, the same page will
> > > be marked dirty only once.
> > > 3. Less write memory barriers.
> > > 
> > > Fixes: 897f13a1f726 ("vhost: make page logging atomic")
> > > 
> > > Cc: stable@dpdk.org
> > > 
> > > Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> > > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > ---
> > > 
> > > Hi,
> > > 
> > > This series was tested with migrating a guest while running PVP
> > > benchmark at 1Mpps with both ovs-dpdk and testpmd as vswitch.
> > 
> > If the throughput is higher (e.g. by adding more cores
> > and queues), will the live migration fail due to the
> > higher dirty page generating speed?
> 
> I haven't done the check, but I agree that the higher is the throughput,
> the longer will be the migration duration and the higher the risk it
> never converges.

Right. It's not a good reason to just randomly slow down networking :)
In fact if data ends up on the same page, there is a chance that instead
of dirtying the same page multiple times we will dirty it once, which
will help migration converge.

> In this case of scenario, postcopy live-migration way be a better fit,
> as the hot pages will be copied only once. Postcopy support for
> vhost-user have been added to QEMU in v2.12, and I have a prototype for
> DPDK that I plan to propose for next release.
> 
> > 
> > > 
> > > With this patch we recover the packet drops regressions seen since
> > > the use of atomic operations to log dirty pages.
> > [...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > +	uint32_t *log_base;
> > > +	int i;
> > > +
> > > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > +		   !dev->log_base))
> > > +		return;
> > > +
> > > +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > +	/* To make sure guest memory updates are committed before logging */
> > > +	rte_smp_wmb();
> > 
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
> 
> That's a good point, thanks for the pointer.
> 
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
> 
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?
> 
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> > 
> > > +
> > > +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > +		struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > +	}
> > > +
> > > +	vq->log_cache_nb_elem = 0;
> > > +}
> > > +
> > [...]
> > 
> 
> Thanks,
> Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-04 15:48   ` Maxime Coquelin
  2018-05-04 18:54     ` Michael S. Tsirkin
@ 2018-05-07  3:49     ` Tiwei Bie
  2018-05-07  3:58       ` Michael S. Tsirkin
  1 sibling, 1 reply; 9+ messages in thread
From: Tiwei Bie @ 2018-05-07  3:49 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable

On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
> 
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
[...]
> > > +static __rte_always_inline void
> > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > +{
> > > +	uint32_t *log_base;
> > > +	int i;
> > > +
> > > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > +		   !dev->log_base))
> > > +		return;
> > > +
> > > +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > +
> > > +	/* To make sure guest memory updates are committed before logging */
> > > +	rte_smp_wmb();
> > 
> > It seems that __sync_fetch_and_or() can be considered a full
> > barrier [1]. So do we really need this rte_smp_wmb()?
> 
> That's a good point, thanks for the pointer.
> 
> > Besides, based on the same doc [1], it seems that the __sync_
> > version is deprecated in favor of the __atomic_ one.
> 
> I will change to __atomic_. For the memory model, do you agree I should
> use __ATOMIC_SEQ_CST?

Maybe we can use __ATOMIC_RELAXED and keep rte_smp_wmb().

Best regards,
Tiwei Bie

> 
> > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> > 
> > > +
> > > +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > +		struct log_cache_entry *elem = vq->log_cache + i;
> > > +
> > > +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > +	}
> > > +
> > > +	vq->log_cache_nb_elem = 0;
> > > +}
> > > +
> > [...]
> > 
> 
> Thanks,
> Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-07  3:49     ` Tiwei Bie
@ 2018-05-07  3:58       ` Michael S. Tsirkin
  0 siblings, 0 replies; 9+ messages in thread
From: Michael S. Tsirkin @ 2018-05-07  3:58 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: Maxime Coquelin, dev, jianfeng.tan, stable

On Mon, May 07, 2018 at 11:49:49AM +0800, Tiwei Bie wrote:
> On Fri, May 04, 2018 at 05:48:05PM +0200, Maxime Coquelin wrote:
> > Hi Tiwei,
> > 
> > On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> [...]
> > > > +static __rte_always_inline void
> > > > +vhost_log_cache_sync(struct virtio_net *dev, struct vhost_virtqueue *vq)
> > > > +{
> > > > +	uint32_t *log_base;
> > > > +	int i;
> > > > +
> > > > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > > > +		   !dev->log_base))
> > > > +		return;
> > > > +
> > > > +	log_base = (uint32_t *)(uintptr_t)dev->log_base;
> > > > +
> > > > +	/* To make sure guest memory updates are committed before logging */
> > > > +	rte_smp_wmb();
> > > 
> > > It seems that __sync_fetch_and_or() can be considered a full
> > > barrier [1]. So do we really need this rte_smp_wmb()?
> > 
> > That's a good point, thanks for the pointer.
> > 
> > > Besides, based on the same doc [1], it seems that the __sync_
> > > version is deprecated in favor of the __atomic_ one.
> > 
> > I will change to __atomic_. For the memory model, do you agree I should
> > use __ATOMIC_SEQ_CST?
> 
> Maybe we can use __ATOMIC_RELAXED and keep rte_smp_wmb().
> 
> Best regards,
> Tiwei Bie

Yes I think if you do your own barriers, you can use __ATOMIC_RELAXED
in theory.

One issue is that as one lwn article noted
"there will be some seriously suboptimal code production before
gcc-7.1".  So if you are using these new APIs, you should also check
that a relatively new compiler is used.


> > 
> > > [1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html
> > > 
> > > > +
> > > > +	for (i = 0; i < vq->log_cache_nb_elem; i++) {
> > > > +		struct log_cache_entry *elem = vq->log_cache + i;
> > > > +
> > > > +		__sync_fetch_and_or(log_base + elem->offset, elem->val);
> > > > +	}
> > > > +
> > > > +	vq->log_cache_nb_elem = 0;
> > > > +}
> > > > +
> > > [...]
> > > 
> > 
> > Thanks,
> > Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-03 11:56 ` Tiwei Bie
  2018-05-04 15:48   ` Maxime Coquelin
@ 2018-05-15 13:50   ` Maxime Coquelin
  2018-05-16  6:10     ` Tiwei Bie
  1 sibling, 1 reply; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-15 13:50 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable

Hi Tiwei,

I just see I missed to reply to your comment on my commit message:

On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>> This patch caches all dirty pages logging until the used ring index
>> is updated. These dirty pages won't be accessed by the guest as
>> long as the host doesn't give them back to it by updating the
>> index.
> Below sentence in above commit message isn't the reason why
> we can cache the dirty page logging. Right?
> 
> """
> These dirty pages won't be accessed by the guest as
> long as the host doesn't give them back to it by updating the
> index.

That's my understanding.
As long as the used index is not updated, the guest will not process
the descs.
If the migration converges between the time the descs are written,
and the time the used index is updated on source side. Then the guest
running on destination will not see the descriptors as used but as
available, and so will be overwritten by the vhost backend on
destination.

Regards,
Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-15 13:50   ` Maxime Coquelin
@ 2018-05-16  6:10     ` Tiwei Bie
  2018-05-16 15:00       ` Maxime Coquelin
  0 siblings, 1 reply; 9+ messages in thread
From: Tiwei Bie @ 2018-05-16  6:10 UTC (permalink / raw)
  To: Maxime Coquelin; +Cc: dev, jianfeng.tan, mst, stable

On Tue, May 15, 2018 at 03:50:54PM +0200, Maxime Coquelin wrote:
> Hi Tiwei,
> 
> I just see I missed to reply to your comment on my commit message:
> 
> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
> > On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
> > > This patch caches all dirty pages logging until the used ring index
> > > is updated. These dirty pages won't be accessed by the guest as
> > > long as the host doesn't give them back to it by updating the
> > > index.
> > Below sentence in above commit message isn't the reason why
> > we can cache the dirty page logging. Right?
> > 
> > """
> > These dirty pages won't be accessed by the guest as
> > long as the host doesn't give them back to it by updating the
> > index.
> 
> That's my understanding.
> As long as the used index is not updated, the guest will not process
> the descs.
> If the migration converges between the time the descs are written,
> and the time the used index is updated on source side. Then the guest
> running on destination will not see the descriptors as used but as
> available, and so will be overwritten by the vhost backend on
> destination.

If my understanding is correct, theoretically the vhost
backend can cache all the dirty page loggings before it
responds to the GET_VRING_BASE messages. Below are the
steps how QEMU live migration works (w/o postcopy):

1. Syncing dirty pages between src and dst;
2. The dirty page sync converges;
3. The src QEMU sends GET_VRING_BASE to vhost backend;
4. Vhost backend still has a chance to log some dirty
   pages before responding the GET_VRING_BASE messages;
5. The src QEMU receives GET_VRING_BASE response (which
   means the device has stopped);
6. QEMU sync the remaining dirty pages;
7. QEMU on the dst starts running.

(The steps 3~6 are the downtime which we want to minimize)

So I think the words in commit log isn't really related
to why we can cache the dirty page loggings.

Best regards,
Tiwei Bie

> 
> Regards,
> Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] vhost: improve dirty pages logging performance
  2018-05-16  6:10     ` Tiwei Bie
@ 2018-05-16 15:00       ` Maxime Coquelin
  0 siblings, 0 replies; 9+ messages in thread
From: Maxime Coquelin @ 2018-05-16 15:00 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: dev, jianfeng.tan, mst, stable



On 05/16/2018 08:10 AM, Tiwei Bie wrote:
> On Tue, May 15, 2018 at 03:50:54PM +0200, Maxime Coquelin wrote:
>> Hi Tiwei,
>>
>> I just see I missed to reply to your comment on my commit message:
>>
>> On 05/03/2018 01:56 PM, Tiwei Bie wrote:
>>> On Mon, Apr 30, 2018 at 05:59:54PM +0200, Maxime Coquelin wrote:
>>>> This patch caches all dirty pages logging until the used ring index
>>>> is updated. These dirty pages won't be accessed by the guest as
>>>> long as the host doesn't give them back to it by updating the
>>>> index.
>>> Below sentence in above commit message isn't the reason why
>>> we can cache the dirty page logging. Right?
>>>
>>> """
>>> These dirty pages won't be accessed by the guest as
>>> long as the host doesn't give them back to it by updating the
>>> index.
>>
>> That's my understanding.
>> As long as the used index is not updated, the guest will not process
>> the descs.
>> If the migration converges between the time the descs are written,
>> and the time the used index is updated on source side. Then the guest
>> running on destination will not see the descriptors as used but as
>> available, and so will be overwritten by the vhost backend on
>> destination.
> 
> If my understanding is correct, theoretically the vhost
> backend can cache all the dirty page loggings before it
> responds to the GET_VRING_BASE messages. Below are the
> steps how QEMU live migration works (w/o postcopy):
> 
> 1. Syncing dirty pages between src and dst;
> 2. The dirty page sync converges;
> 3. The src QEMU sends GET_VRING_BASE to vhost backend;
> 4. Vhost backend still has a chance to log some dirty
>     pages before responding the GET_VRING_BASE messages;
> 5. The src QEMU receives GET_VRING_BASE response (which
>     means the device has stopped);
> 6. QEMU sync the remaining dirty pages;
> 7. QEMU on the dst starts running.

Thanks for the clarification.

> (The steps 3~6 are the downtime which we want to minimize)

Right, we want to minimize the downtime, but in the same time,
be able to converge.

> So I think the words in commit log isn't really related
> to why we can cache the dirty page loggings.

I'll try to improve it in v3.

Thanks,
Maxime

> Best regards,
> Tiwei Bie
> 
>>
>> Regards,
>> Maxime

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-05-16 15:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-30 15:59 [PATCH] vhost: improve dirty pages logging performance Maxime Coquelin
2018-05-03 11:56 ` Tiwei Bie
2018-05-04 15:48   ` Maxime Coquelin
2018-05-04 18:54     ` Michael S. Tsirkin
2018-05-07  3:49     ` Tiwei Bie
2018-05-07  3:58       ` Michael S. Tsirkin
2018-05-15 13:50   ` Maxime Coquelin
2018-05-16  6:10     ` Tiwei Bie
2018-05-16 15:00       ` Maxime Coquelin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.