* [PATCH 0/4 for 2.3] vhost-user live migration support
@ 2015-12-02  3:43 Yuanhan Liu
  2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

This patch set adds the initial vhost-user live migration support.

The major task behind that is to log the pages we touch during
live migration. So this patch set is basically about adding vhost
log support and using it.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 sets the log_shmfd protocol feature bit, which actually
  enables the vhost-user live migration support.

A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK. Here is a guide
to setting up OVS + DPDK:

    http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

1. Start ovs-vswitchd

2. Add two ovs vhost-user ports, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. Run "ping $host" inside VM1

5. Start VM2 to connect to vhost1, and mark it as the target
   of live migration (by adding the -incoming tcp:0:4444 option)

   $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. Connect to the VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.


Note: this patch set is mostly based on Victor Kaplansky's demo
work (vhost-user-bridge) in the QEMU project. I was thinking of adding
Victor as the co-author. Victor, what do you think of that? :)

Comments are welcome!

---
Yuanhan Liu (4):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log vring changes
  vhost: enable log_shmfd protocol feature

 lib/librte_vhost/rte_virtio_net.h             | 35 ++++++++++++++
 lib/librte_vhost/vhost_rxtx.c                 | 70 ++++++++++++++++++---------
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 +++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
 lib/librte_vhost/virtio-net.c                 |  4 ++
 7 files changed, 145 insertions(+), 26 deletions(-)

-- 
1.9.0


* [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 13:53   ` Panu Matilainen
  2015-12-08  5:57   ` Xie, Huawei
  2015-12-02  3:43 ` [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

	typedef struct VhostUserLog {
		uint64_t mmap_size;
		uint64_t mmap_offset;
	} VhostUserLog;

Also, an fd is delivered from QEMU as ancillary data.

With that information, an area of memory is mmapped and assigned
to dev->log_base, for logging dirty pages.
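
The mapping step described above can be sketched roughly as follows.
This is a minimal, standalone illustration, not the code from the diff:
struct log_area and the log_area_*() helpers are hypothetical names that
stand in for dev->log_base/dev->log_size, and the demo uses a temporary
file in place of the fd QEMU would pass.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the log_base/log_size fields of the device. */
struct log_area {
	uint8_t  *base;	/* start of the dirty page bitmap */
	uint64_t  size;	/* size of the mapping, in bytes */
};

/*
 * Map `size` bytes of `fd` starting at `off`, mirroring the use of
 * mmap_size and mmap_offset. Returns 0 on success, -1 on failure.
 */
static int
log_area_map(struct log_area *log, int fd, uint64_t size, uint64_t off)
{
	void *addr;

	assert(log != NULL);
	if (fd < 0 || size == 0)
		return -1;

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, (off_t)off);
	if (addr == MAP_FAILED)
		return -1;

	log->base = addr;
	log->size = size;
	return 0;
}

/* Release the mapping again (assumed cleanup step). */
static void
log_area_unmap(struct log_area *log)
{
	if (log->base != NULL)
		munmap(log->base, log->size);
	log->base = NULL;
	log->size = 0;
}

/* Usage sketch: back the log with a temporary file instead of QEMU's fd. */
static int
log_area_demo(void)
{
	struct log_area log = { NULL, 0 };
	FILE *f = tmpfile();
	int ret = -1;

	if (f == NULL)
		return -1;
	if (ftruncate(fileno(f), 4096) == 0 &&
	    log_area_map(&log, fileno(f), 4096, 0) == 0) {
		log.base[0] |= 1;	/* mark page 0 dirty */
		ret = (log.base[0] == 1) ? 0 : -1;
		log_area_unmap(&log);
	}
	fclose(f);
	return ret;
}
```

Whether the received fd should be closed once the mapping exists, and
when the area gets unmapped, are left open in this sketch.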

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h             |  2 ++
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5687452..416dac2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -127,6 +127,8 @@ struct virtio_net {
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
+	uint64_t		log_size;	/**< Size of log area */
+	uint8_t			*log_base;	/**< Where dirty pages are logged */
 	void			*priv;		/**< private context */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 2dc0547..76bcac2 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,7 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
+		user_set_log_base(ctx, &msg);
+
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..1d705fd 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,47 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = addr;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0


* [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
  2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02  3:43 ` [PATCH 3/4] vhost: log vring changes Yuanhan Liu
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Introduce a vhost_log_write() helper function to log the dirty pages we
touched. The page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
page is represented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for the
page we are going to change, and sets it to 1. dev->log_base denotes the
start of the dirty page bitmap.

The page address is biased by log_guest_addr, which is derived from
SET_VRING_ADDR request as part of the vring related addresses.
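
As an illustration of the bit arithmetic described above, here is a
small standalone sketch. It is not the helper from the diff below:
log_page() and log_range() are hypothetical names, the sketch assumes
the intent is to advance one page index per iteration, and it rejects
any range whose last byte would land outside the bitmap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAGE_SIZE 4096	/* same value as VHOST_LOG_PAGE */

/* Set the bit for one page: byte page / 8, bit page % 8. */
static void
log_page(uint8_t *log_base, uint64_t page)
{
	assert(log_base != NULL);
	log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapping [addr, addr + len) as dirty.
 * Returns the number of pages marked, or 0 if nothing was logged
 * (empty range, no bitmap, or range beyond log_size bytes of bitmap).
 */
static uint64_t
log_range(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t first, last, page;

	if (log_base == NULL || len == 0)
		return 0;

	first = addr / LOG_PAGE_SIZE;
	last = (addr + len - 1) / LOG_PAGE_SIZE;
	if (last / 8 >= log_size)
		return 0;

	for (page = first; page <= last; page++)
		log_page(log_base, page);

	return last - first + 1;
}
```

For example, a 10-byte write at address 4090 straddles pages 0 and 1,
so it sets the two lowest bits of the first bitmap byte.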

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
 lib/librte_vhost/virtio-net.c     |  4 ++++
 2 files changed, 38 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 416dac2..191c1be 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
 /* Backend value set by guest. */
 #define VIRTIO_DEV_STOPPED -1
 
+#define VHOST_LOG_PAGE	4096
+
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -82,6 +85,7 @@ struct vhost_virtqueue {
 	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
 	struct vring_avail	*avail;			/**< Virtqueue available ring. */
 	struct vring_used	*used;			/**< Virtqueue used ring. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
 	uint32_t		size;			/**< Size of descriptor ring. */
 	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
 	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
@@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	/* TODO: to make it atomic? */
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		uint64_t offset, uint64_t len)
+{
+	uint64_t addr = vq->log_guest_addr;
+	uint64_t page;
+
+	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		     !dev->log_base || !len))
+		return;
+
+	addr += offset;
+	if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page(dev->log_base, page);
+		page += VHOST_LOG_PAGE;
+	}
+}
+
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 8364938..4481827 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
+			dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0


* [PATCH 3/4] vhost: log vring changes
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
  2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-02  3:43 ` [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02  3:43 ` [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Invoke vhost_log_write() to mark the corresponding page as dirty when
updating the used vring.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9322ce6..d4805d8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		/*
+		 * to defer the update to when updating used->idx,
+		 * and batch them?
+		 */
+		vhost_log_write(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_write(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_write(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_write(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_write(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
-- 
1.9.0


* [PATCH 4/4] vhost: enable log_shmfd protocol feature
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (2 preceding siblings ...)
  2015-12-02  3:43 ` [PATCH 3/4] vhost: log vring changes Yuanhan Liu
@ 2015-12-02  3:43 ` Yuanhan Liu
  2015-12-02 14:10 ` [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02  3:43 UTC (permalink / raw)
  To: dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Set this flag to claim that we support vhost-user live migration:
the SET_LOG_BASE request will be sent only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL,
which, however, was enabled a long time ago.
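
The relationship between the two flags can be sketched as below. This
is only an illustration: feature_bit() and dirty_logging_negotiated()
are hypothetical helpers, and the bit numbers are redefined locally
(1 for the log_shmfd protocol feature as in this patch, 26 for
VHOST_F_LOG_ALL as in <linux/vhost.h>) so the snippet builds on its own.

```c
#include <assert.h>
#include <stdint.h>

#define PROTOCOL_F_LOG_SHMFD	1	/* VHOST_USER_PROTOCOL_F_LOG_SHMFD */
#define F_LOG_ALL		26	/* VHOST_F_LOG_ALL */

/* Turn a feature bit number into its mask. */
static uint64_t
feature_bit(unsigned int nr)
{
	assert(nr < 64);	/* feature words are 64 bits wide */
	return 1ULL << nr;
}

/*
 * Dirty page logging needs both: the protocol feature (so that the
 * frontend sends SET_LOG_BASE) and the virtio feature VHOST_F_LOG_ALL
 * (so that writes actually get logged). Returns 1 if both are set.
 */
static int
dirty_logging_negotiated(uint64_t features, uint64_t protocol_features)
{
	return (protocol_features & feature_bit(PROTOCOL_F_LOG_SHMFD)) != 0 &&
	       (features & feature_bit(F_LOG_ALL)) != 0;
}
```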

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
-- 
1.9.0


* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02  3:43 ` [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02 14:39     ` Yuanhan Liu
  2015-12-09  3:33     ` Xie, Huawei
  0 siblings, 2 replies; 98+ messages in thread
From: Victor Kaplansky @ 2015-12-02 13:53 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> log is presented by 1 bit.
> 
> Therefore, vhost_log_write() simply finds the right bit for related
> page we are gonna change, and set it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
> 
> The page address is biased by log_guest_addr, which is derived from
> SET_VRING_ADDR request as part of the vring related addresses.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
>  lib/librte_vhost/virtio-net.c     |  4 ++++
>  2 files changed, 38 insertions(+)
> 
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 416dac2..191c1be 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -40,6 +40,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <linux/vhost.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/virtio_net.h>
>  #include <sys/eventfd.h>
> @@ -59,6 +60,8 @@ struct rte_mbuf;
>  /* Backend value set by guest. */
>  #define VIRTIO_DEV_STOPPED -1
>  
> +#define VHOST_LOG_PAGE	4096
> +
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> @@ -82,6 +85,7 @@ struct vhost_virtqueue {
>  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
>  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
>  	struct vring_used	*used;			/**< Virtqueue used ring. */
> +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
>  	uint32_t		size;			/**< Size of descriptor ring. */
>  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
>  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
> @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>  	return vhost_va;
>  }
>  
> +static inline void __attribute__((always_inline))
> +vhost_log_page(uint8_t *log_base, uint64_t page)
> +{
> +	/* TODO: to make it atomic? */
> +	log_base[page / 8] |= 1 << (page % 8);

I think the atomic OR operation is necessary only if there can be
more than one vhost-user back-end updating the guest's memory
simultaneously. However, it is probably pretty safe to perform a
regular OR operation, since rings are not shared between
back-ends. What about buffers pointed to by descriptors? To be on
the safe side, I would use the GCC built-in function
__sync_fetch_and_or().

> +}
> +
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		uint64_t offset, uint64_t len)
> +{
> +	uint64_t addr = vq->log_guest_addr;
> +	uint64_t page;
> +
> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		     !dev->log_base || !len))
> +		return;

Isn't "likely" more appropriate in the above, since the whole
expression is expected to be true most of the time?

> +
> +	addr += offset;
> +	if (dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8))
> +		return;
> +
> +	page = addr / VHOST_LOG_PAGE;
> +	while (page * VHOST_LOG_PAGE < addr + len) {
> +		vhost_log_page(dev->log_base, page);
> +		page += VHOST_LOG_PAGE;
> +	}
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 8364938..4481827 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
>  		return -1;
>  	}
>  
> +	vq->log_guest_addr = addr->log_guest_addr;
> +
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
>  			dev->device_fh, vq->desc);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
>  			dev->device_fh, vq->avail);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
>  			dev->device_fh, vq->used);
> +	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %p\n",
> +			dev->device_fh, (void *)(uintptr_t)vq->log_guest_addr);
>  
>  	return 0;
>  }
> -- 
> 1.9.0


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-02 13:53   ` Panu Matilainen
  2015-12-02 14:31     ` Yuanhan Liu
  2015-12-06 23:07     ` Thomas Monjalon
  2015-12-08  5:57   ` Xie, Huawei
  1 sibling, 2 replies; 98+ messages in thread
From: Panu Matilainen @ 2015-12-02 13:53 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> vhost-user) where we should log dirty pages, and how big the log
> buffer is.
>
> This request introduces a new payload:
>
> 	typedef struct VhostUserLog {
> 		uint64_t mmap_size;
> 		uint64_t mmap_offset;
> 	} VhostUserLog;
>
> Also, a fd is delivered from QEMU by ancillary data.
>
> With those info given, an area of memory is mmaped, assigned
> to dev->log_base, for logging dirty pages.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>   lib/librte_vhost/rte_virtio_net.h             |  2 ++
>   lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
>   lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>   lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
>   lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>   5 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 5687452..416dac2 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -127,6 +127,8 @@ struct virtio_net {
>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> +	uint64_t		log_size;	/**< Size of log area */
> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>   	void			*priv;		/**< private context */
>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>   } __rte_cache_aligned;

This (and other changes in patch 2) breaks the librte_vhost ABI again, so
you'd need to at least add a deprecation note to 2.2 to be able to do it
in 2.3 at all according to the ABI policy.

Perhaps a better option would be adding some padding to the structs now 
for 2.2 since the vhost ABI is broken there anyway. That would at least 
give a chance to keep it compatible from 2.2 to 2.3.

	- Panu -


* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02  3:43 ` [PATCH 3/4] vhost: log vring changes Yuanhan Liu
@ 2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02 14:38     ` Yuanhan Liu
  2015-12-09  2:45     ` Xie, Huawei
  0 siblings, 2 replies; 98+ messages in thread
From: Victor Kaplansky @ 2015-12-02 14:07 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> Invoking vhost_log_write() to mark corresponding page as dirty while
> updating used vring.

Looks good, thanks!

I didn't find where you log the pages dirtied by data written
to the buffers pointed to by the descriptors in the RX vring.
AFAIU, the buffers of the RX queue reside in the guest's memory and
have to be marked as dirty if they are written. What do you say?

-- Victor

> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/vhost_rxtx.c | 74 +++++++++++++++++++++++++++++--------------
>  1 file changed, 50 insertions(+), 24 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 9322ce6..d4805d8 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -129,6 +129,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		uint32_t offset = 0, vb_offset = 0;
>  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
>  		uint8_t hdr = 0, uncompleted_pkt = 0;
> +		uint16_t idx;
>  
>  		/* Get descriptor from available ring */
>  		desc = &vq->desc[head[packet_success]];
> @@ -200,16 +201,22 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		}
>  
>  		/* Update used ring with desc information */
> -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> -							head[packet_success];
> +		idx = res_cur_idx & (vq->size - 1);
> +		vq->used->ring[idx].id = head[packet_success];
>  
>  		/* Drop the packet if it is uncompleted */
>  		if (unlikely(uncompleted_pkt == 1))
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							vq->vhost_hlen;
> +			vq->used->ring[idx].len = vq->vhost_hlen;
>  		else
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							pkt_len + vq->vhost_hlen;
> +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> +
> +		/*
> +		 * to defer the update to when updating used->idx,
> +		 * and batch them?
> +		 */
> +		vhost_log_write(dev, vq,
> +			offsetof(struct vring_used, ring[idx]),
> +			sizeof(vq->used->ring[idx]));
>  
>  		res_cur_idx++;
>  		packet_success++;
> @@ -236,6 +243,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  
>  	*(volatile uint16_t *)&vq->used->idx += count;
>  	vq->last_used_idx = res_end_idx;
> +	vhost_log_write(dev, vq,
> +		offsetof(struct vring_used, idx),
> +		sizeof(vq->used->idx));
>  
>  	/* flush used->idx update before we read avail->flags. */
>  	rte_mb();
> @@ -265,6 +275,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	uint32_t seg_avail;
>  	uint32_t vb_avail;
>  	uint32_t cpy_len, entry_len;
> +	uint16_t idx;
>  
>  	if (pkt == NULL)
>  		return 0;
> @@ -302,16 +313,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	entry_len = vq->vhost_hlen;
>  
>  	if (vb_avail == 0) {
> -		uint32_t desc_idx =
> -			vq->buf_vec[vec_idx].desc_idx;
> +		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
> +
> +		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
> +			idx = cur_idx & (vq->size - 1);
>  
> -		if ((vq->desc[desc_idx].flags
> -			& VRING_DESC_F_NEXT) == 0) {
>  			/* Update used ring with desc information */
> -			vq->used->ring[cur_idx & (vq->size - 1)].id
> -				= vq->buf_vec[vec_idx].desc_idx;
> -			vq->used->ring[cur_idx & (vq->size - 1)].len
> -				= entry_len;
> +			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
> +			vq->used->ring[idx].len = entry_len;
> +
> +			vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  
>  			entry_len = 0;
>  			cur_idx++;
> @@ -354,10 +367,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
>  				VRING_DESC_F_NEXT) == 0) {
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_len = 0;
>  				cur_idx++;
>  				entry_success++;
> @@ -390,16 +406,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  
>  					if ((vq->desc[desc_idx].flags &
>  						VRING_DESC_F_NEXT) == 0) {
> -						uint16_t wrapped_idx =
> -							cur_idx & (vq->size - 1);
> +						idx = cur_idx & (vq->size - 1);
>  						/*
>  						 * Update used ring with the
>  						 * descriptor information
>  						 */
> -						vq->used->ring[wrapped_idx].id
> +						vq->used->ring[idx].id
>  							= desc_idx;
> -						vq->used->ring[wrapped_idx].len
> +						vq->used->ring[idx].len
>  							= entry_len;
> +						vhost_log_write(dev, vq,
> +							offsetof(struct vring_used, ring[idx]),
> +							sizeof(vq->used->ring[idx]));
>  						entry_success++;
>  						entry_len = 0;
>  						cur_idx++;
> @@ -422,10 +440,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  				 * This whole packet completes.
>  				 */
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_write(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_success++;
>  				break;
>  			}
> @@ -658,6 +679,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  		/* Update used index buffer information. */
>  		vq->used->ring[used_idx].id = head[entry_success];
>  		vq->used->ring[used_idx].len = 0;
> +		vhost_log_write(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  
>  		/* Allocate an mbuf and populate the structure. */
>  		m = rte_pktmbuf_alloc(mbuf_pool);
> @@ -778,6 +802,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  
>  	rte_compiler_barrier();
>  	vq->used->idx += entry_success;
> +	vhost_log_write(dev, vq, offsetof(struct vring_used, idx),
> +			sizeof(vq->used->idx));
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> -- 
> 1.9.0

* Re: [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (3 preceding siblings ...)
  2015-12-02  3:43 ` [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2015-12-02 14:10 ` Victor Kaplansky
  2015-12-02 14:33   ` Yuanhan Liu
  2015-12-09  3:41 ` Xie, Huawei
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
  6 siblings, 1 reply; 98+ messages in thread
From: Victor Kaplansky @ 2015-12-02 14:10 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 11:43:09AM +0800, Yuanhan Liu wrote:
> This patch set adds the initial vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration. So, this patch is basically about adding vhost
> log support, and using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 sets log_shmfd protocol feature bit, which actually
>   enables the vhost-user live migration support.
> 
> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK. And here is guide
> to setup OVS + DPDK:
> 
>     http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444 
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Note: this patch set has mostly been based on Victor Kaplansky's demo
> work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
> as the co-author. Victor, what do you think of that? :)

Thanks for adding me to credits list!
-- Victor

> 
> Comments are welcome!
> 
> ---
> Yuanhan Liu (4):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log vring changes
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 35 ++++++++++++++
>  lib/librte_vhost/vhost_rxtx.c                 | 70 ++++++++++++++++++---------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 +++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  4 ++
>  7 files changed, 145 insertions(+), 26 deletions(-)
> 
> -- 
> 1.9.0

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 13:53   ` Panu Matilainen
@ 2015-12-02 14:31     ` Yuanhan Liu
  2015-12-02 14:48       ` Panu Matilainen
  2015-12-02 16:38       ` Thomas Monjalon
  2015-12-06 23:07     ` Thomas Monjalon
  1 sibling, 2 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:31 UTC (permalink / raw)
  To: Panu Matilainen, Thomas Monjalon
  Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 03:53:45PM +0200, Panu Matilainen wrote:
> On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
> >VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> >vhost-user) where we should log dirty pages, and how big the log
> >buffer is.
> >
> >This request introduces a new payload:
> >
> >	typedef struct VhostUserLog {
> >		uint64_t mmap_size;
> >		uint64_t mmap_offset;
> >	} VhostUserLog;
> >
> >Also, a fd is delivered from QEMU by ancillary data.
> >
> >With those info given, an area of memory is mmaped, assigned
> >to dev->log_base, for logging dirty pages.
> >
> >Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >---
> >  lib/librte_vhost/rte_virtio_net.h             |  2 ++
> >  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
> >  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
> >  lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
> >  lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
> >  5 files changed, 59 insertions(+), 1 deletion(-)
> >
> >diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >index 5687452..416dac2 100644
> >--- a/lib/librte_vhost/rte_virtio_net.h
> >+++ b/lib/librte_vhost/rte_virtio_net.h
> >@@ -127,6 +127,8 @@ struct virtio_net {
> >  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >+	uint64_t		log_size;	/**< Size of log area */
> >+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >  	void			*priv;		/**< private context */
> >  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >  } __rte_cache_aligned;
> 
> This (and other changes in patch 2 breaks the librte_vhost ABI
> again, so you'd need to at least add a deprecation note to 2.2 to be
> able to do it in 2.3 at all according to the ABI policy.

I was thinking that adding a new field (instead of renaming it or
removing it) isn't an ABI break. So, I was wrong?

> 
> Perhaps a better option would be adding some padding to the structs
> now for 2.2 since the vhost ABI is broken there anyway. That would
> at least give a chance to keep it compatible from 2.2 to 2.3.

It will not be compatible, unless we add exact same fields (not
something like uint8_t pad[xx]). Otherwise, the pad field renaming
is also an ABI break, right?

Thomas, should I write an ABI deprecation note? Can it still make the
v2.2 release if I send one tomorrow? (Sorry, I wasn't aware that this
would be an ABI break.)

	--yliu

* Re: [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02 14:10 ` [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
@ 2015-12-02 14:33   ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:33 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:10:56PM +0200, Victor Kaplansky wrote:
...
> > Note: this patch set has mostly been based on Victor Kaplansky's demo
> > work (vhost-user-bridge) at QEMU project. I was thinking to add Victor
> > as the co-author. Victor, what do you think of that? :)
> 
> Thanks for adding me to credits list!

Great, I will add your Signed-off-by starting from v2. Would that be okay with you?

	--yliu

* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:07   ` Victor Kaplansky
@ 2015-12-02 14:38     ` Yuanhan Liu
  2015-12-02 15:58       ` Victor Kaplansky
  2015-12-09  2:45     ` Xie, Huawei
  1 sibling, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:38 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > Invoking vhost_log_write() to mark corresponding page as dirty while
> > updating used vring.
> 
> Looks good, thanks!
> 
> I didn't find where you log the dirty pages in result of data
> written to the buffers pointed by the descriptors in RX vring.
> AFAIU, the buffers of RX queue reside in guest's memory and have
> to be marked as dirty if they are written. What do you say?

Yeah, we should. I have a question then: why is log_guest_addr set
to the physical address of the used vring in the guest? I mean,
apparently we need to log more changes than just the used vring.

	--yliu

* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02 13:53   ` Victor Kaplansky
@ 2015-12-02 14:39     ` Yuanhan Liu
  2015-12-09  3:33     ` Xie, Huawei
  1 sibling, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02 14:39 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 03:53:01PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> > 
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> > 
> > The page address is biased by log_guest_addr, which is derived from
> > SET_VRING_ADDR request as part of the vring related addresses.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > ---
> >  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
> >  lib/librte_vhost/virtio-net.c     |  4 ++++
> >  2 files changed, 38 insertions(+)
> > 
> > diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> > index 416dac2..191c1be 100644
> > --- a/lib/librte_vhost/rte_virtio_net.h
> > +++ b/lib/librte_vhost/rte_virtio_net.h
> > @@ -40,6 +40,7 @@
> >   */
> >  
> >  #include <stdint.h>
> > +#include <linux/vhost.h>
> >  #include <linux/virtio_ring.h>
> >  #include <linux/virtio_net.h>
> >  #include <sys/eventfd.h>
> > @@ -59,6 +60,8 @@ struct rte_mbuf;
> >  /* Backend value set by guest. */
> >  #define VIRTIO_DEV_STOPPED -1
> >  
> > +#define VHOST_LOG_PAGE	4096
> > +
> >  
> >  /* Enum for virtqueue management. */
> >  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > @@ -82,6 +85,7 @@ struct vhost_virtqueue {
> >  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
> >  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
> >  	struct vring_used	*used;			/**< Virtqueue used ring. */
> > +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
> >  	uint32_t		size;			/**< Size of descriptor ring. */
> >  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
> >  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
> > @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> >  	return vhost_va;
> >  }
> >  
> > +static inline void __attribute__((always_inline))
> > +vhost_log_page(uint8_t *log_base, uint64_t page)
> > +{
> > +	/* TODO: to make it atomic? */
> > +	log_base[page / 8] |= 1 << (page % 8);
> 
> I think the atomic OR operation is necessary only if there can be
> more than one vhost-user back-end updating the guest's memory
> simultaneously. However probably it is pretty safe to perform
> regular OR operation, since rings are not shared between
> back-end. What about buffers pointed by descriptors?  To be on
> the safe side, I would use a GCC built-in function
> __sync_fetch_and_or(). 

The build has to pass not only with gcc, but with icc and clang as
well.

> 
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +		uint64_t offset, uint64_t len)
> > +{
> > +	uint64_t addr = vq->log_guest_addr;
> > +	uint64_t page;
> > +
> > +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		     !dev->log_base || !len))
> > +		return;
> 
> Isn't "likely" more appropriate in above, since the whole
> expression is expected to be true most of the time?

Sorry, it's a typo, and thanks for catching it.

	--yliu
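
For reference, the bitmap logging discussed above could be sketched as follows. This is only an illustration, not the code from the patch: log_page_atomic(), log_range() and LOG_PAGE_SIZE are made-up names, and it assumes the compiler provides the GCC-style __sync_fetch_and_or() builtin that Victor mentions (whether every compiler DPDK supports does so would need checking).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096ULL	/* same value as VHOST_LOG_PAGE in the patch */

/* Set the bit of one page; atomic, so concurrent writers cannot lose a bit. */
static inline void
log_page_atomic(uint8_t *log_base, uint64_t page)
{
	__sync_fetch_and_or(&log_base[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page overlapping [addr, addr + len) as dirty. */
static inline void
log_range(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, last;

	if (log_base == NULL || len == 0)
		return;

	page = addr / LOG_PAGE_SIZE;
	last = (addr + len - 1) / LOG_PAGE_SIZE;

	/* The caller must not log past the end of the bitmap. */
	assert(last / 8 < log_size);

	for (; page <= last; page++)
		log_page_atomic(log_base, page);
}
```

A write that straddles a page boundary sets two bits, which is why the loop runs from the first to the last page touched rather than logging only the start address.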

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:31     ` Yuanhan Liu
@ 2015-12-02 14:48       ` Panu Matilainen
  2015-12-02 15:09         ` Yuanhan Liu
  2015-12-02 16:38       ` Thomas Monjalon
  1 sibling, 1 reply; 98+ messages in thread
From: Panu Matilainen @ 2015-12-02 14:48 UTC (permalink / raw)
  To: Yuanhan Liu, Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 04:31 PM, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 03:53:45PM +0200, Panu Matilainen wrote:
>> On 12/02/2015 05:43 AM, Yuanhan Liu wrote:
>>> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
>>> vhost-user) where we should log dirty pages, and how big the log
>>> buffer is.
>>>
>>> This request introduces a new payload:
>>>
>>> 	typedef struct VhostUserLog {
>>> 		uint64_t mmap_size;
>>> 		uint64_t mmap_offset;
>>> 	} VhostUserLog;
>>>
>>> Also, a fd is delivered from QEMU by ancillary data.
>>>
>>> With those info given, an area of memory is mmaped, assigned
>>> to dev->log_base, for logging dirty pages.
>>>
>>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>> ---
>>>   lib/librte_vhost/rte_virtio_net.h             |  2 ++
>>>   lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++++-
>>>   lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>>>   lib/librte_vhost/vhost_user/virtio-net-user.c | 44 +++++++++++++++++++++++++++
>>>   lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>>>   5 files changed, 59 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>> index 5687452..416dac2 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -127,6 +127,8 @@ struct virtio_net {
>>>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>>>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>>>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>>> +	uint64_t		log_size;	/**< Size of log area */
>>> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>>>   	void			*priv;		/**< private context */
>>>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>>>   } __rte_cache_aligned;
>>
>> This (and other changes in patch 2 breaks the librte_vhost ABI
>> again, so you'd need to at least add a deprecation note to 2.2 to be
>> able to do it in 2.3 at all according to the ABI policy.
>
> I was thinking that adding a new field (instead of renaming it or
> removing it) isn't an ABI break. So, I was wrong?

Adding or removing a field in the middle of a public struct is always an 
ABI break. Adding to the end often is too, but not always.
Renaming a field is an API break but not an ABI break - the compiler 
cares but the cpu does not.

>>
>> Perhaps a better option would be adding some padding to the structs
>> now for 2.2 since the vhost ABI is broken there anyway. That would
>> at least give a chance to keep it compatible from 2.2 to 2.3.
>
> It will not be compatible, unless we add exact same fields (not
> something like uint8_t pad[xx]). Otherwise, the pad field renaming
> is also an ABI break, right?

There's no ABI (or API) break in changing reserved unused fields to 
something else, as long as care is taken with sizes and alignment. In 
any case padding is best added to the end of a struct to minimize risks 
and keep things simple.
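
To illustrate the difference between the two cases, here is a small compile-time check. The structs are simplified, hypothetical stand-ins (not the real struct virtio_net); only the pattern matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout an application was compiled against (hypothetical, simplified). */
struct dev_v1 {
	uint32_t  virt_qp_nb;
	void     *priv;
};

/* Same layout with one field renamed: an API change only. */
struct dev_v1_renamed {
	uint32_t  nr_queue_pairs;
	void     *priv;
};

/* Two fields inserted in the middle: an ABI change. */
struct dev_v2 {
	uint32_t  virt_qp_nb;
	uint64_t  log_size;
	uint8_t  *log_base;
	void     *priv;
};

/* Renaming leaves offsets and size untouched ... */
static_assert(offsetof(struct dev_v1, priv) ==
	      offsetof(struct dev_v1_renamed, priv), "rename moved a field");
static_assert(sizeof(struct dev_v1) == sizeof(struct dev_v1_renamed),
	      "rename changed the size");

/* ... while inserting fields moves everything that follows them. */
static_assert(offsetof(struct dev_v2, priv) > offsetof(struct dev_v1, priv),
	      "insertion did not move priv");
```

An application built against dev_v1 would keep reading priv at the old offset, which in dev_v2 now holds part of the log fields.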

	- Panu -

>
> Thomas, should I write an ABI deprecation note? Can I make it for
> v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> of that it would be an ABI break).
>
> 	--yliu
>

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:48       ` Panu Matilainen
@ 2015-12-02 15:09         ` Yuanhan Liu
  2015-12-02 16:58           ` Panu Matilainen
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-02 15:09 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
...
> >>>diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >>>index 5687452..416dac2 100644
> >>>--- a/lib/librte_vhost/rte_virtio_net.h
> >>>+++ b/lib/librte_vhost/rte_virtio_net.h
> >>>@@ -127,6 +127,8 @@ struct virtio_net {
> >>>  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >>>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >>>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >>>+	uint64_t		log_size;	/**< Size of log area */
> >>>+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >>>  	void			*priv;		/**< private context */
> >>>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >>>  } __rte_cache_aligned;
> >>
> >>This (and other changes in patch 2 breaks the librte_vhost ABI
> >>again, so you'd need to at least add a deprecation note to 2.2 to be
> >>able to do it in 2.3 at all according to the ABI policy.
> >
> >I was thinking that adding a new field (instead of renaming it or
> >removing it) isn't an ABI break. So, I was wrong?
> 
> Adding or removing a field in the middle of a public struct is
> always an ABI break. Adding to the end often is too, but not always.
> Renaming a field is an API break but not an ABI break - the compiler
> cares but the cpu does not.

Good to know. Thanks.

> 
> >>
> >>Perhaps a better option would be adding some padding to the structs
> >>now for 2.2 since the vhost ABI is broken there anyway. That would
> >>at least give a chance to keep it compatible from 2.2 to 2.3.
> >
> >It will not be compatible, unless we add exact same fields (not
> >something like uint8_t pad[xx]). Otherwise, the pad field renaming
> >is also an ABI break, right?
> 
> There's no ABI (or API) break in changing reserved unused fields to
> something else, as long as care is taken with sizes and alignment.

as long as we don't reference the reserved unused fields?

> In any case padding is best added to the end of a struct to minimize
> risks and keep things simple.

The thing is, isn't it a bit awful to (always) add padding to
the end of a struct, especially when you don't know how much
padding will be needed?

	--yliu

* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:38     ` Yuanhan Liu
@ 2015-12-02 15:58       ` Victor Kaplansky
  2015-12-02 16:26         ` Michael S. Tsirkin
  0 siblings, 1 reply; 98+ messages in thread
From: Victor Kaplansky @ 2015-12-02 15:58 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > updating used vring.
> > 
> > Looks good, thanks!
> > 
> > I didn't find where you log the dirty pages in result of data
> > written to the buffers pointed by the descriptors in RX vring.
> > AFAIU, the buffers of RX queue reside in guest's memory and have
> > to be marked as dirty if they are written. What do you say?
> 
> Yeah, we should. I got a question then: why log_guest_addr is set
> to the physical address of used vring in guest? I mean, apparently,
> we need log more changes other than used vring only.

The physical address of the used vring is sent to the back-end because
otherwise the back-end would have to perform a virtual-to-physical
translation, and we want to avoid that. The dirty buffers have to
be marked as well, but their guest physical addresses are known
directly from the descriptors.
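
A rough sketch of what logging an RX buffer could look like is below. The names demo_desc, log_gpa_range() and log_rx_buffer() are hypothetical, and demo_desc only mimics the two descriptor fields needed here; this is not code from the patch set.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for a vring descriptor; only the fields used here. */
struct demo_desc {
	uint64_t addr;	/* guest-physical address of the buffer */
	uint32_t len;	/* buffer length in bytes */
};

/* Set one bit per 4 KiB page overlapping [gpa, gpa + len). */
static void
log_gpa_range(uint8_t *log_base, uint64_t gpa, uint64_t len)
{
	uint64_t page;

	if (len == 0)
		return;
	for (page = gpa / LOG_PAGE_SIZE;
	     page <= (gpa + len - 1) / LOG_PAGE_SIZE; page++)
		log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Log the part of an RX buffer that the back-end actually wrote. */
static void
log_rx_buffer(uint8_t *log_base, const struct demo_desc *desc, uint32_t written)
{
	assert(written <= desc->len);
	/* No translation needed: desc->addr is already guest-physical. */
	log_gpa_range(log_base, desc->addr, written);
}
```

Only the bytes actually written are logged, so a short packet in a large buffer dirties as few pages as possible.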

> 
> 	--yliu

* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02 15:58       ` Victor Kaplansky
@ 2015-12-02 16:26         ` Michael S. Tsirkin
  2015-12-03  2:31           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 16:26 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: dev

On Wed, Dec 02, 2015 at 05:58:24PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> > On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > > updating used vring.
> > > 
> > > Looks good, thanks!
> > > 
> > > I didn't find where you log the dirty pages in result of data
> > > written to the buffers pointed by the descriptors in RX vring.
> > > AFAIU, the buffers of RX queue reside in guest's memory and have
> > > to be marked as dirty if they are written. What do you say?
> > 
> > Yeah, we should. I got a question then: why log_guest_addr is set
> > to the physical address of used vring in guest? I mean, apparently,
> > we need log more changes other than used vring only.
> 
> The physical address of used vring sent to the back-end, since
> otherwise back-end has to perform virtual to physical
> translation, and we want to avoid this. The dirty buffers has to
> be marked as well, but their guest's physical address is known
> directly from the descriptors.

Yes, people wanted to be able to map multiple physical
addresses to one virtual address, so you do not want to translate
virt to phys.
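
For the used ring this means the address to log is simply log_guest_addr plus the offset of the field that was written. A minimal sketch, using local mirrors of the used-ring layout (demo_used and demo_used_elem are hypothetical names, as are the two helpers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the virtio used-ring layout (names are hypothetical). */
struct demo_used_elem {
	uint32_t id;
	uint32_t len;
};

struct demo_used {
	uint16_t flags;
	uint16_t idx;
	struct demo_used_elem ring[];
};

/* Guest-physical address of used->idx, given the GPA of the used ring. */
static uint64_t
used_idx_gpa(uint64_t log_guest_addr)
{
	return log_guest_addr + offsetof(struct demo_used, idx);
}

/* Guest-physical address of used->ring[slot]. */
static uint64_t
used_elem_gpa(uint64_t log_guest_addr, uint16_t slot)
{
	return log_guest_addr + offsetof(struct demo_used, ring) +
	       (uint64_t)slot * sizeof(struct demo_used_elem);
}
```

The result can be passed straight to the dirty-page logger; no host virtual address is involved at any point.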

> > 
> > 	--yliu

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 14:31     ` Yuanhan Liu
  2015-12-02 14:48       ` Panu Matilainen
@ 2015-12-02 16:38       ` Thomas Monjalon
  2015-12-03  1:49         ` Yuanhan Liu
  1 sibling, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-02 16:38 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-02 22:31, Yuanhan Liu:
> Thomas, should I write an ABI deprecation note? Can I make it for
> v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> of that it would be an ABI break).

As Panu suggested, it would be better to reserve some room now
in 2.2 which already breaks vhost ABI.
Maybe we have a chance to keep the same vhost ABI in 2.3.

The 2.2 release will probably be closed in less than 2 weeks.

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 15:09         ` Yuanhan Liu
@ 2015-12-02 16:58           ` Panu Matilainen
  2015-12-02 17:24             ` Michael S. Tsirkin
  0 siblings, 1 reply; 98+ messages in thread
From: Panu Matilainen @ 2015-12-02 16:58 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/02/2015 05:09 PM, Yuanhan Liu wrote:
> On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
> ...
>>>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>>>> index 5687452..416dac2 100644
>>>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>>>> @@ -127,6 +127,8 @@ struct virtio_net {
>>>>>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>>>>>   	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>>>>>   	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>>>>> +	uint64_t		log_size;	/**< Size of log area */
>>>>> +	uint8_t			*log_base;	/**< Where dirty pages are logged */
>>>>>   	void			*priv;		/**< private context */
>>>>>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>>>>>   } __rte_cache_aligned;
>>>>
>>>> This (and other changes in patch 2 breaks the librte_vhost ABI
>>>> again, so you'd need to at least add a deprecation note to 2.2 to be
>>>> able to do it in 2.3 at all according to the ABI policy.
>>>
>>> I was thinking that adding a new field (instead of renaming it or
>>> removing it) isn't an ABI break. So, I was wrong?
>>
>> Adding or removing a field in the middle of a public struct is
>> always an ABI break. Adding to the end often is too, but not always.
>> Renaming a field is an API break but not an ABI break - the compiler
>> cares but the cpu does not.
>
> Good to know. Thanks.
>
>>
>>>>
>>>> Perhaps a better option would be adding some padding to the structs
>>>> now for 2.2 since the vhost ABI is broken there anyway. That would
>>>> at least give a chance to keep it compatible from 2.2 to 2.3.
>>>
>>> It will not be compatible, unless we add exact same fields (not
>>> something like uint8_t pad[xx]). Otherwise, the pad field renaming
>>> is also an ABI break, right?
>>
>> There's no ABI (or API) break in changing reserved unused fields to
>> something else, as long as care is taken with sizes and alignment.
>
> as long as we don't reference the reserved unused fields?

That would be the definition of an unused field I think :)
Call it "reserved" if you want, it doesn't really matter as long as it's 
clear it's something you shouldn't be using.

>
>> In any case padding is best added to the end of a struct to minimize
>> risks and keep things simple.
>
> The thing is that isn't it a bit aweful to (always) add pads to
> the end of a struct, especially when you don't know how many
> need to be padded?

Then you pad for what you think you need, plus a bit extra, and maybe 
some more for others who might want to extend it. What is a reasonable 
amount needs deciding case by case - if a struct is allocated in the 
millions then be (very) conservative, but if there are one or 50 such 
structs within an app lifetime then who cares if it's a bit larger?

And yeah padding may be annoying, but that's pretty much the only option 
in a project where most of the structs are out in the open.
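
As a sketch of that approach (hypothetical, simplified structs, not the real vhost ones): reserve a few slots at the end in one release, then turn some of them into real fields in the next, and let the compiler confirm that size and offsets did not move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical earlier layout: spare room reserved at the end. */
struct dev_padded {
	uint32_t  virt_qp_nb;
	void     *priv;
	uint64_t  reserved[8];	/* applications must not touch these */
};

/* Hypothetical later layout: two reserved slots became real fields. */
struct dev_extended {
	uint32_t  virt_qp_nb;
	void     *priv;
	uint64_t  log_size;
	uint64_t  log_base;	/* address kept in a 64-bit slot to match the padding */
	uint64_t  reserved[6];
};

/* Existing fields keep their offsets and the overall size is unchanged. */
static_assert(sizeof(struct dev_padded) == sizeof(struct dev_extended),
	      "struct size changed");
static_assert(offsetof(struct dev_padded, priv) ==
	      offsetof(struct dev_extended, priv), "priv moved");
static_assert(offsetof(struct dev_padded, reserved) ==
	      offsetof(struct dev_extended, log_size),
	      "new field does not start where the padding did");
```

Storing log_base as a uint64_t rather than a pointer is a choice made for this sketch so that each new field consumes exactly one reserved slot on both 32-bit and 64-bit builds.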

	- Panu -

>
> 	--yliu
>

* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 16:58           ` Panu Matilainen
@ 2015-12-02 17:24             ` Michael S. Tsirkin
  0 siblings, 0 replies; 98+ messages in thread
From: Michael S. Tsirkin @ 2015-12-02 17:24 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky

On Wed, Dec 02, 2015 at 06:58:03PM +0200, Panu Matilainen wrote:
> On 12/02/2015 05:09 PM, Yuanhan Liu wrote:
> >On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
> >...
> >>>>>diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> >>>>>index 5687452..416dac2 100644
> >>>>>--- a/lib/librte_vhost/rte_virtio_net.h
> >>>>>+++ b/lib/librte_vhost/rte_virtio_net.h
> >>>>>@@ -127,6 +127,8 @@ struct virtio_net {
> >>>>>  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> >>>>>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
> >>>>>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
> >>>>>+	uint64_t		log_size;	/**< Size of log area */
> >>>>>+	uint8_t			*log_base;	/**< Where dirty pages are logged */
> >>>>>  	void			*priv;		/**< private context */
> >>>>>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
> >>>>>  } __rte_cache_aligned;
> >>>>
> >>>>This (and other changes in patch 2 breaks the librte_vhost ABI
> >>>>again, so you'd need to at least add a deprecation note to 2.2 to be
> >>>>able to do it in 2.3 at all according to the ABI policy.
> >>>
> >>>I was thinking that adding a new field (instead of renaming it or
> >>>removing it) isn't an ABI break. So, I was wrong?
> >>
> >>Adding or removing a field in the middle of a public struct is
> >>always an ABI break. Adding to the end often is too, but not always.
> >>Renaming a field is an API break but not an ABI break - the compiler
> >>cares but the cpu does not.
> >
> >Good to know. Thanks.
> >
> >>
> >>>>
> >>>>Perhaps a better option would be adding some padding to the structs
> >>>>now for 2.2 since the vhost ABI is broken there anyway. That would
> >>>>at least give a chance to keep it compatible from 2.2 to 2.3.
> >>>
> >>>It will not be compatible, unless we add exact same fields (not
> >>>something like uint8_t pad[xx]). Otherwise, the pad field renaming
> >>>is also an ABI break, right?
> >>
> >>There's no ABI (or API) break in changing reserved unused fields to
> >>something else, as long as care is taken with sizes and alignment.
> >
> >as long as we don't reference the reserved unused fields?
> 
> That would be the definition of an unused field I think :)
> Call it "reserved" if you want, it doesn't really matter as long as its
> clear its something you shouldn't be using.
> 
> >
> >>In any case padding is best added to the end of a struct to minimize
> >>risks and keep things simple.
> >
> >The thing is that isn't it a bit aweful to (always) add pads to
> >the end of a struct, especially when you don't know how many
> >need to be padded?
> 
> Then you pad for what you think you need, plus a bit extra, and maybe some
> more for others who might want to extend it. What is a reasonable amount
> needs deciding case by case - if a struct is alloced in the millions then be
> (very) conservative, but if there are one or 50 such structs within an app
> lifetime then who cares if its bit larger?
> 
> And yeah padding may be annoying, but that's pretty much the only option in
> a project where most of the structs are out in the open.
> 
> 	- Panu -

Function versioning is another option.
For a sufficiently widely used struct it's a lot of work, and padding
is easier.  But it might be better than breaking the ABI
if e.g. you didn't pad enough.

> >
> >	--yliu
> >


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 16:38       ` Thomas Monjalon
@ 2015-12-03  1:49         ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-03  1:49 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 02, 2015 at 05:38:10PM +0100, Thomas Monjalon wrote:
> 2015-12-02 22:31, Yuanhan Liu:
> > Thomas, should I write an ABI deprecation note? Can I make it for
> > v2.2 release If I make one tomorrow? (Sorry that I'm not awared
> > of that it would be an ABI break).
> 
> As Panu suggested, it would be better to reserve some room now
> in 2.2 which already breaks vhost ABI.
> Maybe we have a chance to keep the same vhost ABI in 2.3.


Got it. I will cook up one now.

	--yliu
> 
> The 2.2 release will probably be closed in less than 2 weeks.


* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02 16:26         ` Michael S. Tsirkin
@ 2015-12-03  2:31           ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-03  2:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: dev, Victor Kaplansky

On Wed, Dec 02, 2015 at 06:26:37PM +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 02, 2015 at 05:58:24PM +0200, Victor Kaplansky wrote:
> > On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> > > On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > > > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > > > updating used vring.
> > > > 
> > > > Looks good, thanks!
> > > > 
> > > > I didn't find where you log the dirty pages in result of data
> > > > written to the buffers pointed by the descriptors in RX vring.
> > > > AFAIU, the buffers of RX queue reside in guest's memory and have
> > > > to be marked as dirty if they are written. What do you say?
> > > 
> > > Yeah, we should. I got a question then: why log_guest_addr is set
> > > to the physical address of used vring in guest? I mean, apparently,
> > > we need log more changes other than used vring only.
> > 
> > The physical address of used vring sent to the back-end, since
> > otherwise back-end has to perform virtual to physical
> > translation, and we want to avoid this. The dirty buffers has to
> > be marked as well, but their guest's physical address is known
> > directly from the descriptors.
> 
> Yes, people wanted to be able to do multiple physical
> addresses to one virtual so you do not want to translate
> virt to phys.

Thank you both, it's clear to me then.
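
To make the logging part concrete, here is a rough sketch of marking
dirty pages by guest physical address. It assumes the log is a plain
bitmap with one bit per 4 KiB guest page; the names (demo_log,
demo_log_write, demo_log_test) are hypothetical and this is not the
code from the patch. A real backend would likely also need atomic
updates, since the log is shared with QEMU:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_LOG_PAGE 4096ULL   /* assumed logging granularity */

struct demo_log {
    uint8_t  *base;   /* start of the dirty bitmap */
    uint64_t  size;   /* bitmap size in bytes */
};

/* Mark every page touched by [gpa, gpa + len) as dirty. */
static void demo_log_write(struct demo_log *log, uint64_t gpa, uint64_t len)
{
    uint64_t page, last;

    if (log->base == NULL || len == 0)
        return;

    page = gpa / DEMO_LOG_PAGE;
    last = (gpa + len - 1) / DEMO_LOG_PAGE;
    for (; page <= last; page++) {
        if (page / 8 >= log->size)
            break;              /* outside the log area, stop */
        log->base[page / 8] |= (uint8_t)(1u << (page % 8));
    }
}

/* Return 1 if the given page number is marked dirty, 0 otherwise. */
static int demo_log_test(const struct demo_log *log, uint64_t page)
{
    return (log->base[page / 8] >> (page % 8)) & 1;
}
```

The used vring would be logged through log_guest_addr, and the RX
buffers through the guest physical addresses found in the descriptors,
so no virtual to physical translation is needed in either case.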

	--yliu


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02 13:53   ` Panu Matilainen
  2015-12-02 14:31     ` Yuanhan Liu
@ 2015-12-06 23:07     ` Thomas Monjalon
  2015-12-07  2:00       ` Yuanhan Liu
  2015-12-07  6:29       ` Panu Matilainen
  1 sibling, 2 replies; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-06 23:07 UTC (permalink / raw)
  To: Panu Matilainen, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-02 15:53, Panu Matilainen:
> This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> you'd need to at least add a deprecation note to 2.2 to be able to do it 
> in 2.3 at all according to the ABI policy.
> 
> Perhaps a better option would be adding some padding to the structs now 
> for 2.2 since the vhost ABI is broken there anyway. That would at least 
> give a chance to keep it compatible from 2.2 to 2.3.

Could you please point out where the vhost ABI is broken in 2.2?


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-06 23:07     ` Thomas Monjalon
@ 2015-12-07  2:00       ` Yuanhan Liu
  2015-12-07  2:03         ` Thomas Monjalon
  2015-12-07  6:29       ` Panu Matilainen
  1 sibling, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-07  2:00 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> 2015-12-02 15:53, Panu Matilainen:
> > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > in 2.3 at all according to the ABI policy.
> > 
> > Perhaps a better option would be adding some padding to the structs now 
> > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > give a chance to keep it compatible from 2.2 to 2.3.
> 
> Please could you point where the vhost ABI is broken in 2.2?

Thomas, here are the changes to rte_virtio_net.h:


$ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
      -- lib/librte_vhost/rte_virtio_net.h >
diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index e3a21e5..426a70d 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -89,6 +89,7 @@ struct vhost_virtqueue {
 	volatile uint16_t	last_used_idx_res;	/**< Used for multiple devices reserving buffers. */
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
+	int			enabled;
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
@@ -96,7 +97,6 @@ struct vhost_virtqueue {
  * Device structure contains all configuration information relating to the device.
  */
 struct virtio_net {
-	struct vhost_virtqueue	*virtqueue[VIRTIO_QNUM];	/**< Contains all virtqueue information. */
 	struct virtio_memory	*mem;		/**< QEMU memory and memory region information. */
 	uint64_t		features;	/**< Negotiated feature set. */
 	uint64_t		protocol_features;	/**< Negotiated protocol feature set. */
@@ -104,7 +104,9 @@ struct virtio_net {
 	uint32_t		flags;		/**< Device flags. Only used to check if device is running on data core. */
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
+	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
+	struct vhost_virtqueue	*virtqueue[VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
 /**
@@ -131,7 +133,7 @@ struct virtio_memory {
 };
 
 /**
- * Device operations to add/remove device.
+ * Device and vring operations.
  *
  * Make sure to set VIRTIO_DEV_RUNNING to the device flags in new_device and
  * remove it in destroy_device.
@@ -140,12 +142,18 @@ struct virtio_memory {
 struct virtio_net_device_ops {
 	int (*new_device)(struct virtio_net *);	/**< Add device. */
 	void (*destroy_device)(volatile struct virtio_net *);	/**< Remove device. */
+
+	int (*vring_state_changed)(struct virtio_net *dev, uint16_t queue_id, int enable);	/**< triggered when a vring is enabled or disabled */
 };
 
 static inline uint16_t __attribute__((always_inline))
 rte_vring_available_entries(struct virtio_net *dev, uint16_t queue_id)
 {
 	struct vhost_virtqueue *vq = dev->virtqueue[queue_id];
+
+	if (!vq->enabled)
+		return 0;
+
 	return *(volatile uint16_t *)&vq->avail->idx - vq->last_used_idx_res;
 }
 

	--yliu 
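
For illustration, a small sketch of why the diff above is an ABI break:
virtqueue[] moved from the start of the struct to the end and
virt_qp_nb was inserted before priv, so the other fields sit at new
offsets. The structs below are simplified stand-ins (old_net and
new_net are hypothetical names), not the real definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the layout before the change. */
struct old_net {
    void     *virtqueue[2];
    void     *mem;
    uint64_t  features;
    void     *priv;
};

/* Simplified stand-in for the layout after the change. */
struct new_net {
    void     *mem;
    uint64_t  features;
    uint32_t  virt_qp_nb;   /* new field in the middle */
    void     *priv;
    void     *virtqueue[2]; /* moved to the end */
};

/*
 * Non-zero when a binary compiled against old_net would access `field`
 * at the wrong offset inside a new_net object.
 */
#define FIELD_MOVED(field) \
    (offsetof(struct old_net, field) != offsetof(struct new_net, field))
```

An application built against the old header keeps using the old
offsets, which is why it has to be rebuilt.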


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:00       ` Yuanhan Liu
@ 2015-12-07  2:03         ` Thomas Monjalon
  2015-12-07  2:18           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-07  2:03 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 10:00, Yuanhan Liu:
> On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > 2015-12-02 15:53, Panu Matilainen:
> > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > in 2.3 at all according to the ABI policy.
> > > 
> > > Perhaps a better option would be adding some padding to the structs now 
> > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > give a chance to keep it compatible from 2.2 to 2.3.
> > 
> > Please could you point where the vhost ABI is broken in 2.2?
> 
> Thomas, here are the changes to rte_virtio_net.h:
> 
> 
> $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
>       -- lib/librte_vhost/rte_virtio_net.h >
[...]

The problem is that the changes are not noted in the release notes
and LIBABIVER is still 1.


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:03         ` Thomas Monjalon
@ 2015-12-07  2:18           ` Yuanhan Liu
  2015-12-07  2:49             ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-07  2:18 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Dec 07, 2015 at 03:03:24AM +0100, Thomas Monjalon wrote:
> 2015-12-07 10:00, Yuanhan Liu:
> > On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > > 2015-12-02 15:53, Panu Matilainen:
> > > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > > in 2.3 at all according to the ABI policy.
> > > > 
> > > > Perhaps a better option would be adding some padding to the structs now 
> > > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > > give a chance to keep it compatible from 2.2 to 2.3.
> > > 
> > > Please could you point where the vhost ABI is broken in 2.2?
> > 
> > Thomas, here are the changes to rte_virtio_net.h:
> > 
> > 
> > $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
> >       -- lib/librte_vhost/rte_virtio_net.h >
> [...]
> 
> The problem is that the changes are not noticed in the release notes
> and the LIBABIVER is still 1.

Yeah, my bad. Firstly, I was not aware that it was an ABI change. Secondly,
I joined this team in the middle of the v2.2 release, so I have
limited experience of how these things work in the DPDK community.

Anyway, it's my fault. I should have realized that the first time.
Should I send a patch now to update LIBABIVER to 2 and update the
release notes?

	--yliu


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  2:18           ` Yuanhan Liu
@ 2015-12-07  2:49             ` Thomas Monjalon
  0 siblings, 0 replies; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-07  2:49 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 10:18, Yuanhan Liu:
> On Mon, Dec 07, 2015 at 03:03:24AM +0100, Thomas Monjalon wrote:
> > 2015-12-07 10:00, Yuanhan Liu:
> > > On Mon, Dec 07, 2015 at 12:07:28AM +0100, Thomas Monjalon wrote:
> > > > 2015-12-02 15:53, Panu Matilainen:
> > > > > This (and other changes in patch 2 breaks the librte_vhost ABI again, so 
> > > > > you'd need to at least add a deprecation note to 2.2 to be able to do it 
> > > > > in 2.3 at all according to the ABI policy.
> > > > > 
> > > > > Perhaps a better option would be adding some padding to the structs now 
> > > > > for 2.2 since the vhost ABI is broken there anyway. That would at least 
> > > > > give a chance to keep it compatible from 2.2 to 2.3.
> > > > 
> > > > Please could you point where the vhost ABI is broken in 2.2?
> > > 
> > > Thomas, here are the changes to rte_virtio_net.h:
> > > 
> > > 
> > > $ git diff 381316f6a225139d22d39b5ab8d50c40607924ca..19d4d7ef2a216b5418d8edb5b004d1a58bba3cc1 \
> > >       -- lib/librte_vhost/rte_virtio_net.h >
> > [...]
> > 
> > The problem is that the changes are not noticed in the release notes
> > and the LIBABIVER is still 1.
> 
> Yeah, my bad. Firstly, I was not aware of it's an ABI change. Secondly,
> I was landed to this team in the middle of v2.2 release, so that I have
> limited experience of how those works in DPDK community.
> 
> Anyway, it's my fault. I should have realized that in the first time.

No, it's not your fault, and it does not matter who is responsible.

> Should I send a patch to update LIBABIVER to 2 and update release note
> now?

Yes, today or tomorrow please.


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-06 23:07     ` Thomas Monjalon
  2015-12-07  2:00       ` Yuanhan Liu
@ 2015-12-07  6:29       ` Panu Matilainen
  2015-12-07 11:28         ` Thomas Monjalon
  1 sibling, 1 reply; 98+ messages in thread
From: Panu Matilainen @ 2015-12-07  6:29 UTC (permalink / raw)
  To: Thomas Monjalon, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> 2015-12-02 15:53, Panu Matilainen:
>> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
>> you'd need to at least add a deprecation note to 2.2 to be able to do it
>> in 2.3 at all according to the ABI policy.
>>
>> Perhaps a better option would be adding some padding to the structs now
>> for 2.2 since the vhost ABI is broken there anyway. That would at least
>> give a chance to keep it compatible from 2.2 to 2.3.
>
> Please could you point where the vhost ABI is broken in 2.2?
>

The vhost ABI break was announced for DPDK 2.2 in commit 
3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:

> commit 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d
> Author: Ouyang Changchun <changchun.ouyang@intel.com>
> Date:   Tue Jun 16 09:38:43 2015 +0800
>
>     doc: announce ABI changes for vhost-user multiple queues
>
>     It announces the planned ABI changes for vhost-user multiple
>     queues feature on v2.2.
>
>     Signed-off-by: Changchun Ouyang <changchun.ouyang@intel.com>
>     Acked-by: Neil Horman <nhorman@tuxdriver.com>

So the ABI process was properly followed, except for actually bumping
LIBABIVER. Bumping LIBABIVER is mentioned in
doc/guides/contributing/versioning.rst but it doesn't specify *when*
this should be done, e.g. should the first patch breaking the ABI bump it,
or should it be done shortly before the next stable release, or
something else. As it is, it seems a bit too easy to simply forget.

	- Panu -


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07  6:29       ` Panu Matilainen
@ 2015-12-07 11:28         ` Thomas Monjalon
  2015-12-07 11:41           ` Panu Matilainen
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-07 11:28 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 08:29, Panu Matilainen:
> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> > 2015-12-02 15:53, Panu Matilainen:
> >> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
> >> you'd need to at least add a deprecation note to 2.2 to be able to do it
> >> in 2.3 at all according to the ABI policy.
> >>
> >> Perhaps a better option would be adding some padding to the structs now
> >> for 2.2 since the vhost ABI is broken there anyway. That would at least
> >> give a chance to keep it compatible from 2.2 to 2.3.
> >
> > Please could you point where the vhost ABI is broken in 2.2?
> >
> 
> The vhost ABI break was announced for DPDK 2.2 in commit 
> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
[...]
> So the ABI process was properly followed, except for actually bumping 
> LIBABIVER. Bumping LIBABIVER is mentioned in 
> doc/guides/contributing/versioning.rst but it doesn't specify *when* 
> this should be done, eg should the first patch breaking the ABI bump it 
> or should it done be shortly before the next stable release, or 
> something else. As it is, it seems a bit too easy to simply forget.

I thought there was no need to say explicitly that commits must be atomic
and that we do not have to wait to make the required changes.
In this case, I missed it when reviewing the vhost patches breaking the
ABI.


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 11:28         ` Thomas Monjalon
@ 2015-12-07 11:41           ` Panu Matilainen
  2015-12-07 13:55             ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Panu Matilainen @ 2015-12-07 11:41 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> 2015-12-07 08:29, Panu Matilainen:
>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
>>> 2015-12-02 15:53, Panu Matilainen:
>>>> This (and other changes in patch 2 breaks the librte_vhost ABI again, so
>>>> you'd need to at least add a deprecation note to 2.2 to be able to do it
>>>> in 2.3 at all according to the ABI policy.
>>>>
>>>> Perhaps a better option would be adding some padding to the structs now
>>>> for 2.2 since the vhost ABI is broken there anyway. That would at least
>>>> give a chance to keep it compatible from 2.2 to 2.3.
>>>
>>> Please could you point where the vhost ABI is broken in 2.2?
>>>
>>
>> The vhost ABI break was announced for DPDK 2.2 in commit
>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> [...]
>> So the ABI process was properly followed, except for actually bumping
>> LIBABIVER. Bumping LIBABIVER is mentioned in
>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
>> this should be done, eg should the first patch breaking the ABI bump it
>> or should it done be shortly before the next stable release, or
>> something else. As it is, it seems a bit too easy to simply forget.
>
> I thought it was not needed to explicitly say that commits must be atomic
> and we do not have to wait to do the required changes.

The "problem" is that during a development cycle, an ABI could be broken
several times but LIBABIVER should only be bumped once. So ABI breaking
commits will often not be atomic wrt LIBABIVER, no matter which way it's
done.

For example, the libtool recommendation is that library versions are
updated only just before public releases:
https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info

	- Panu -

> In this case, I've missed it when reviewing the vhost patches breaking the
> ABI.
>


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 11:41           ` Panu Matilainen
@ 2015-12-07 13:55             ` Thomas Monjalon
  2015-12-07 16:48               ` Panu Matilainen
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-07 13:55 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 13:41, Panu Matilainen:
> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> > 2015-12-07 08:29, Panu Matilainen:
> >> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> >>> 2015-12-02 15:53, Panu Matilainen:
> >> The vhost ABI break was announced for DPDK 2.2 in commit
> >> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> > [...]
> >> So the ABI process was properly followed, except for actually bumping
> >> LIBABIVER. Bumping LIBABIVER is mentioned in
> >> doc/guides/contributing/versioning.rst but it doesn't specify *when*
> >> this should be done, eg should the first patch breaking the ABI bump it
> >> or should it done be shortly before the next stable release, or
> >> something else. As it is, it seems a bit too easy to simply forget.
> >
> > I thought it was not needed to explicitly say that commits must be atomic
> > and we do not have to wait to do the required changes.
> 
> The "problem" is that during a development cycle, an ABI could be broken 
> several times but LIBABIVER should only be bumped once. So ABI breaking 
> commits will often not be atomic wrt LIBABIVER, no matter which way its 
> done.

If the ABI version has already been changed, there should be a merge conflict.
I think it's better to manage a conflict than forget to update the version.

> For example libtool recommendation is that library versions are updated 
> only just before public releases:
> https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info

Interesting link. It makes me think that we do not manage ABI breaks when
downgrading the library (the case of only new API, keeping the ABI number).

> > In this case, I've missed it when reviewing the vhost patches breaking the
> > ABI.


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 13:55             ` Thomas Monjalon
@ 2015-12-07 16:48               ` Panu Matilainen
  2015-12-07 17:47                 ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Panu Matilainen @ 2015-12-07 16:48 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/07/2015 03:55 PM, Thomas Monjalon wrote:
> 2015-12-07 13:41, Panu Matilainen:
>> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
>>> 2015-12-07 08:29, Panu Matilainen:
>>>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
>>>>> 2015-12-02 15:53, Panu Matilainen:
>>>> The vhost ABI break was announced for DPDK 2.2 in commit
>>>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
>>> [...]
>>>> So the ABI process was properly followed, except for actually bumping
>>>> LIBABIVER. Bumping LIBABIVER is mentioned in
>>>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
>>>> this should be done, eg should the first patch breaking the ABI bump it
>>>> or should it done be shortly before the next stable release, or
>>>> something else. As it is, it seems a bit too easy to simply forget.
>>>
>>> I thought it was not needed to explicitly say that commits must be atomic
>>> and we do not have to wait to do the required changes.

Heh, now that I look more carefully... it IS documented, line 38 of 
contributing/versioning.rst:

 > ABI versions are set at the time of major release labeling, and the
 > ABI may change multiple times, without warning, between the last
 > release label and the HEAD label of the git tree.

>> The "problem" is that during a development cycle, an ABI could be broken
>> several times but LIBABIVER should only be bumped once. So ABI breaking
>> commits will often not be atomic wrt LIBABIVER, no matter which way its
>> done.
>
> If the ABI version has already been changed, there should be a merge conflict.
> I think it's better to manage a conflict than forget to update the version.

What I'm thinking of is something that would tie LIBABIVER to the
deprecation announcement in a way that could be easily checked
(programmatically and manually). As it is now, it's quite non-trivial to
figure out what LIBABIVER *should* be for a given library at a given
point - you need to dig up deprecation.rst history and Makefile history
and whatnot, and it's all quite error-prone.

>> For example libtool recommendation is that library versions are updated
>> only just before public releases:
>> https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html#Updating-version-info
>
> Interesting link. It makes me think that we do not manage ABI break when
> downgrading the library (case of only new API keeping the ABI number).

Hmm, not quite sure what you mean here, but full libtool-style 
versioning is not really needed with symbol versioning.

	- Panu -

>
>>> In this case, I've missed it when reviewing the vhost patches breaking the
>>> ABI.
>


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-07 16:48               ` Panu Matilainen
@ 2015-12-07 17:47                 ` Thomas Monjalon
  0 siblings, 0 replies; 98+ messages in thread
From: Thomas Monjalon @ 2015-12-07 17:47 UTC (permalink / raw)
  To: Panu Matilainen; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2015-12-07 18:48, Panu Matilainen:
> On 12/07/2015 03:55 PM, Thomas Monjalon wrote:
> > 2015-12-07 13:41, Panu Matilainen:
> >> On 12/07/2015 01:28 PM, Thomas Monjalon wrote:
> >>> 2015-12-07 08:29, Panu Matilainen:
> >>>> On 12/07/2015 01:07 AM, Thomas Monjalon wrote:
> >>>>> 2015-12-02 15:53, Panu Matilainen:
> >>>> The vhost ABI break was announced for DPDK 2.2 in commit
> >>>> 3c848bd7b1c6f4f681b833322a748fdefbb5fb2d:
> >>> [...]
> >>>> So the ABI process was properly followed, except for actually bumping
> >>>> LIBABIVER. Bumping LIBABIVER is mentioned in
> >>>> doc/guides/contributing/versioning.rst but it doesn't specify *when*
> >>>> this should be done, eg should the first patch breaking the ABI bump it
> >>>> or should it done be shortly before the next stable release, or
> >>>> something else. As it is, it seems a bit too easy to simply forget.
> >>>
> >>> I thought it was not needed to explicitly say that commits must be atomic
> >>> and we do not have to wait to do the required changes.
> 
> Heh, now that I look more carefully... it IS documented, line 38 of 
> contributing/versioning.rst:
> 
>  > ABI versions are set at the time of major release labeling, and the
>  > ABI may change multiple times, without warning, between the last
>  > release label and the HEAD label of the git tree.

Interesting :)

> >> The "problem" is that during a development cycle, an ABI could be broken
> >> several times but LIBABIVER should only be bumped once. So ABI breaking
> >> commits will often not be atomic wrt LIBABIVER, no matter which way its
> >> done.
> >
> > If the ABI version has already been changed, there should be a merge conflict.
> > I think it's better to manage a conflict than forget to update the version.
> 
> What I'm thinking of is something that would tie LIBABIVER to the 
> deprecation announcement in a way that could be easily checked 
> (programmatically and manually). As it is now, its quite non-trivial to 
> figure what LIBABIVER *should* be for a given library at a given point - 
> you need to dig up deprecation.rst history and Makefile history and 
> whatnot, and its all quite error-prone.

Yes, clearly we need a safer process.
You are welcome to suggest one.
I like the idea of basing it on some "parse-able" RST changes.


* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-02 13:53   ` Panu Matilainen
@ 2015-12-08  5:57   ` Xie, Huawei
  2015-12-08  7:25     ` Yuanhan Liu
  1 sibling, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-08  5:57 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
[...]
> +
> +	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
> +	if (addr == MAP_FAILED) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> +		return -1;
> +	}
Yuanhan:
mmap could fail with a non-zero offset for a huge page based mapping. Check
our workaround in user_set_mem_table.
I think you have done the validation, but I guess off is zero here.
> +
> +	/* TODO: unmap on stop */
> +	dev->log_base = addr;
> +	dev->log_size = size;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index b82108d..013cf38 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
> +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  



* Re: [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-08  5:57   ` Xie, Huawei
@ 2015-12-08  7:25     ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-08  7:25 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Tue, Dec 08, 2015 at 05:57:54AM +0000, Xie, Huawei wrote:
> On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
> [...]
> > +
> > +	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
> > +	if (addr == MAP_FAILED) {
> > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > +		return -1;
> > +	}
> Yuanhan:
> mmap could fail with non-zero offset for huge page based mapping. Check
> our workaround in user_set_mem_table.
> I think you have done the validation, but i guess off is zero here.

Yes, off is zero. Thanks for the reminder; I will fix it in the next version.

	--yliu


* Re: [PATCH 3/4] vhost: log vring changes
  2015-12-02 14:07   ` Victor Kaplansky
  2015-12-02 14:38     ` Yuanhan Liu
@ 2015-12-09  2:45     ` Xie, Huawei
  1 sibling, 0 replies; 98+ messages in thread
From: Xie, Huawei @ 2015-12-09  2:45 UTC (permalink / raw)
  To: Victor Kaplansky, Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On 12/2/2015 10:09 PM, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
>> Invoking vhost_log_write() to mark corresponding page as dirty while
>> updating used vring.
> Looks good, thanks!
>
> I didn't find where you log the dirty pages in result of data
> written to the buffers pointed by the descriptors in RX vring.
> AFAIU, the buffers of RX queue reside in guest's memory and have
> to be marked as dirty if they are written. What do you say?
Yes, that is actually the majority of the work.
Besides, in the first version, we could temporarily ignore zero copy,
which is much more complicated, as we have no idea when the page has
been accessed.
-- Huawei
>
> -- Victor
>
>



* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-02 13:53   ` Victor Kaplansky
  2015-12-02 14:39     ` Yuanhan Liu
@ 2015-12-09  3:33     ` Xie, Huawei
  2015-12-09  3:42       ` Yuanhan Liu
  1 sibling, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-09  3:33 UTC (permalink / raw)
  To: Victor Kaplansky, Yuanhan Liu; +Cc: dev, Michael S. Tsirkin

On 12/2/2015 9:53 PM, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 11:43:11AM +0800, Yuanhan Liu wrote:
>> Introduce vhost_log_write() helper function to log the dirty pages we
>> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
>> log is presented by 1 bit.
>>
>> Therefore, vhost_log_write() simply finds the right bit for related
>> page we are gonna change, and set it to 1. dev->log_base denotes the
>> start of the dirty page bitmap.
>>
>> The page address is biased by log_guest_addr, which is derived from
>> SET_VRING_ADDR request as part of the vring related addresses.
>>
>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> ---
>>  lib/librte_vhost/rte_virtio_net.h | 34 ++++++++++++++++++++++++++++++++++
>>  lib/librte_vhost/virtio-net.c     |  4 ++++
>>  2 files changed, 38 insertions(+)
>>
>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>> index 416dac2..191c1be 100644
>> --- a/lib/librte_vhost/rte_virtio_net.h
>> +++ b/lib/librte_vhost/rte_virtio_net.h
>> @@ -40,6 +40,7 @@
>>   */
>>  
>>  #include <stdint.h>
>> +#include <linux/vhost.h>
>>  #include <linux/virtio_ring.h>
>>  #include <linux/virtio_net.h>
>>  #include <sys/eventfd.h>
>> @@ -59,6 +60,8 @@ struct rte_mbuf;
>>  /* Backend value set by guest. */
>>  #define VIRTIO_DEV_STOPPED -1
>>  
>> +#define VHOST_LOG_PAGE	4096
>> +
>>  
>>  /* Enum for virtqueue management. */
>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>> @@ -82,6 +85,7 @@ struct vhost_virtqueue {
>>  	struct vring_desc	*desc;			/**< Virtqueue descriptor ring. */
>>  	struct vring_avail	*avail;			/**< Virtqueue available ring. */
>>  	struct vring_used	*used;			/**< Virtqueue used ring. */
>> +	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
>>  	uint32_t		size;			/**< Size of descriptor ring. */
>>  	uint32_t		backend;		/**< Backend value to determine if device should started/stopped. */
>>  	uint16_t		vhost_hlen;		/**< Vhost header length (varies depending on RX merge buffers. */
>> @@ -203,6 +207,36 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>>  	return vhost_va;
>>  }
>>  
>> +static inline void __attribute__((always_inline))
>> +vhost_log_page(uint8_t *log_base, uint64_t page)
>> +{
>> +	/* TODO: to make it atomic? */
>> +	log_base[page / 8] |= 1 << (page % 8);
> I think the atomic OR operation is necessary only if there can be
> more than one vhost-user back-end updating the guest's memory
> simultaneously. However probably it is pretty safe to perform
> regular OR operation, since rings are not shared between
> back-end. What about buffers pointed by descriptors?  To be on
> the safe side, I would use a GCC built-in function
> __sync_fetch_and_or(). 
>
>> +}
>> +
>> +static inline void __attribute__((always_inline))
>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +		uint64_t offset, uint64_t len)
>> +{
>> +	uint64_t addr = vq->log_guest_addr;
>> +	uint64_t page;
>> +
>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>> +		     !dev->log_base || !len))
>> +		return;
> Isn't "likely" more appropriate in above, since the whole
> expression is expected to be true most of the time?
Victor:
So we are not always logging. What is the message that tells the backend
that migration has started?
[...]



* Re: [PATCH 0/4 for 2.3] vhost-user live migration support
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (4 preceding siblings ...)
  2015-12-02 14:10 ` [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
@ 2015-12-09  3:41 ` Xie, Huawei
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
  6 siblings, 0 replies; 98+ messages in thread
From: Xie, Huawei @ 2015-12-09  3:41 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/2/2015 11:40 AM, Yuanhan Liu wrote:
> This patch set adds the initial vhost-user live migration support.
>
> The major task behind that is to log pages we touched during
> live migration. So, this patch is basically about adding vhost
> log support, and using it.
>
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
>
> - Patch 3 logs changes we made to used vring.
>
> - Patch 4 sets log_fhmfd protocol feature bit, which actually
>   enables the vhost-user live migration support.
>
> A simple test guide (on same host)
> ==================================
>
> The following test is based on OVS + DPDK. And here is guide
> to setup OVS + DPDK:
>
>     http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
>
> 1. start ovs-vswitchd
>
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
>
> 3. Start a VM1 to connect to vhost0. Here is my example:
>
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
>
> 4. run "ping $host" inside VM1
>
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
>
>    $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444 
>
> 6. connect to VM1 monitor, and start migration:
>
>    > migrate tcp:0:4444
>
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
Is there some formal verification that migration is truly successful? At
least, that the memory we care about in our vhost-user case has been
migrated successfully?
For instance, this patch set misses logging guest RX buffers, yet the
test above would not reveal it.

[...]


* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  3:33     ` Xie, Huawei
@ 2015-12-09  3:42       ` Yuanhan Liu
  2015-12-09  5:44         ` Xie, Huawei
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-09  3:42 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
...
> >> +static inline void __attribute__((always_inline))
> >> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >> +		uint64_t offset, uint64_t len)
> >> +{
> >> +	uint64_t addr = vq->log_guest_addr;
> >> +	uint64_t page;
> >> +
> >> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >> +		     !dev->log_base || !len))
> >> +		return;
> > Isn't "likely" more appropriate in above, since the whole
> > expression is expected to be true most of the time?
> Victor:
> So we are not always logging, what is the message that tells the backend
> the migration is started?

When logging starts, a VHOST_USER_SET_FEATURES request will be sent again,
with the VHOST_F_LOG_ALL feature bit set.

	--yliu


* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  3:42       ` Yuanhan Liu
@ 2015-12-09  5:44         ` Xie, Huawei
  2015-12-09  8:41           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-09  5:44 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/9/2015 11:41 AM, Yuanhan Liu wrote:
> On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
> ...
>>>> +static inline void __attribute__((always_inline))
>>>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
>>>> +		uint64_t offset, uint64_t len)
>>>> +{
>>>> +	uint64_t addr = vq->log_guest_addr;
>>>> +	uint64_t page;
>>>> +
>>>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>>> +		     !dev->log_base || !len))
>>>> +		return;
>>> Isn't "likely" more appropriate in above, since the whole
>>> expression is expected to be true most of the time?
>> Victor:
>> So we are not always logging, what is the message that tells the backend
>> the migration is started?
> When log starts, VHOST_USER_SET_FEATURES request will be sent again,
> with VHOST_F_LOG_ALL feature bit set.
As VHOST_USER_SET_FEATURES handling and rx/tx run asynchronously,
we have to make sure we don't miss logging anything once this feature is
set. For example, in virtio_dev_rx, is dev->features volatile?
> 	--yliu
>



* Re: [PATCH 2/4] vhost: introduce vhost_log_write
  2015-12-09  5:44         ` Xie, Huawei
@ 2015-12-09  8:41           ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-09  8:41 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Dec 09, 2015 at 05:44:11AM +0000, Xie, Huawei wrote:
> On 12/9/2015 11:41 AM, Yuanhan Liu wrote:
> > On Wed, Dec 09, 2015 at 03:33:16AM +0000, Xie, Huawei wrote:
> > ...
> >>>> +static inline void __attribute__((always_inline))
> >>>> +vhost_log_write(struct virtio_net *dev, struct vhost_virtqueue *vq,
> >>>> +		uint64_t offset, uint64_t len)
> >>>> +{
> >>>> +	uint64_t addr = vq->log_guest_addr;
> >>>> +	uint64_t page;
> >>>> +
> >>>> +	if (unlikely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >>>> +		     !dev->log_base || !len))
> >>>> +		return;
> >>> Isn't "likely" more appropriate in above, since the whole
> >>> expression is expected to be true most of the time?
> >> Victor:
> >> So we are not always logging, what is the message that tells the backend
> >> the migration is started?
> > When log starts, VHOST_USER_SET_FEATURES request will be sent again,
> > with VHOST_F_LOG_ALL feature bit set.
> As the VHOST_USER_SET_FEATURES handling and rx/tx runs asynchronously,
> we have to make sure we don't miss logging anything when this feature is
> set.

That's a good reminder. Thanks.

> For example, I doubt like in virtio_dev_rx, is the dev->features
> volatile?

No, it is not volatile.
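
One possible way to address the visibility concern, as a minimal sketch
only: keep the feature word in a C11 atomic and re-read it in the data
path before deciding whether to log. The struct and function names below
are hypothetical and not part of librte_vhost; VHOST_F_LOG_ALL is
redefined only so the sketch builds without the kernel headers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit number used by <linux/vhost.h>; redefined here only for the sketch. */
#ifndef VHOST_F_LOG_ALL
#define VHOST_F_LOG_ALL 26
#endif

static_assert(VHOST_F_LOG_ALL < 64, "feature bit must fit in 64 bits");

/* Hypothetical stand-in for the features field of struct virtio_net. */
struct dev_features {
	_Atomic uint64_t features;
};

/* Control path: called when VHOST_USER_SET_FEATURES is handled. */
static inline void
set_features(struct dev_features *dev, uint64_t features)
{
	atomic_store_explicit(&dev->features, features, memory_order_release);
}

/* Data path: load the current feature word, then test the log bit. */
static inline bool
log_enabled(struct dev_features *dev)
{
	uint64_t f = atomic_load_explicit(&dev->features,
					  memory_order_acquire);

	return (f & (1ULL << VHOST_F_LOG_ALL)) != 0;
}
```

This only makes the new feature word visible to later loads; it does not
by itself cover a burst that is already in flight when the request
arrives.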

	--yliu


* [PATCH v2 0/6] vhost-user live migration support
  2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
                   ` (5 preceding siblings ...)
  2015-12-09  3:41 ` Xie, Huawei
@ 2015-12-17  3:11 ` Yuanhan Liu
  2015-12-17  3:11   ` [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                     ` (8 more replies)
  6 siblings, 9 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Yang Maggie, Victor Kaplansky

This patch set adds the vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration, including used vring and desc buffer. So, this
patch set is basically about adding vhost log support, and
using it.

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 logs changes we made to vring desc buffer.

- Patches 5 and 6 add some feature bits related to live migration.


A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK (check [0] for
how to set up OVS + DPDK):

    [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

Here is the rough test guide:

1. Start ovs-vswitchd

2. Add two ovs vhost-user ports, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. Run "ping $host" inside VM1

5. Start VM2 to connect to vhost1, marking it as the target
   of live migration (by adding the -incoming tcp:0:4444 option)

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. Connect to the VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.


Cc: Chen Zhihui <zhihui.chen@intel.com>
Cc: Yang Maggie <maggie.yang@intel.com>
---
Yuanhan Liu (6):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log used vring changes
  vhost: log vring desc buffer changes
  vhost: claim that we support GUEST_ANNOUNCE feature
  vhost: enable log_shmfd protocol feature

 lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
 lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
 lib/librte_vhost/virtio-net.c                 |  5 ++
 7 files changed, 165 insertions(+), 30 deletions(-)

-- 
1.9.0


* [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-21 15:32     ` Xie, Huawei
  2015-12-17  3:11   ` [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

    typedef struct VhostUserLog {
            uint64_t mmap_size;
            uint64_t mmap_offset;
    } VhostUserLog;

Also, an fd is delivered from QEMU as ancillary data.

With this information, an area of memory is mmapped, and the start of
the log (the mapping address plus mmap_offset) is assigned to
dev->log_base for logging dirty pages.
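
The mapping step can be sketched in isolation as follows. This is an
illustration under stated assumptions, not the code in this patch:
struct log_map and map_log_region() are hypothetical names, a 64-bit
host is assumed, and closing the fd after mmap() is a choice made by the
sketch (the mapping stays valid after the descriptor is closed).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical holder for the mapping; not a DPDK structure. */
struct log_map {
	void    *map_addr;	/* start of the mapping, needed for munmap() */
	size_t   map_len;	/* mmap_size + mmap_offset */
	uint8_t *log_base;	/* map_addr + mmap_offset: start of the bitmap */
	uint64_t log_size;	/* mmap_size: bitmap length in bytes */
};

static inline int
map_log_region(int fd, uint64_t size, uint64_t off, struct log_map *out)
{
	void *addr;

	assert(out != NULL);
	/* Reject a bad fd, an empty log, and a size + off that would wrap. */
	if (fd < 0 || size == 0 || off > UINT64_MAX - size)
		return -1;

	/* Map from file offset 0, then skip 'off' bytes by hand. */
	addr = mmap(NULL, (size_t)(size + off), PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap log base");
		close(fd);
		return -1;
	}
	close(fd);

	out->map_addr = addr;
	out->map_len  = (size_t)(size + off);
	out->log_base = (uint8_t *)addr + off;
	out->log_size = size;
	return 0;
}
```

Keeping map_addr and map_len around is what would let a later stop path
call munmap() on the whole region.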

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
---

v2: workaround mmap issue when offset is not zero
---
 lib/librte_vhost/rte_virtio_net.h             |  4 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
-	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
+	uint64_t		log_size;	/**< Size of log area */
+	uint64_t		log_base;	/**< Where dirty pages are logged */
+	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
-		break;
+		user_set_log_base(ctx, &msg);
 
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..b77c9b3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	/*
+	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
+	 * failed when offset is not page size aligned.
+	 */
+	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = (uint64_t)(uintptr_t)addr + off;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0


* [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
  2015-12-17  3:11   ` [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-21 15:06     ` Xie, Huawei
  2015-12-22  5:11     ` Peter Xu
  2015-12-17  3:11   ` [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
                     ` (6 subsequent siblings)
  8 siblings, 2 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introduce vhost_log_write() helper function to log the dirty pages we
touched. The page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
page is represented by 1 bit.

Therefore, vhost_log_write() simply finds the bit for each page we are
going to change and sets it to 1. dev->log_base denotes the start of
the dirty page bitmap.
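
The bitmap arithmetic can be sketched on its own as below. This is a
standalone illustration, not the code in this patch: log_page() and
log_write() are hypothetical names, LOG_PAGE_SIZE stands in for
VHOST_LOG_PAGE, and the atomic OR uses the GCC builtin
__sync_fetch_and_or() that was suggested during review of v1. The loop
walks page numbers one at a time, and a range whose last page falls
outside the bitmap is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096	/* stands in for VHOST_LOG_PAGE */

static_assert((LOG_PAGE_SIZE & (LOG_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* Set the bit of one page: byte page / 8, bit page % 8. */
static inline void
log_page(uint8_t *log_base, uint64_t page)
{
	__sync_fetch_and_or(&log_base[page / 8], (uint8_t)(1 << (page % 8)));
}

/* Mark every page touched by the guest range [addr, addr + len) dirty. */
static inline void
log_write(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, last;

	if (log_base == NULL || len == 0)
		return;

	/* Number of the last page touched by the range. */
	last = (addr + len - 1) / LOG_PAGE_SIZE;
	/* Its bit must live inside the log_size bytes of the bitmap. */
	if (last / 8 >= log_size)
		return;

	for (page = addr / LOG_PAGE_SIZE; page <= last; page++)
		log_page(log_base, page);
}
```

For example, a 2-byte write at guest address 4095 straddles pages 0 and
1, so it sets the two lowest bits of the first bitmap byte.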

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
---
 lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..5726683 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -59,6 +60,8 @@ struct rte_mbuf;
 /* Backend value set by guest. */
 #define VIRTIO_DEV_STOPPED -1
 
+#define VHOST_LOG_PAGE	4096
+
 
 /* Enum for virtqueue management. */
 enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
@@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += VHOST_LOG_PAGE;
+	}
+}
+
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
-- 
1.9.0


* [PATCH v2 3/6] vhost: log used vring changes
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
  2015-12-17  3:11   ` [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2015-12-17  3:11   ` [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-22  6:55     ` Peter Xu
  2015-12-17  3:11   ` [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introduce a vhost_log_write() wrapper, vhost_log_used_vring(), to
log used vring changes.
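
The address computation behind the wrapper can be sketched as follows:
the guest address to log is the guest address of the used ring plus the
offset of the field that was written. The structs below are local
mirrors of the used ring layout from <linux/virtio_ring.h>, and the
helper names are hypothetical; this is an illustration, not the code in
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the used ring layout; names are for this sketch only. */
struct used_elem {
	uint32_t id;	/* head index of the completed descriptor chain */
	uint32_t len;	/* number of bytes written into the chain */
};

struct used_ring {
	uint16_t flags;
	uint16_t idx;
	struct used_elem ring[];
};

static_assert(offsetof(struct used_ring, ring) == 4,
	      "unexpected used ring layout");

/* Guest address of used->ring[i], given the guest address of the ring. */
static inline uint64_t
used_elem_guest_addr(uint64_t log_guest_addr, uint16_t i)
{
	return log_guest_addr + offsetof(struct used_ring, ring) +
	       (uint64_t)i * sizeof(struct used_elem);
}

/* Guest address of used->idx. */
static inline uint64_t
used_idx_guest_addr(uint64_t log_guest_addr)
{
	return log_guest_addr + offsetof(struct used_ring, idx);
}
```

With the used ring at guest address 0x1000, entry 3 starts at 0x101c and
is 8 bytes long, so that 8-byte range is what would be handed to the
logging helper.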

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
---
 lib/librte_vhost/rte_virtio_net.h |  3 +-
 lib/librte_vhost/vhost_rxtx.c     | 80 +++++++++++++++++++++++++++------------
 lib/librte_vhost/virtio-net.c     |  4 ++
 3 files changed, 62 insertions(+), 25 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 5726683..0f83719 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -93,7 +93,8 @@ struct vhost_virtqueue {
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
 	int			enabled;
-	uint64_t		reserved[16];		/**< Reserve some spaces for future extension. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
+	uint64_t		reserved[15];		/**< Reserve some spaces for future extension. */
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..f305acd 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -49,6 +49,16 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
 	return (is_tx ^ (idx & 1)) == 0 && idx < qp_nb * VIRTIO_QNUM;
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		     uint64_t offset, uint64_t len)
+{
+	uint64_t addr;
+
+	addr = vq->log_guest_addr + offset;
+	vhost_log_write(dev, addr, len);
+}
+
 /**
  * This function adds buffers to the virtio devices RX virtqueue. Buffers can
  * be received from the physical port or from another virtio device. A packet
@@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_used_vring(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_used_vring(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+			dev->device_fh, vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v2 4/6] vhost: log vring desc buffer changes
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (2 preceding siblings ...)
  2015-12-17  3:11   ` [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
@ 2015-12-17  3:11   ` Yuanhan Liu
  2015-12-17  3:12   ` [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:11 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we copy a buf to vring desc, we need to log it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
---
 lib/librte_vhost/vhost_rxtx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index f305acd..c2d514b 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -71,7 +71,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint32_t count)
 {
 	struct vhost_virtqueue *vq;
-	struct vring_desc *desc;
+	struct vring_desc *desc, *hdr_desc;
 	struct rte_mbuf *buff;
 	/* The virtio_hdr is initialised to 0. */
 	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
@@ -153,6 +153,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		/* Copy virtio_hdr to packet and increment buffer address */
 		buff_hdr_addr = buff_addr;
+		hdr_desc = desc;
 
 		/*
 		 * If the descriptors are chained the header and data are
@@ -177,6 +178,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 			rte_memcpy((void *)(uintptr_t)(buff_addr + vb_offset),
 				rte_pktmbuf_mtod_offset(buff, const void *, offset),
 				len_to_cpy);
+			vhost_log_write(dev, desc->addr + vb_offset, len_to_cpy);
 			PRINT_PACKET(dev, (uintptr_t)(buff_addr + vb_offset),
 				len_to_cpy, 0);
 
@@ -232,6 +234,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
 			(const void *)&virtio_hdr, vq->vhost_hlen);
+		vhost_log_write(dev, hdr_desc->addr, vq->vhost_hlen);
 
 		PRINT_PACKET(dev, (uintptr_t)buff_hdr_addr, vq->vhost_hlen, 1);
 
@@ -309,6 +312,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 	rte_memcpy((void *)(uintptr_t)vb_hdr_addr,
 		(const void *)&virtio_hdr, vq->vhost_hlen);
+	vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr, vq->vhost_hlen);
 
 	PRINT_PACKET(dev, (uintptr_t)vb_hdr_addr, vq->vhost_hlen, 1);
 
@@ -353,6 +357,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 		rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
 			rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
 			cpy_len);
+		vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr + vb_offset,
+			cpy_len);
 
 		PRINT_PACKET(dev,
 			(uintptr_t)(vb_addr + vb_offset),
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (3 preceding siblings ...)
  2015-12-17  3:11   ` [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
@ 2015-12-17  3:12   ` Yuanhan Liu
  2015-12-22  8:11     ` Peter Xu
  2015-12-17  3:12   ` [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:12 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

This feature is already enabled in the Linux kernel. All we need to do
is claim that we support it, and nothing else.

With that, the guest will send gratuitous ARP (GARP) messages after live
migration.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/virtio-net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
 #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
 				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
 				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
 				(VHOST_SUPPORTS_MQ)            | \
 				(1ULL << VIRTIO_F_VERSION_1)   | \
 				(1ULL << VHOST_F_LOG_ALL)      | \
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v2 6/6] vhost: enable log_shmfd protocol feature
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (4 preceding siblings ...)
  2015-12-17  3:12   ` [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2015-12-17  3:12   ` Yuanhan Liu
  2015-12-17 12:08   ` [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17  3:12 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

This claims vhost-user live migration support: the SET_LOG_BASE request
will be sent only when this feature flag is set.

Besides this flag, another feature flag must be set to make vhost-user
live migration work: VHOST_F_LOG_ALL, which, however, was enabled a
long time ago.
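
As a rough illustration of how the two flags combine, here is a minimal
sketch. The helper can_log_dirty_pages() is hypothetical and not part of
this patch; the two protocol bit numbers are taken from the diff below,
and VHOST_F_LOG_ALL is assumed to be 26 as in <linux/vhost.h>:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Protocol feature bits, as defined in the diff of this patch. */
#define VHOST_USER_PROTOCOL_F_MQ	0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1

#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))

/* Assumption: value copied from <linux/vhost.h>. */
#define VHOST_F_LOG_ALL			26

/* Both bit numbers must fit into a 64-bit feature mask. */
static_assert(VHOST_USER_PROTOCOL_F_LOG_SHMFD < 64, "bit out of range");
static_assert(VHOST_F_LOG_ALL < 64, "bit out of range");

/*
 * Hypothetical helper: dirty page logging can only work when the
 * log_shmfd protocol feature was negotiated (so SET_LOG_BASE is sent)
 * and VHOST_F_LOG_ALL is set in the virtio feature bits.
 */
static inline bool
can_log_dirty_pages(uint64_t features, uint64_t protocol_features)
{
	bool has_shmfd = protocol_features &
			 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD);
	bool has_log_all = features & (1ULL << VHOST_F_LOG_ALL);

	return has_shmfd && has_log_all;
}
```

The helper only expresses the dependency described above; it says
nothing about when QEMU actually sets VHOST_F_LOG_ALL.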

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..a3a889d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 0/6] vhost-user live migration support
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (5 preceding siblings ...)
  2015-12-17  3:12   ` [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2015-12-17 12:08   ` Iremonger, Bernard
  2015-12-17 12:45     ` Yuanhan Liu
  2015-12-21  8:17   ` Pavel Fedin
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
  8 siblings, 1 reply; 98+ messages in thread
From: Iremonger, Bernard @ 2015-12-17 12:08 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Michael S. Tsirkin, Yang, Maggie, Victor Kaplansky

Hi Yuanhan,

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, December 17, 2015 3:12 AM
> To: dev@dpdk.org
> Cc: Xie, Huawei <huawei.xie@intel.com>; Michael S. Tsirkin
> <mst@redhat.com>; Victor Kaplansky <vkaplans@redhat.com>; Iremonger,
> Bernard <bernard.iremonger@intel.com>; Pavel Fedin
> <p.fedin@samsung.com>; Peter Xu <peterx@redhat.com>; Yuanhan Liu
> <yuanhan.liu@linux.intel.com>; Chen, Zhihui <zhihui.chen@intel.com>;
> Yang, Maggie <maggie.yang@intel.com>
> Subject: [PATCH v2 0/6] vhost-user live migration support
> 
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during live migration,
> including used vring and desc buffer. So, this patch set is basically about
> adding vhost log support, and using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
> 
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 6 add some feature bits related to live migration.
> 
>
 
The following test guide should probably be added to the DPDK doc files.
It could be added to the sample app guide or the programmer's guide.
There is already a Vhost Library section in the programmer's guide and
a Vhost Sample Application section in the sample app guide.


> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user ports, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost1, and mark it as the target
>    of live migration (by adding the -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Cc: Chen Zhihui <zhihui.chen@intel.com>
> Cc: Yang Maggie <maggie.yang@intel.com>
> ---
> Yuanhan Liu (6):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log used vring changes
>   vhost: log vring desc buffer changes
>   vhost: claim that we support GUEST_ANNOUNCE feature
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
>  lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  5 ++
>  7 files changed, 165 insertions(+), 30 deletions(-)
> 
> --
> 1.9.0

Regards,

Bernard.

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 0/6] vhost-user live migration support
  2015-12-17 12:08   ` [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
@ 2015-12-17 12:45     ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-17 12:45 UTC (permalink / raw)
  To: Iremonger, Bernard
  Cc: Michael S. Tsirkin, dev, Yang, Maggie, Victor Kaplansky

On Thu, Dec 17, 2015 at 12:08:13PM +0000, Iremonger, Bernard wrote:
> Hi Yuanhan,
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > Sent: Thursday, December 17, 2015 3:12 AM
> > To: dev@dpdk.org
> > Cc: Xie, Huawei <huawei.xie@intel.com>; Michael S. Tsirkin
> > <mst@redhat.com>; Victor Kaplansky <vkaplans@redhat.com>; Iremonger,
> > Bernard <bernard.iremonger@intel.com>; Pavel Fedin
> > <p.fedin@samsung.com>; Peter Xu <peterx@redhat.com>; Yuanhan Liu
> > <yuanhan.liu@linux.intel.com>; Chen, Zhihui <zhihui.chen@intel.com>;
> > Yang, Maggie <maggie.yang@intel.com>
> > Subject: [PATCH v2 0/6] vhost-user live migration support
> > 
> > This patch set adds the vhost-user live migration support.
> > 
> > The major task behind that is to log pages we touched during live migration,
> > including used vring and desc buffer. So, this patch set is basically about
> > adding vhost log support, and using it.
> > 
> > Patchset
> > ========
> > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> >   the dirty memory bitmap is.
> > 
> > - Patch 2 introduces a vhost_log_write() helper function to log
> >   pages we are gonna change.
> > 
> > - Patch 3 logs changes we made to used vring.
> > 
> > - Patch 4 logs changes we made to vring desc buffer.
> > 
> > - Patch 5 and 6 add some feature bits related to live migration.
> > 
> >
>  
> The following test guide should probably be added to the DPDK doc files.

Yes, but not this one, which is a fairly rough one. The official one
should do live migration between two hosts.

> It could be added to the sample app guide or the programmers guide.
> There is already a Vhost Library  section in the programmers guide and
> A Vhost Sample Application section in the sample app guide.

We may do that after validation by the validation team.

	--yliu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 0/6] vhost-user live migration support
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (6 preceding siblings ...)
  2015-12-17 12:08   ` [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
@ 2015-12-21  8:17   ` Pavel Fedin
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
  8 siblings, 0 replies; 98+ messages in thread
From: Pavel Fedin @ 2015-12-21  8:17 UTC (permalink / raw)
  To: 'Yuanhan Liu', dev
  Cc: 'Michael S. Tsirkin', 'Yang Maggie',
	'Victor Kaplansky'

 Works fine.

 Tested-by: Pavel Fedin <p.fedin@samsung.com>

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Thursday, December 17, 2015 6:12 AM
> To: dev@dpdk.org
> Cc: huawei.xie@intel.com; Michael S. Tsirkin; Victor Kaplansky; Iremonger Bernard; Pavel
> Fedin; Peter Xu; Yuanhan Liu; Chen Zhihui; Yang Maggie
> Subject: [PATCH v2 0/6] vhost-user live migration support
> 
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration, including used vring and desc buffer. So, this
> patch set is basically about adding vhost log support, and
> using it.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
> 
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 6 add some feature bits related to live migration.
> 
> 
> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for
> how to setup OVS + DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user ports, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost1, and mark it as the target
>    of live migration (by adding the -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 
> 
> Cc: Chen Zhihui <zhihui.chen@intel.com>
> Cc: Yang Maggie <maggie.yang@intel.com>
> ---
> Yuanhan Liu (6):
>   vhost: handle VHOST_USER_SET_LOG_BASE request
>   vhost: introduce vhost_log_write
>   vhost: log used vring changes
>   vhost: log vring desc buffer changes
>   vhost: claim that we support GUEST_ANNOUNCE feature
>   vhost: enable log_shmfd protocol feature
> 
>  lib/librte_vhost/rte_virtio_net.h             | 36 ++++++++++-
>  lib/librte_vhost/vhost_rxtx.c                 | 88 +++++++++++++++++++--------
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  5 +-
>  lib/librte_vhost/virtio-net.c                 |  5 ++
>  7 files changed, 165 insertions(+), 30 deletions(-)
> 
> --
> 1.9.0

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11   ` [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
@ 2015-12-21 15:06     ` Xie, Huawei
  2015-12-22  2:40       ` Yuanhan Liu
  2015-12-22  5:11     ` Peter Xu
  1 sibling, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-21 15:06 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
> page is represented by 1 bit.
>
> Therefore, vhost_log_write() simply finds the right bit for each
> page we are about to change, and sets it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com>
> ---
>  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 8acee02..5726683 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -40,6 +40,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <linux/vhost.h>
>  #include <linux/virtio_ring.h>
>  #include <linux/virtio_net.h>
>  #include <sys/eventfd.h>
> @@ -59,6 +60,8 @@ struct rte_mbuf;
>  /* Backend value set by guest. */
>  #define VIRTIO_DEV_STOPPED -1
>  
> +#define VHOST_LOG_PAGE	4096
> +
>  
>  /* Enum for virtqueue management. */
>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>  	return vhost_va;
>  }
>  
> +static inline void __attribute__((always_inline))
> +vhost_log_page(uint8_t *log_base, uint64_t page)
> +{
> +	log_base[page / 8] |= 1 << (page % 8);
> +}
> +
Those logging functions are not supposed to be API. Could we move them
into an internal header file?
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> +	uint64_t page;
> +
Before we log, we need a memory barrier to make sure updates are in place.
> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		   !dev->log_base || !len))
> +		return;
> +
> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> +		return;
> +
> +	page = addr / VHOST_LOG_PAGE;
> +	while (page * VHOST_LOG_PAGE < addr + len) {
Let us have a page_end var to make the code simpler?
> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> +		page += VHOST_LOG_PAGE;
page += 1?
> +	}
> +}
> +
> +
>  /**
>   *  Disable features in feature_mask. Returns 0 on success.
>   */
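
For reference, a minimal standalone sketch of how the helper could look
with the comments above applied (a page_end variable, page += 1, and a
barrier before logging). The names log_page(), log_write() and
LOG_PAGE_SIZE are made up for illustration and this is not the code from
the patch. The C11 fence merely stands in for whatever barrier primitive
the library would use, and the bounds check is stricter than the one in
the patch, which is an assumption on my side:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096ULL	/* same value as VHOST_LOG_PAGE above */

/* Set the bit that represents one page in the dirty bitmap. */
static inline void
log_page(uint8_t *log_base, uint64_t page)
{
	assert(log_base != NULL);
	log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapped by [addr, addr + len) as dirty.
 * log_size is the bitmap size in bytes.
 * Returns the number of pages marked (0 if nothing was logged).
 */
static inline uint64_t
log_write(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, page_end, marked = 0;

	if (log_base == NULL || len == 0)
		return 0;

	page = addr / LOG_PAGE_SIZE;
	page_end = (addr + len - 1) / LOG_PAGE_SIZE;	/* last page, inclusive */

	/* Assumption: refuse writes whose last page is outside the bitmap. */
	if (page_end / 8 >= log_size)
		return 0;

	/* Make sure the data writes are visible before the log bits. */
	atomic_thread_fence(memory_order_seq_cst);

	for (; page <= page_end; page++) {
		log_page(log_base, page);
		marked++;
	}

	return marked;
}
```

With a 16-byte bitmap, log_write(bitmap, 16, 4095, 2) touches pages 0
and 1, since the two bytes straddle a page boundary.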


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-17  3:11   ` [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2015-12-21 15:32     ` Xie, Huawei
  2015-12-22  2:25       ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-21 15:32 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
> vhost-user) where we should log dirty pages, and how big the log
> buffer is.
>
> This request introduces a new payload:
>
>     typedef struct VhostUserLog {
>             uint64_t mmap_size;
>             uint64_t mmap_offset;
>     } VhostUserLog;
>
> Also, a fd is delivered from QEMU by ancillary data.
>
> With those info given, an area of memory is mmaped, assigned
> to dev->log_base, for logging dirty pages.
>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com>
> ---
>
> v2: workaround mmap issue when offset is not zero
> ---
>  lib/librte_vhost/rte_virtio_net.h             |  4 ++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
>  lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
>  lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
>  5 files changed, 63 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 10dcb90..8acee02 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -129,7 +129,9 @@ struct virtio_net {
>  	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
>  	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
>  	void			*priv;		/**< private context */
> -	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
> +	uint64_t		log_size;	/**< Size of log area */
> +	uint64_t		log_base;	/**< Where dirty pages are logged */
> +	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>  
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index 8b7a448..32ad6f6 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>  		break;
>  
>  	case VHOST_USER_SET_LOG_BASE:
> -		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> -		break;
> +		user_set_log_base(ctx, &msg);
>  
> +		/* it needs a reply */
> +		msg.size = sizeof(msg.payload.u64);
> +		send_vhost_message(connfd, &msg);
> +		break;
>  	case VHOST_USER_SET_LOG_FD:
>  		close(msg.fds[0]);
>  		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
> index 38637cc..6d252a3 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.h
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
> @@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
>  	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
>  } VhostUserMemory;
>  
> +typedef struct VhostUserLog {
> +	uint64_t mmap_size;
> +	uint64_t mmap_offset;
> +} VhostUserLog;
> +
>  typedef struct VhostUserMsg {
>  	VhostUserRequest request;
>  
> @@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
>  		struct vhost_vring_state state;
>  		struct vhost_vring_addr addr;
>  		VhostUserMemory memory;
> +		VhostUserLog    log;
>  	} payload;
>  	int fds[VHOST_MEMORY_MAX_NREGIONS];
>  } __attribute((packed)) VhostUserMsg;
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 2934d1c..b77c9b3 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
>  
>  	dev->protocol_features = protocol_features;
>  }
> +
> +int
> +user_set_log_base(struct vhost_device_ctx ctx,
> +		 struct VhostUserMsg *msg)
> +{
> +	struct virtio_net *dev;
> +	int fd = msg->fds[0];
> +	uint64_t size, off;
> +	void *addr;
> +
> +	dev = get_device(ctx);
> +	if (!dev)
> +		return -1;
> +
> +	if (fd < 0) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
> +		return -1;
> +	}
> +
> +	if (msg->size != sizeof(VhostUserLog)) {
> +		RTE_LOG(ERR, VHOST_CONFIG,
> +			"invalid log base msg size: %"PRId32" != %d\n",
> +			msg->size, (int)sizeof(VhostUserLog));
> +		return -1;
> +	}
> +
> +	size = msg->payload.log.mmap_size;
> +	off  = msg->payload.log.mmap_offset;
> +	RTE_LOG(INFO, VHOST_CONFIG,
> +		"log mmap size: %"PRId64", offset: %"PRId64"\n",
> +		size, off);
> +
> +	/*
> +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> +	 * failed when offset is not page size aligned.
> +	 */
s/will be failed/will fail/
mmap will fail when the offset is not zero.
Also, we only know that this workaround is needed for hugetlbfs. We are
not sure about other filesystems such as tmpfs, so mention hugetlbfs here.
> +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> +	if (addr == MAP_FAILED) {
> +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> +		return -1;
> +	}
> +
> +	/* TODO: unmap on stop */
> +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
(uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> +	dev->log_size = size;
> +
> +	return 0;
> +}
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index b82108d..013cf38 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
> +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  
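
To make the workaround under discussion concrete, here is a minimal
standalone sketch: map from file offset 0 with a length of size + off,
then add off to the returned address. struct log_map, map_log_area(),
unmap_log_area() and make_backing_fd() are hypothetical names, not
functions from the patch. The sketch also keeps the original mapping
address and length so the area can be unmapped later, which the patch
leaves as a TODO:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct log_map {
	void	 *map_addr;	/* what mmap() returned, needed for munmap() */
	uint64_t  map_len;	/* size + off */
	uint64_t  log_base;	/* map_addr + off, kept as an integer */
	uint64_t  log_size;	/* usable size of the log area */
};

/* Map the log area from offset 0 and skip "off" bytes afterwards. */
static int
map_log_area(int fd, uint64_t size, uint64_t off, struct log_map *out)
{
	void *addr;

	if (fd < 0 || out == NULL || size == 0)
		return -1;

	addr = mmap(NULL, size + off, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	out->map_addr = addr;
	out->map_len  = size + off;
	/* Cast the pointer to an integer first, then add the offset. */
	out->log_base = (uint64_t)(uintptr_t)addr + off;
	out->log_size = size;

	return 0;
}

/* Undo map_log_area(); safe to call more than once. */
static void
unmap_log_area(struct log_map *m)
{
	assert(m != NULL);
	if (m->map_addr != NULL) {
		munmap(m->map_addr, m->map_len);
		m->map_addr = NULL;
	}
}

/* Demo scaffolding only: an anonymous temporary file of "len" bytes. */
static int
make_backing_fd(off_t len)
{
	FILE *fp = tmpfile();
	int fd;

	if (fp == NULL)
		return -1;
	fd = dup(fileno(fp));
	fclose(fp);
	if (fd >= 0 && ftruncate(fd, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The temporary file is an ordinary file, not hugetlbfs, so this only
shows the address arithmetic and does not reproduce the mmap failure
itself.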


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-21 15:32     ` Xie, Huawei
@ 2015-12-22  2:25       ` Yuanhan Liu
  2015-12-22  2:41         ` Xie, Huawei
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:25 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > +
> > +	/*
> > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > +	 * failed when offset is not page size aligned.
> > +	 */
> s /will be failed/will fail/
> mmap will fail when offset is not zero.
> Also we only know this workaround is for hugetlbfs. Not sure of other
> tmpfs, so mention hugetlbfs here.

I have already mentioned "to work around a __hugepage__ mmap bug"; is
that not enough?

> > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > +	if (addr == MAP_FAILED) {
> > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > +		return -1;
> > +	}
> > +
> > +	/* TODO: unmap on stop */
> > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?

No, addr is of (void *) type; we should cast it to uint64_t first,
before adding "off" to it.

	--yliu

> > +	dev->log_size = size;
> > +
> > +	return 0;
> > +}
> > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > index b82108d..013cf38 100644
> > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
> >  
> >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> >  				uint64_t protocol_features);
> > +int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> >  
> >  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
> >  
> 
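
A minimal sketch of the mapping approach discussed above, not the patch
itself: map the file from offset 0 with length size + off, then add off
to the integer form of the returned address. The helper name
map_log_base() and its return convention (0 on failure) are assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: map [0, off + size) of fd and return the address
 * of byte "off" inside that mapping as a uint64_t, or 0 on failure.
 * Mapping from offset 0 sidesteps any alignment requirement on "off".
 */
static uint64_t
map_log_base(int fd, uint64_t off, uint64_t size)
{
	void *addr;

	assert(size > 0);

	addr = mmap(NULL, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return 0;

	/* Cast the pointer to an integer first, then add the offset. */
	return (uint64_t)(uintptr_t)addr + off;
}
```

The cost of this choice is that the first "off" bytes of the file are
mapped but never used.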

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-21 15:06     ` Xie, Huawei
@ 2015-12-22  2:40       ` Yuanhan Liu
  2015-12-22  2:45         ` Xie, Huawei
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:40 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Dec 21, 2015 at 03:06:43PM +0000, Xie, Huawei wrote:
> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> > log is presented by 1 bit.
> >
> > Therefore, vhost_log_write() simply finds the right bit for related
> > page we are gonna change, and set it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> >
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > Signed-off-by: Victor Kaplansky <victork@redhat.com
> > ---
> >  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> > index 8acee02..5726683 100644
> > --- a/lib/librte_vhost/rte_virtio_net.h
> > +++ b/lib/librte_vhost/rte_virtio_net.h
> > @@ -40,6 +40,7 @@
> >   */
> >  
> >  #include <stdint.h>
> > +#include <linux/vhost.h>
> >  #include <linux/virtio_ring.h>
> >  #include <linux/virtio_net.h>
> >  #include <sys/eventfd.h>
> > @@ -59,6 +60,8 @@ struct rte_mbuf;
> >  /* Backend value set by guest. */
> >  #define VIRTIO_DEV_STOPPED -1
> >  
> > +#define VHOST_LOG_PAGE	4096
> > +
> >  
> >  /* Enum for virtqueue management. */
> >  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
> > @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
> >  	return vhost_va;
> >  }
> >  
> > +static inline void __attribute__((always_inline))
> > +vhost_log_page(uint8_t *log_base, uint64_t page)
> > +{
> > +	log_base[page / 8] |= 1 << (page % 8);
> > +}
> > +
> Those logging functions are not supposed to be API. Could we move them
> into an internal header file?

Agreed. I should have put them into vhost_rxtx.c

> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +	uint64_t page;
> > +
> Before we log, we need memory barrier to make sure updates are in place.
> > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		   !dev->log_base || !len))
> > +		return;

Put a memory barrier inside set_features()?

I see no variable dependence here, so why put a barrier there? We are
accessing and modifying the same variable; doesn't the cache MESI
protocol take care of your concerns?

> > +
> > +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> > +		return;
> > +
> > +	page = addr / VHOST_LOG_PAGE;
> > +	while (page * VHOST_LOG_PAGE < addr + len) {
> Let us have a page_end var to make the code simpler?

Could do that.


> > +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> > +		page += VHOST_LOG_PAGE;
> page += 1?

Oops, right.

	--yliu

> > +	}
> > +}
> > +
> > +
> >  /**
> >   *  Disable features in feature_mask. Returns 0 on success.
> >   */
> 
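
A minimal sketch of the logging loop with the review comments above
applied (a page_end variable, and page advancing by 1 rather than by
VHOST_LOG_PAGE). The names log_page(), log_range() and LOG_PAGE_SIZE are
local stand-ins, and passing log_base and log_size directly instead of a
struct virtio_net is an assumption made to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096	/* same value as VHOST_LOG_PAGE in the patch */

/* Set the bit for one page: byte page / 8, bit page % 8. */
static void
log_page(uint8_t *log_base, uint64_t page)
{
	log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapped by [addr, addr + len) as dirty.
 * "page" counts pages, not bytes, so it advances by 1 per iteration.
 */
static void
log_range(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, page_end;

	assert(log_base != NULL);
	if (len == 0)
		return;

	page = addr / LOG_PAGE_SIZE;
	page_end = (addr + len - 1) / LOG_PAGE_SIZE;

	/* Refuse ranges whose last bit would fall outside the bitmap. */
	if (page_end / 8 >= log_size)
		return;

	for (; page <= page_end; page++)
		log_page(log_base, page);
}
```

For example, a 2-byte write at address 4095 touches pages 0 and 1, so it
sets the two lowest bits of the first bitmap byte.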

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-22  2:25       ` Yuanhan Liu
@ 2015-12-22  2:41         ` Xie, Huawei
  2015-12-22  2:55           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-22  2:41 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky



> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> Sent: Tuesday, December 22, 2015 10:26 AM
> To: Xie, Huawei
> Cc: dev@dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> Bernard; Pavel Fedin; Peter Xu
> Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> request
> 
> On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > > +
> > > +	/*
> > > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > +	 * failed when offset is not page size aligned.
> > > +	 */
> > s /will be failed/will fail/
> > mmap will fail when offset is not zero.
I mistook it for the 4KB page size. Please check whether huge page size alignment is enough.
> > Also we only know this workaround is for hugetlbfs. Not sure of
> other
> > tmpfs, so mention hugetlbfs here.
> 
> I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> not enough?
Yes.
> 
> > > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
> 0);
> > > +	if (addr == MAP_FAILED) {
> > > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* TODO: unmap on stop */
> > > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> 
> No, addr is of (void *) type, we should cast it to uint64_t type first,
> before adding it with "off".
> 
> 	--yliu
RTE_PTR_ADD is the DPDK interface for pointer arithmetic operations.
> 
> > > +	dev->log_size = size;
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h
> b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > index b82108d..013cf38 100644
> > > --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> > > @@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx,
> struct VhostUserMsg *);
> > >
> > >  void user_set_protocol_features(struct vhost_device_ctx ctx,
> > >  				uint64_t protocol_features);
> > > +int user_set_log_base(struct vhost_device_ctx ctx, struct
> VhostUserMsg *);
> > >
> > >  int user_get_vring_base(struct vhost_device_ctx, struct
> vhost_vring_state *);
> > >
> >

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  2:40       ` Yuanhan Liu
@ 2015-12-22  2:45         ` Xie, Huawei
  2015-12-22  3:04           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-22  2:45 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On 12/22/2015 10:40 AM, Yuanhan Liu wrote:
> On Mon, Dec 21, 2015 at 03:06:43PM +0000, Xie, Huawei wrote:
>> On 12/17/2015 11:11 AM, Yuanhan Liu wrote:
>>> Introduce vhost_log_write() helper function to log the dirty pages we
>>> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
>>> log is presented by 1 bit.
>>>
>>> Therefore, vhost_log_write() simply finds the right bit for related
>>> page we are gonna change, and set it to 1. dev->log_base denotes the
>>> start of the dirty page bitmap.
>>>
>>> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>> Signed-off-by: Victor Kaplansky <victork@redhat.com
>>> ---
>>>  lib/librte_vhost/rte_virtio_net.h | 29 +++++++++++++++++++++++++++++
>>>  1 file changed, 29 insertions(+)
>>>
>>> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
>>> index 8acee02..5726683 100644
>>> --- a/lib/librte_vhost/rte_virtio_net.h
>>> +++ b/lib/librte_vhost/rte_virtio_net.h
>>> @@ -40,6 +40,7 @@
>>>   */
>>>  
>>>  #include <stdint.h>
>>> +#include <linux/vhost.h>
>>>  #include <linux/virtio_ring.h>
>>>  #include <linux/virtio_net.h>
>>>  #include <sys/eventfd.h>
>>> @@ -59,6 +60,8 @@ struct rte_mbuf;
>>>  /* Backend value set by guest. */
>>>  #define VIRTIO_DEV_STOPPED -1
>>>  
>>> +#define VHOST_LOG_PAGE	4096
>>> +
>>>  
>>>  /* Enum for virtqueue management. */
>>>  enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>> @@ -205,6 +208,32 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
>>>  	return vhost_va;
>>>  }
>>>  
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_page(uint8_t *log_base, uint64_t page)
>>> +{
>>> +	log_base[page / 8] |= 1 << (page % 8);
>>> +}
>>> +
>> Those logging functions are not supposed to be API. Could we move them
>> into an internal header file?
> Agreed. I should have put them into vhost_rxtx.c
>
>>> +static inline void __attribute__((always_inline))
>>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>> +{
>>> +	uint64_t page;
>>> +
>> Before we log, we need memory barrier to make sure updates are in place.
>>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>> +		   !dev->log_base || !len))
>>> +		return;
> Put a memory barrier inside set_features()?
>
> I see no var dependence here, why putting a barrier then? We are
> accessing and modifying same var, doesn't the cache MESI protocol
> will get rid of your concerns?
This fence isn't about the features variable. It is to ensure that
updates to the guest buffer are committed before the logging.
For the IA strong memory model, a compiler barrier is enough. For other,
weaker memory models, a fence is required.
>>> +
>>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>>> +		return;
>>> +
>>> +	page = addr / VHOST_LOG_PAGE;
>>> +	while (page * VHOST_LOG_PAGE < addr + len) {
>> Let us have a page_end var to make the code simpler?
> Could do that.
>
>
>>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>>> +		page += VHOST_LOG_PAGE;
>> page += 1?
> Oops, right.
>
> 	--yliu
>
>>> +	}
>>> +}
>>> +
>>> +
>>>  /**
>>>   *  Disable features in feature_mask. Returns 0 on success.
>>>   */


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request
  2015-12-22  2:41         ` Xie, Huawei
@ 2015-12-22  2:55           ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  2:55 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:41:43AM +0000, Xie, Huawei wrote:
> 
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yuanhan.liu@linux.intel.com]
> > Sent: Tuesday, December 22, 2015 10:26 AM
> > To: Xie, Huawei
> > Cc: dev@dpdk.org; Michael S. Tsirkin; Victor Kaplansky; Iremonger,
> > Bernard; Pavel Fedin; Peter Xu
> > Subject: Re: [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE
> > request
> > 
> > On Mon, Dec 21, 2015 at 03:32:53PM +0000, Xie, Huawei wrote:
> > > > +
> > > > +	/*
> > > > +	 * mmap from 0 to workaround a hugepage mmap bug: mmap will be
> > > > +	 * failed when offset is not page size aligned.
> > > > +	 */
> > > s /will be failed/will fail/
> > > mmap will fail when offset is not zero.
> I mistake for 4KB page size.

Didn't follow you.

> Please check if huge page size align is enough.

It should be. However, I don't think we need to bother with that:
first of all, it only happened on a few specific old kernels. Also,
"off" here is more or less guaranteed to be 0. Lastly, even if it's
not, mmapping from 0 resolves that.

> > > Also we only know this workaround is for hugetlbfs. Not sure of
> > other
> > > tmpfs, so mention hugetlbfs here.
> > 
> > I have already mentioned "to workaround a __hugepage__ mmap bug"; it's
> > not enough?
> Yes.
> > 
> > > > +	addr = mmap(0, size + off, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
> > 0);
> > > > +	if (addr == MAP_FAILED) {
> > > > +		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
> > > > +		return -1;
> > > > +	}
> > > > +
> > > > +	/* TODO: unmap on stop */
> > > > +	dev->log_base = (uint64_t)(uintptr_t)addr + off;
> > > (uint64_t)(uintptr_t)RTE_PTR_ADD(addr, off)?
> > 
> > No, addr is of (void *) type, we should cast it to uint64_t type first,
> > before adding it with "off".
> > 
> > 	--yliu
> RTE_PTR_ADD is the DPDK interface for pointer arithmetic operation.

log_base has type "uint64_t" and RTE_PTR_ADD() returns (void *), so it
won't work here.

	--yliu
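
A small sketch that puts the two spellings discussed above side by side.
PTR_ADD() is a local stand-in with the same shape as DPDK's
RTE_PTR_ADD(), defined here as an assumption so the sketch builds
without DPDK headers; the function names are hypothetical. Once the
(void *) result is cast back through uintptr_t, both forms should yield
the same integer on a 64-bit build, so the difference lies in the type
of the intermediate value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for RTE_PTR_ADD(): integer add, result is a void *. */
#define PTR_ADD(ptr, x) ((void *)((uintptr_t)(ptr) + (x)))

/* Cast to an integer first, then add the offset (as in the patch). */
static uint64_t
base_cast_then_add(void *addr, uint64_t off)
{
	assert(addr != NULL);
	return (uint64_t)(uintptr_t)addr + off;
}

/* Add through the pointer helper, then cast the void * result. */
static uint64_t
base_add_then_cast(void *addr, uint64_t off)
{
	assert(addr != NULL);
	return (uint64_t)(uintptr_t)PTR_ADD(addr, off);
}
```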

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  2:45         ` Xie, Huawei
@ 2015-12-22  3:04           ` Yuanhan Liu
  2015-12-22  7:02             ` Xie, Huawei
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  3:04 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:45:52AM +0000, Xie, Huawei wrote:
> >>> +static inline void __attribute__((always_inline))
> >>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> >>> +{
> >>> +	uint64_t page;
> >>> +
> >> Before we log, we need memory barrier to make sure updates are in place.
> >>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> >>> +		   !dev->log_base || !len))
> >>> +		return;
> > Put a memory barrier inside set_features()?
> >
> > I see no var dependence here, why putting a barrier then? We are
> > accessing and modifying same var, doesn't the cache MESI protocol
> > will get rid of your concerns?
> This fence isn't about feature var. It is to ensure that updates to the
> guest buffer are committed before the logging.

Oh, I thought you were talking about the concurrent access to and
modification of the "dev->features" field that you mentioned in V1.

> For IA strong memory model, compiler barrier is enough. For other weak
> memory model, fence is required.
> >>> +
> >>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> >>> +		return;

So I should put an "rte_mb()" __here__?

	--yliu
> >>> +
> >>> +	page = addr / VHOST_LOG_PAGE;
> >>> +	while (page * VHOST_LOG_PAGE < addr + len) {
> >> Let us have a page_end var to make the code simpler?
> > Could do that.
> >
> >
> >>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
> >>> +		page += VHOST_LOG_PAGE;
> >> page += 1?
> > Oops, right.
> >
> > 	--yliu
> >
> >>> +	}
> >>> +}
> >>> +
> >>> +
> >>>  /**
> >>>   *  Disable features in feature_mask. Returns 0 on success.
> >>>   */
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-17  3:11   ` [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
  2015-12-21 15:06     ` Xie, Huawei
@ 2015-12-22  5:11     ` Peter Xu
  2015-12-22  6:09       ` Yuanhan Liu
  1 sibling, 1 reply; 98+ messages in thread
From: Peter Xu @ 2015-12-22  5:11 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> +{
> +	uint64_t page;
> +
> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> +		   !dev->log_base || !len))
> +		return;
> +
> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))

Should it be "<="?

Peter

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  5:11     ` Peter Xu
@ 2015-12-22  6:09       ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  6:09 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 01:11:02PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:57AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> > +{
> > +	uint64_t page;
> > +
> > +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
> > +		   !dev->log_base || !len))
> > +		return;
> > +
> > +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
> 
> Should it be "<="?

Right, thanks for catching it.

	--yliu
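
A minimal sketch of why the comparison matters, assuming log_size counts
bytes in the bitmap as in the patch. The function names are hypothetical;
the two predicates differ only in "<" versus "<=".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096	/* same value as VHOST_LOG_PAGE in the patch */

/* Index of the bitmap byte holding the bit for the last byte of the range. */
static uint64_t
last_log_byte(uint64_t addr, uint64_t len)
{
	assert(len > 0);
	return (addr + len - 1) / LOG_PAGE_SIZE / 8;
}

/* The v2 check: skips logging only when log_size < index. */
static bool
rejected_with_lt(uint64_t log_size, uint64_t addr, uint64_t len)
{
	return log_size < last_log_byte(addr, len);
}

/* The check suggested above: skips logging when log_size <= index. */
static bool
rejected_with_le(uint64_t log_size, uint64_t addr, uint64_t len)
{
	return log_size <= last_log_byte(addr, len);
}
```

With a 1-byte bitmap (pages 0 to 7), a 1-byte write at address 32768
lands in page 8, whose bit lives in byte index 1. The "<" form lets that
through and would touch one byte past the bitmap; the "<=" form rejects
it.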

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/6] vhost: log used vring changes
  2015-12-17  3:11   ` [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
@ 2015-12-22  6:55     ` Peter Xu
  2015-12-22  7:07       ` Xie, Huawei
  2015-12-22  7:13       ` Yuanhan Liu
  0 siblings, 2 replies; 98+ messages in thread
From: Peter Xu @ 2015-12-22  6:55 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> +static inline void __attribute__((always_inline))
> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> +		     uint64_t offset, uint64_t len)
> +{

One optional thing: I find the helper function name here a little
confusing, for these reasons:

1. It sounds more like "logging all the vrings we used"; however,
   as I understand it, here we are logging dirty pages of guest
   memory. In other words, it has almost nothing to do directly with
   the vring (although many vring ops might call this function, we are
   only passing [buf, len] pairs).

2. It may also make people think of "vring_used", which is part of
   the virtio protocol. However, it does not mean that either.

I would suggest a less confusing name, like vhost_log_dirty_range()
or anything else that avoids those keywords.

> +	uint64_t addr;
> +
> +	addr = vq->log_guest_addr + offset;
> +	vhost_log_write(dev, addr, len);

Optional too: since addr is only used once, would it be cleaner to
use one line? Like:

vhost_log_write(dev, vq->log_guest_addr + offset, len);

> +}
> +
>  /**
>   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
>   * be received from the physical port or from another virtio device. A packet
> @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		uint32_t offset = 0, vb_offset = 0;
>  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
>  		uint8_t hdr = 0, uncompleted_pkt = 0;
> +		uint16_t idx;
>  
>  		/* Get descriptor from available ring */
>  		desc = &vq->desc[head[packet_success]];
> @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  		}
>  
>  		/* Update used ring with desc information */
> -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> -							head[packet_success];
> +		idx = res_cur_idx & (vq->size - 1);
> +		vq->used->ring[idx].id = head[packet_success];
>  
>  		/* Drop the packet if it is uncompleted */
>  		if (unlikely(uncompleted_pkt == 1))
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							vq->vhost_hlen;
> +			vq->used->ring[idx].len = vq->vhost_hlen;
>  		else
> -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> -							pkt_len + vq->vhost_hlen;
> +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> +
> +		vhost_log_used_vring(dev, vq,
> +			offsetof(struct vring_used, ring[idx]),
> +			sizeof(vq->used->ring[idx]));

I have a question here:

I see that we log changes when we mark the used vring. Do we need
to log the buffer copy in rte_memcpy() too? I am not sure whether I
understand it correctly, but this seems to be part of the DPDK API
ops that deliver data to the guest (from, e.g., OVS?); when we do
rte_memcpy(), we seem to be modifying guest memory too. Am I wrong?

Peter

>  
>  		res_cur_idx++;
>  		packet_success++;
> @@ -236,6 +249,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  
>  	*(volatile uint16_t *)&vq->used->idx += count;
>  	vq->last_used_idx = res_end_idx;
> +	vhost_log_used_vring(dev, vq,
> +		offsetof(struct vring_used, idx),
> +		sizeof(vq->used->idx));
>  
>  	/* flush used->idx update before we read avail->flags. */
>  	rte_mb();
> @@ -265,6 +281,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	uint32_t seg_avail;
>  	uint32_t vb_avail;
>  	uint32_t cpy_len, entry_len;
> +	uint16_t idx;
>  
>  	if (pkt == NULL)
>  		return 0;
> @@ -302,16 +319,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  	entry_len = vq->vhost_hlen;
>  
>  	if (vb_avail == 0) {
> -		uint32_t desc_idx =
> -			vq->buf_vec[vec_idx].desc_idx;
> +		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
> +
> +		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
> +			idx = cur_idx & (vq->size - 1);
>  
> -		if ((vq->desc[desc_idx].flags
> -			& VRING_DESC_F_NEXT) == 0) {
>  			/* Update used ring with desc information */
> -			vq->used->ring[cur_idx & (vq->size - 1)].id
> -				= vq->buf_vec[vec_idx].desc_idx;
> -			vq->used->ring[cur_idx & (vq->size - 1)].len
> -				= entry_len;
> +			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
> +			vq->used->ring[idx].len = entry_len;
> +
> +			vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  
>  			entry_len = 0;
>  			cur_idx++;
> @@ -354,10 +373,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
>  				VRING_DESC_F_NEXT) == 0) {
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_len = 0;
>  				cur_idx++;
>  				entry_success++;
> @@ -390,16 +412,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  
>  					if ((vq->desc[desc_idx].flags &
>  						VRING_DESC_F_NEXT) == 0) {
> -						uint16_t wrapped_idx =
> -							cur_idx & (vq->size - 1);
> +						idx = cur_idx & (vq->size - 1);
>  						/*
>  						 * Update used ring with the
>  						 * descriptor information
>  						 */
> -						vq->used->ring[wrapped_idx].id
> +						vq->used->ring[idx].id
>  							= desc_idx;
> -						vq->used->ring[wrapped_idx].len
> +						vq->used->ring[idx].len
>  							= entry_len;
> +						vhost_log_used_vring(dev, vq,
> +							offsetof(struct vring_used, ring[idx]),
> +							sizeof(vq->used->ring[idx]));
>  						entry_success++;
>  						entry_len = 0;
>  						cur_idx++;
> @@ -422,10 +446,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
>  				 * This whole packet completes.
>  				 */
>  				/* Update used ring with desc information */
> -				vq->used->ring[cur_idx & (vq->size - 1)].id
> +				idx = cur_idx & (vq->size - 1);
> +				vq->used->ring[idx].id
>  					= vq->buf_vec[vec_idx].desc_idx;
> -				vq->used->ring[cur_idx & (vq->size - 1)].len
> -					= entry_len;
> +				vq->used->ring[idx].len = entry_len;
> +				vhost_log_used_vring(dev, vq,
> +					offsetof(struct vring_used, ring[idx]),
> +					sizeof(vq->used->ring[idx]));
>  				entry_success++;
>  				break;
>  			}
> @@ -653,6 +680,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  		/* Update used index buffer information. */
>  		vq->used->ring[used_idx].id = head[entry_success];
>  		vq->used->ring[used_idx].len = 0;
> +		vhost_log_used_vring(dev, vq,
> +				offsetof(struct vring_used, ring[used_idx]),
> +				sizeof(vq->used->ring[used_idx]));
>  
>  		/* Allocate an mbuf and populate the structure. */
>  		m = rte_pktmbuf_alloc(mbuf_pool);
> @@ -773,6 +803,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  
>  	rte_compiler_barrier();
>  	vq->used->idx += entry_success;
> +	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
> +			sizeof(vq->used->idx));
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index de78a0f..03044f6 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
>  		return -1;
>  	}
>  
> +	vq->log_guest_addr = addr->log_guest_addr;
> +
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
>  			dev->device_fh, vq->desc);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
>  			dev->device_fh, vq->avail);
>  	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
>  			dev->device_fh, vq->used);
> +	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
> +			dev->device_fh, vq->log_guest_addr);
>  
>  	return 0;
>  }
> -- 
> 1.9.0
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 2/6] vhost: introduce vhost_log_write
  2015-12-22  3:04           ` Yuanhan Liu
@ 2015-12-22  7:02             ` Xie, Huawei
  0 siblings, 0 replies; 98+ messages in thread
From: Xie, Huawei @ 2015-12-22  7:02 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On 12/22/2015 11:03 AM, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:45:52AM +0000, Xie, Huawei wrote:
>>>>> +static inline void __attribute__((always_inline))
>>>>> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
>>>>> +{
>>>>> +	uint64_t page;
>>>>> +
>>>> Before we log, we need memory barrier to make sure updates are in place.
>>>>> +	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
>>>>> +		   !dev->log_base || !len))
>>>>> +		return;
>>> Put a memory barrier inside set_features()?
>>>
>>> I see no var dependence here, why putting a barrier then? We are
>>> accessing and modifying same var, doesn't the cache MESI protocol
>>> will get rid of your concerns?
>> This fence isn't about feature var. It is to ensure that updates to the
>> guest buffer are committed before the logging.
> Oh.., I was thinking you were talking about the "dev->features" field
> concurrent access and modify you mentioned from V1.
>
>> For IA strong memory model, compiler barrier is enough. For other weak
>> memory model, fence is required.
>>>>> +
>>>>> +	if (unlikely(dev->log_size < ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
>>>>> +		return;
> So that I should put a "rte_mb()" __here__?
>
> 	--yliu

I see that we already have an arch-dependent version, rte_smp_wmb()
        --huawei
>>>>> +
>>>>> +	page = addr / VHOST_LOG_PAGE;
>>>>> +	while (page * VHOST_LOG_PAGE < addr + len) {
>>>> Let us have a page_end var to make the code simpler?
>>> Could do that.
>>>
>>>
>>>>> +		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
>>>>> +		page += VHOST_LOG_PAGE;
>>>> page += 1?
>>> Oops, right.
>>>
>>> 	--yliu
>>>
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +
>>>>>  /**
>>>>>   *  Disable features in feature_mask. Returns 0 on success.
>>>>>   */
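
A minimal sketch of the ordering discussed above: write the guest
buffer, issue a barrier, then set the dirty bits. The C11
atomic_thread_fence() call is a portable stand-in chosen for this
sketch, not the rte_smp_wmb() mentioned in the thread, and
copy_then_log() with its flat guest_mem array is a hypothetical
simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAGE_SIZE 4096	/* same value as VHOST_LOG_PAGE in the patch */

/*
 * Copy len bytes to guest_mem + gpa, then mark every page the copy
 * touched as dirty in the bitmap at log_base.
 */
static void
copy_then_log(uint8_t *guest_mem, uint8_t *log_base, uint64_t gpa,
	      const void *src, size_t len)
{
	uint64_t page, page_end;

	assert(guest_mem != NULL && log_base != NULL && len > 0);

	memcpy(guest_mem + gpa, src, len);

	/*
	 * Stand-in for a write barrier: keep the buffer stores ahead of
	 * the bitmap stores that follow.
	 */
	atomic_thread_fence(memory_order_release);

	page_end = (gpa + len - 1) / LOG_PAGE_SIZE;
	for (page = gpa / LOG_PAGE_SIZE; page <= page_end; page++)
		log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}
```

The point of the order is that whoever reads a set bit and then copies
the page should not pick up stale buffer contents.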


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  6:55     ` Peter Xu
@ 2015-12-22  7:07       ` Xie, Huawei
  2015-12-22  7:59         ` Peter Xu
  2015-12-22  7:13       ` Yuanhan Liu
  1 sibling, 1 reply; 98+ messages in thread
From: Xie, Huawei @ 2015-12-22  7:07 UTC (permalink / raw)
  To: Peter Xu, Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 12/22/2015 2:56 PM, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
>> +static inline void __attribute__((always_inline))
>> +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
>> +		     uint64_t offset, uint64_t len)
>> +{
[...]
> Got a question here:
>
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?
>
> Peter

Desc buffer logging wasn't included in v1, but it is in patch 4 of
this patch set, and it is actually the major work in vhost live
migration.
    --huawei

[...]


^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  6:55     ` Peter Xu
  2015-12-22  7:07       ` Xie, Huawei
@ 2015-12-22  7:13       ` Yuanhan Liu
  2015-12-22  8:01         ` Peter Xu
  1 sibling, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  7:13 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > +static inline void __attribute__((always_inline))
> > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > +		     uint64_t offset, uint64_t len)
> > +{
> 
> One thing optional: I feel it a little bit confusing regarding to
> the helper function name here, for the reasons:
> 
> 1. It more sounds like "logging all the vrings we used", however,
>    what I understand is that, here we are logging dirty pages for
>    guest memory. Or say, there is merely nothing to do directly with
>    vring (although many vring ops might call this function, we are
>    only passing [buf, len] pairs).
> 
> 2. It may also let people think of "vring_used", which is part of
>    virtio protocol. However, it does not mean it too.

Yes, it does. Here log_guest_addr is set to the physical address of
vring_used (see vhost_virtqueue_set_addr() in QEMU):

    static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
                                        struct vhost_virtqueue *vq,
                                        unsigned idx, bool enable_log)
    {
        struct vhost_vring_addr addr = {
            .index = idx,
            .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
            .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
            .used_user_addr = (uint64_t)(unsigned long)vq->used,
==>         .log_guest_addr = vq->used_phys,
            .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
        };
So, this function does log changes to used vring.

> 
> I would suggest a better name without confusion, like
> vhost_log_dirty_range() or anything else to avoid those keywords.
> 
> > +	uint64_t addr;
> > +
> > +	addr = vq->log_guest_addr + offset;
> > +	vhost_log_write(dev, addr, len);
> 
> Optional too: since addr is only used once, would it cleaner using
> one line? Like:

Yes, it is. Will fix it.

> 
> vhost_log_write(dev, vq->log_guest_addr + offset, len);
> 
> > +}
> > +
> >  /**
> >   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
> >   * be received from the physical port or from another virtio device. A packet
> > @@ -129,6 +139,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  		uint32_t offset = 0, vb_offset = 0;
> >  		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
> >  		uint8_t hdr = 0, uncompleted_pkt = 0;
> > +		uint16_t idx;
> >  
> >  		/* Get descriptor from available ring */
> >  		desc = &vq->desc[head[packet_success]];
> > @@ -200,16 +211,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> >  		}
> >  
> >  		/* Update used ring with desc information */
> > -		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
> > -							head[packet_success];
> > +		idx = res_cur_idx & (vq->size - 1);
> > +		vq->used->ring[idx].id = head[packet_success];
> >  
> >  		/* Drop the packet if it is uncompleted */
> >  		if (unlikely(uncompleted_pkt == 1))
> > -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -							vq->vhost_hlen;
> > +			vq->used->ring[idx].len = vq->vhost_hlen;
> >  		else
> > -			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
> > -							pkt_len + vq->vhost_hlen;
> > +			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
> > +
> > +		vhost_log_used_vring(dev, vq,
> > +			offsetof(struct vring_used, ring[idx]),
> > +			sizeof(vq->used->ring[idx]));
> 
> Got a question here:
> 
> I see that we are logging down changes when we are marking
> used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> I am not sure whether I understand it correctly, it seems that this
> is part of DPDK API ops to deliver data to the guest (from, e.g.,
> OVS?), when we do rte_memcpy(), we seems to be modifying guest
> memory too. Am I wrong?

It's done in the next patch.

	--yliu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  7:07       ` Xie, Huawei
@ 2015-12-22  7:59         ` Peter Xu
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Xu @ 2015-12-22  7:59 UTC (permalink / raw)
  To: Xie, Huawei; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 07:07:25AM +0000, Xie, Huawei wrote:
> On 12/22/2015 2:56 PM, Peter Xu wrote:
> > Got a question here:
> >
> > I see that we are logging down changes when we are marking
> > used_vring. Do we need to log down buffer copy in rte_memcpy() too?
> > I am not sure whether I understand it correctly, it seems that this
> > is part of DPDK API ops to deliver data to the guest (from, e.g.,
> > OVS?), when we do rte_memcpy(), we seems to be modifying guest
> > memory too. Am I wrong?
> >
> > Peter
> 
> desc buffer logging isn't included in v1, but in the patch 4 of this
> patch set, and actually it is the major work in vhost live migration.

Yes, it is. Thanks for pointing that out.

Peter

>     --huawei
> 
> [...]
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 3/6] vhost: log used vring changes
  2015-12-22  7:13       ` Yuanhan Liu
@ 2015-12-22  8:01         ` Peter Xu
  0 siblings, 0 replies; 98+ messages in thread
From: Peter Xu @ 2015-12-22  8:01 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 03:13:49PM +0800, Yuanhan Liu wrote:
> On Tue, Dec 22, 2015 at 02:55:52PM +0800, Peter Xu wrote:
> > On Thu, Dec 17, 2015 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > +static inline void __attribute__((always_inline))
> > > +vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > > +		     uint64_t offset, uint64_t len)
> > > +{
> > 
> > One thing optional: I feel it a little bit confusing regarding to
> > the helper function name here, for the reasons:
> > 
> > 1. It more sounds like "logging all the vrings we used", however,
> >    what I understand is that, here we are logging dirty pages for
> >    guest memory. Or say, there is merely nothing to do directly with
> >    vring (although many vring ops might call this function, we are
> >    only passing [buf, len] pairs).
> > 
> > 2. It may also let people think of "vring_used", which is part of
> >    virtio protocol. However, it does not mean it too.
> 
> Yes, it does. Here log_guest_addr is set to physical address of
> vring_used. (check code at vhost_virtqueue_set_addr() of qemu).
> 
>      627 static int vhost_virtqueue_set_addr(struct vhost_dev *dev,
>      628                                     struct vhost_virtqueue *vq,
>      629                                     unsigned idx, bool enable_log)
>      630 {
>      631     struct vhost_vring_addr addr = {
>      632         .index = idx,
>      633         .desc_user_addr = (uint64_t)(unsigned long)vq->desc,
>      634         .avail_user_addr = (uint64_t)(unsigned long)vq->avail,
>      635         .used_user_addr = (uint64_t)(unsigned long)vq->used,
> ==>  636         .log_guest_addr = vq->used_phys,
>      637         .flags = enable_log ? (1 << VHOST_VRING_F_LOG) : 0,
>      638     };
> 
> So, this function does log changes to used vring.

Yes. I have made a mistake.

Thanks!
Peter

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-17  3:12   ` [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2015-12-22  8:11     ` Peter Xu
  2015-12-22  8:21       ` Yuanhan Liu
  2015-12-22  8:24       ` Pavel Fedin
  0 siblings, 2 replies; 98+ messages in thread
From: Peter Xu @ 2015-12-22  8:11 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> It's actually a feature already enabled in Linux kernel. What we need to
> do is simply to claim that we support such feature, and nothing else.
> 
> With that, the guest will send GARP messages after live migration.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/virtio-net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \

Do we really need this? I can understand the guest declaring the
VIRTIO_NET_F_GUEST_ANNOUNCE flag: with it, the guest itself handles
the announcement after migration. But what does it mean when the
flag is declared by a vhost-user backend?

In vhost-user.txt (in QEMU repo docs/specs/), the only place that
mentioned this is SEND_RARP:

 * VHOST_USER_SEND_RARP

      Id: 19
      Equivalent ioctl: N/A
      Master payload: u64

      Ask vhost user backend to broadcast a fake RARP to notify the migration
      is terminated for guest that does not support GUEST_ANNOUNCE.
	  ...

Here, it mentions GUEST_ANNOUNCE only because, when the guest supports
it, we do not need to send SEND_RARP to the vhost-user backend. It
never explains what it means for vhost-user to declare this flag...

Thanks.
Peter

>  				(VHOST_SUPPORTS_MQ)            | \
>  				(1ULL << VIRTIO_F_VERSION_1)   | \
>  				(1ULL << VHOST_F_LOG_ALL)      | \
> -- 
> 1.9.0
> 

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-22  8:11     ` Peter Xu
@ 2015-12-22  8:21       ` Yuanhan Liu
  2015-12-22  8:24       ` Pavel Fedin
  1 sibling, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2015-12-22  8:21 UTC (permalink / raw)
  To: Peter Xu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Tue, Dec 22, 2015 at 04:11:08PM +0800, Peter Xu wrote:
> On Thu, Dec 17, 2015 at 11:12:00AM +0800, Yuanhan Liu wrote:
> > It's actually a feature already enabled in Linux kernel. What we need to
> > do is simply to claim that we support such feature, and nothing else.
> > 
> > With that, the guest will send GARP messages after live migration.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > ---
> >  lib/librte_vhost/virtio-net.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend? What does it
> mean?
> 
> In vhost-user.txt (in QEMU repo docs/specs/), the only place that
> mentioned this is SEND_RARP:
> 
>  * VHOST_USER_SEND_RARP
> 
>       Id: 19
>       Equivalent ioctl: N/A
>       Master payload: u64
> 
>       Ask vhost user backend to broadcast a fake RARP to notify the migration
>       is terminated for guest that does not support GUEST_ANNOUNCE.
> 	  ...
> 
> Here, it mention the GUEST_ANNOUNCE since when guest support this,
> we do not need to send SEND_RARP to vhost-user backend again. It
> never explain what does it mean when vhost-user declaring to have
> this flag...

The actual work is done in QEMU and the guest kernel. When this flag is
enabled, QEMU sends a config interrupt to the guest kernel after live
migration (see QEMU code virtio_net_load()). The guest kernel generates
the GARP once it receives the interrupt (see the kernel code
virtnet_config_changed_work()).

Therefore, we could simply claim that we support VIRTIO_NET_F_GUEST_ANNOUNCE
feature and do nothing here.

	--yliu

^ permalink raw reply	[flat|nested] 98+ messages in thread

* Re: [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature
  2015-12-22  8:11     ` Peter Xu
  2015-12-22  8:21       ` Yuanhan Liu
@ 2015-12-22  8:24       ` Pavel Fedin
  1 sibling, 0 replies; 98+ messages in thread
From: Pavel Fedin @ 2015-12-22  8:24 UTC (permalink / raw)
  To: 'Peter Xu', 'Yuanhan Liu'
  Cc: dev, 'Victor Kaplansky', 'Michael S. Tsirkin'

 Hello!

> > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > index 03044f6..0ba5045 100644
> > --- a/lib/librte_vhost/virtio-net.c
> > +++ b/lib/librte_vhost/virtio-net.c
> > @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
> >  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> >  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> 
> Do we really need this? I can understand when guest declare with
> this VIRTIO_NET_F_GUEST_ANNOUNCE flag. With that, guest itself will
> handle the announcement after migration. However, how could I
> understand if it's declared by a vhost-user backend?

 I guess the documentation is unclear. This is due to the way qemu works. It queries the vhost-user backend for its features, then offers them to the guest. The guest then responds with the features it supports, FROM THE SUGGESTED SET. So, if the backend does not claim to support this feature, qemu will not offer it to the guest, and therefore the guest will not try to activate it.
 I think this is done because this feature is only useful for migration. If the vhost-user backend does not support migration, it needs neither VHOST_USER_SEND_RARP nor guest-side announce.
 Actually, I was thinking about patching qemu once, but... The changeset seemed too complicated, and I imagined the situation described in the above sentence, so I decided to abandon it.
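
The negotiation described above can be sketched as plain bitmask arithmetic. This is only an illustration, not code from QEMU or DPDK: the function names are hypothetical, and the only value taken from elsewhere is bit 21, which is VIRTIO_NET_F_GUEST_ANNOUNCE in the virtio specification.

```c
#include <stdint.h>

/* Bit number of VIRTIO_NET_F_GUEST_ANNOUNCE in the virtio spec. */
#define F_GUEST_ANNOUNCE_BIT 21

/* QEMU can only offer what the backend has claimed to support. */
static uint64_t offered_features(uint64_t qemu_features,
				 uint64_t backend_features)
{
	return qemu_features & backend_features;
}

/* The guest picks from the offered set; anything else is dropped. */
static uint64_t negotiated_features(uint64_t offered, uint64_t guest_wanted)
{
	return offered & guest_wanted;
}

/* Returns 1 if the guest will handle the announcement itself. */
static int guest_announce_active(uint64_t negotiated)
{
	return (negotiated & (1ULL << F_GUEST_ANNOUNCE_BIT)) != 0;
}
```

With this model, a backend that leaves the bit out of its feature set makes guest_announce_active() return 0 no matter what the guest asks for, which is why the backend has to claim the feature even though it does no work for it.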

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v3 0/8] vhost-user live migration support
  2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
                     ` (7 preceding siblings ...)
  2015-12-21  8:17   ` Pavel Fedin
@ 2016-01-29  4:57   ` Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
                       ` (9 more replies)
  8 siblings, 10 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

This patch set adds the vhost-user live migration support.

The major task behind that is to log pages we touched during
live migration, including used vring and desc buffer. So, this
patch set is basically about adding vhost log support, and
using it.

Another important thing is that you need to notify the switches
about the VM location change after migration is done. The
GUEST_ANNOUNCE feature is for that: the guest sends a GARP message
after migration. For older kernels (<= v3.4) without GUEST_ANNOUNCE
support, we construct and broadcast a RARP message, with the mac
address from the VHOST_USER_SEND_RARP payload.
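
As a rough illustration of the RARP fallback, here is a minimal sketch of building a broadcast RARP frame for a given MAC address. The field layout follows RFC 903 and the Ethernet/ARP header formats; the helper names (put_be16, build_rarp) are hypothetical, and this is not claimed to match what patch 6 does byte for byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN        6
#define RARP_FRAME_LEN 42	/* 14-byte Ethernet header + 28-byte ARP body */

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/* Fill frame with a broadcast RARP request for mac; returns its length. */
static size_t build_rarp(uint8_t frame[RARP_FRAME_LEN],
			 const uint8_t mac[MAC_LEN])
{
	uint8_t *p = frame;

	/* Ethernet header */
	memset(p, 0xff, MAC_LEN); p += MAC_LEN;	/* dst: broadcast */
	memcpy(p, mac, MAC_LEN);  p += MAC_LEN;	/* src: the VM's MAC */
	put_be16(p, 0x8035);      p += 2;	/* ethertype: RARP */

	/* ARP body */
	put_be16(p, 1);           p += 2;	/* hardware type: Ethernet */
	put_be16(p, 0x0800);      p += 2;	/* protocol type: IPv4 */
	*p++ = MAC_LEN;				/* hardware address length */
	*p++ = 4;				/* protocol address length */
	put_be16(p, 3);           p += 2;	/* opcode: reverse request */
	memcpy(p, mac, MAC_LEN);  p += MAC_LEN;	/* sender hardware address */
	memset(p, 0, 4);          p += 4;	/* sender IP: unknown */
	memcpy(p, mac, MAC_LEN);  p += MAC_LEN;	/* target hardware address */
	memset(p, 0, 4);          p += 4;	/* target IP: unknown */

	return (size_t)(p - frame);
}
```

The switches only need to see a frame with the VM's MAC as the source address on the new port, so the IP fields can stay zero in this sketch.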

Patchset
========
- Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
  the dirty memory bitmap is.
    
- Patch 2 introduces a vhost_log_write() helper function to log
  pages we are gonna change.

- Patch 3 logs changes we made to used vring.

- Patch 4 logs changes we made to vring desc buffer.

- Patches 5 and 7 add some feature bits related to live migration.

- Patch 6 does the RARP construction and broadcast job.


A simple test guide (on same host)
==================================

The following test is based on OVS + DPDK (check [0] for
how to setup OVS + DPDK):

    [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk

Here is the rough test guide:

1. start ovs-vswitchd

2. Add two OVS vhost-user ports, say vhost0 and vhost1

3. Start a VM1 to connect to vhost0. Here is my example:

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3333,server,nowait -curses

4. run "ping $host" inside VM1

5. Start VM2 to connect to vhost1, and mark it as the target
   of live migration (by adding the -incoming tcp:0:4444 option)

   $ $QEMU -enable-kvm -m 1024 -smp 4 \
       -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
       -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
       -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
       -object memory-backend-file,id=mem,size=1024M,mem-path=$HOME/hugetlbfs,share=on \
       -numa node,memdev=mem -mem-prealloc \
       -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
       -hda fc-19-i386.img \
       -monitor telnet::3334,server,nowait -curses \
       -incoming tcp:0:4444 

6. connect to VM1 monitor, and start migration:

   > migrate tcp:0:4444

7. After a while, you will find that VM1 has been migrated to VM2,
   and the "ping" command continues running, perfectly.




---
Yuanhan Liu (8):
  vhost: handle VHOST_USER_SET_LOG_BASE request
  vhost: introduce vhost_log_write
  vhost: log used vring changes
  vhost: log vring desc buffer changes
  vhost: claim that we support GUEST_ANNOUNCE feature
  vhost: handle VHOST_USER_SEND_RARP request
  vhost: enable log_shmfd protocol feature
  vhost: remove duplicate header include

 doc/guides/rel_notes/release_2_3.rst          |   2 +
 lib/librte_vhost/rte_virtio_net.h             |   9 +-
 lib/librte_vhost/vhost_rxtx.c                 | 114 +++++++++++++----
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  11 +-
 lib/librte_vhost/vhost_user/vhost-net-user.h  |   7 ++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 174 +++++++++++++++++++++++++-
 lib/librte_vhost/vhost_user/virtio-net-user.h |   8 +-
 lib/librte_vhost/virtio-net.c                 |   5 +
 8 files changed, 299 insertions(+), 31 deletions(-)

-- 
1.9.0

^ permalink raw reply	[flat|nested] 98+ messages in thread

* [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
                       ` (8 subsequent siblings)
  9 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

VHOST_USER_SET_LOG_BASE request is used to tell the backend (dpdk
vhost-user) where we should log dirty pages, and how big the log
buffer is.

This request introduces a new payload:

    typedef struct VhostUserLog {
            uint64_t mmap_size;
            uint64_t mmap_offset;
    } VhostUserLog;

Also, a fd is delivered from QEMU by ancillary data.

With that info, an area of memory is mmapped and assigned
to dev->log_base, for logging dirty pages.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 lib/librte_vhost/rte_virtio_net.h             |  4 ++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |  7 ++--
 lib/librte_vhost/vhost_user/vhost-net-user.h  |  6 ++++
 lib/librte_vhost/vhost_user/virtio-net-user.c | 48 +++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |  1 +
 5 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 10dcb90..8acee02 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -129,7 +129,9 @@ struct virtio_net {
 	char			ifname[IF_NAME_SZ];	/**< Name of the tap device or socket path. */
 	uint32_t		virt_qp_nb;	/**< number of queue pair we have allocated */
 	void			*priv;		/**< private context */
-	uint64_t		reserved[64];	/**< Reserve some spaces for future extension. */
+	uint64_t		log_size;	/**< Size of log area */
+	uint64_t		log_base;	/**< Where dirty pages are logged */
+	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 8b7a448..32ad6f6 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -388,9 +388,12 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		break;
 
 	case VHOST_USER_SET_LOG_BASE:
-		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
-		break;
+		user_set_log_base(ctx, &msg);
 
+		/* it needs a reply */
+		msg.size = sizeof(msg.payload.u64);
+		send_vhost_message(connfd, &msg);
+		break;
 	case VHOST_USER_SET_LOG_FD:
 		close(msg.fds[0]);
 		RTE_LOG(INFO, VHOST_CONFIG, "not implemented.\n");
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 38637cc..6d252a3 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -83,6 +83,11 @@ typedef struct VhostUserMemory {
 	VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
 } VhostUserMemory;
 
+typedef struct VhostUserLog {
+	uint64_t mmap_size;
+	uint64_t mmap_offset;
+} VhostUserLog;
+
 typedef struct VhostUserMsg {
 	VhostUserRequest request;
 
@@ -97,6 +102,7 @@ typedef struct VhostUserMsg {
 		struct vhost_vring_state state;
 		struct vhost_vring_addr addr;
 		VhostUserMemory memory;
+		VhostUserLog    log;
 	} payload;
 	int fds[VHOST_MEMORY_MAX_NREGIONS];
 } __attribute((packed)) VhostUserMsg;
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 2934d1c..0f3b163 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -365,3 +365,51 @@ user_set_protocol_features(struct vhost_device_ctx ctx,
 
 	dev->protocol_features = protocol_features;
 }
+
+int
+user_set_log_base(struct vhost_device_ctx ctx,
+		 struct VhostUserMsg *msg)
+{
+	struct virtio_net *dev;
+	int fd = msg->fds[0];
+	uint64_t size, off;
+	void *addr;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
+
+	if (fd < 0) {
+		RTE_LOG(ERR, VHOST_CONFIG, "invalid log fd: %d\n", fd);
+		return -1;
+	}
+
+	if (msg->size != sizeof(VhostUserLog)) {
+		RTE_LOG(ERR, VHOST_CONFIG,
+			"invalid log base msg size: %"PRId32" != %d\n",
+			msg->size, (int)sizeof(VhostUserLog));
+		return -1;
+	}
+
+	size = msg->payload.log.mmap_size;
+	off  = msg->payload.log.mmap_offset;
+	RTE_LOG(INFO, VHOST_CONFIG,
+		"log mmap size: %"PRId64", offset: %"PRId64"\n",
+		size, off);
+
+	/*
+	 * mmap from 0 to workaround a hugepage mmap bug: mmap will
+	 * fail when offset is not page size aligned.
+	 */
+	addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		RTE_LOG(ERR, VHOST_CONFIG, "mmap log base failed!\n");
+		return -1;
+	}
+
+	/* TODO: unmap on stop */
+	dev->log_base = (uint64_t)(uintptr_t)addr + off;
+	dev->log_size = size;
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index b82108d..013cf38 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -49,6 +49,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
+int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-02-19 14:26       ` Thomas Monjalon
  2016-01-29  4:57     ` [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
                       ` (7 subsequent siblings)
  9 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Introduce a vhost_log_write() helper function to log the dirty pages
we touched. Page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
page is represented by 1 bit.

Therefore, vhost_log_write() simply finds the right bit for each page
we are going to change, and sets it to 1. dev->log_base denotes the
start of the dirty page bitmap.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---

v3: - move the two functions inside vhost_rxtx.c

    - add a memory barrier before logging, to make sure guest memory
      updates are committed before logging.

    - Fix an off-by-one bug
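
The bitmap arithmetic described in the commit message can be tried out with a standalone sketch like the one below. It is a simplified variant, not the patch code: log_page() and log_write() are hypothetical names, the bitmap and its size are passed in explicitly, and the feature-bit check and the rte_smp_wmb() barrier from the patch are left out because they depend on DPDK.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096u	/* same value as VHOST_LOG_PAGE in the patch */

/* Set the bit for one page: byte page / 8, bit page % 8. */
static void log_page(uint8_t *bitmap, uint64_t page)
{
	bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapped by [addr, addr + len) as dirty.
 * Returns the number of pages marked, or 0 if nothing was logged
 * (empty range, no bitmap, or range beyond bitmap_size bytes).
 */
static uint64_t log_write(uint8_t *bitmap, uint64_t bitmap_size,
			  uint64_t addr, uint64_t len)
{
	uint64_t first, last, page;

	if (bitmap == NULL || len == 0)
		return 0;

	first = addr / LOG_PAGE_SIZE;
	last = (addr + len - 1) / LOG_PAGE_SIZE;	/* last byte, inclusive */
	if (last / 8 >= bitmap_size)
		return 0;

	for (page = first; page <= last; page++)
		log_page(bitmap, page);

	return last - first + 1;
}
```

Using the last byte of the range (addr + len - 1) for the bounds check is what keeps a write that ends exactly on a page boundary from being counted against the following page.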
---
 lib/librte_vhost/rte_virtio_net.h |  2 ++
 lib/librte_vhost/vhost_rxtx.c     | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 8acee02..2c891ae 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -40,6 +40,7 @@
  */
 
 #include <stdint.h>
+#include <linux/vhost.h>
 #include <linux/virtio_ring.h>
 #include <linux/virtio_net.h>
 #include <sys/eventfd.h>
@@ -205,6 +206,7 @@ gpa_to_vva(struct virtio_net *dev, uint64_t guest_pa)
 	return vhost_va;
 }
 
+
 /**
  *  Disable features in feature_mask. Returns 0 on success.
  */
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index bbf3fac..d485364 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -42,6 +42,35 @@
 #include "vhost-net.h"
 
 #define MAX_PKT_BURST 32
+#define VHOST_LOG_PAGE	4096
+
+static inline void __attribute__((always_inline))
+vhost_log_page(uint8_t *log_base, uint64_t page)
+{
+	log_base[page / 8] |= 1 << (page % 8);
+}
+
+static inline void __attribute__((always_inline))
+vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
+{
+	uint64_t page;
+
+	if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
+		   !dev->log_base || !len))
+		return;
+
+	if (unlikely(dev->log_size <= ((addr + len - 1) / VHOST_LOG_PAGE / 8)))
+		return;
+
+	/* To make sure guest memory updates are committed before logging */
+	rte_smp_wmb();
+
+	page = addr / VHOST_LOG_PAGE;
+	while (page * VHOST_LOG_PAGE < addr + len) {
+		vhost_log_page((uint8_t *)(uintptr_t)dev->log_base, page);
+		page += 1;
+	}
+}
 
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 98+ messages in thread

* [PATCH v3 3/8] vhost: log used vring changes
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:57     ` [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
                       ` (6 subsequent siblings)
  9 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we update the virtio used ring, we need to log it. This is
done by a new vhost_log_write() wrapper, vhost_log_used_vring().

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---

v3: remove the unnecessary var "addr" in vhost_log_used_vring()
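
To make the address computation concrete, here is a small sketch of how the guest physical range for a used ring update could be derived from log_guest_addr. The struct definitions are local stand-ins that mirror the layout of struct vring_used and struct vring_used_elem from linux/virtio_ring.h; the names used_elem_range() and used_idx_range() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins with the same layout as the virtio used ring. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

struct used_ring {
	uint16_t flags;
	uint16_t idx;
	struct used_elem ring[];
};

/* A range of guest physical memory to be marked dirty. */
struct gpa_range {
	uint64_t addr;
	uint64_t len;
};

/* Range covering used->ring[idx], given the ring's guest address. */
static struct gpa_range used_elem_range(uint64_t log_guest_addr, uint16_t idx)
{
	struct gpa_range r;

	r.addr = log_guest_addr + offsetof(struct used_ring, ring) +
		 (uint64_t)idx * sizeof(struct used_elem);
	r.len = sizeof(struct used_elem);
	return r;
}

/* Range covering used->idx. */
static struct gpa_range used_idx_range(uint64_t log_guest_addr)
{
	struct gpa_range r;

	r.addr = log_guest_addr + offsetof(struct used_ring, idx);
	r.len = sizeof(uint16_t);
	return r;
}
```

Each range would then be handed to vhost_log_write(), which is why the wrapper only needs an offset and a length on top of vq->log_guest_addr.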
---
 lib/librte_vhost/rte_virtio_net.h |  3 +-
 lib/librte_vhost/vhost_rxtx.c     | 77 +++++++++++++++++++++++++++------------
 lib/librte_vhost/virtio-net.c     |  4 ++
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 2c891ae..4a2303a 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -91,7 +91,8 @@ struct vhost_virtqueue {
 	int			callfd;			/**< Used to notify the guest (trigger interrupt). */
 	int			kickfd;			/**< Currently unused as polling mode is enabled. */
 	int			enabled;
-	uint64_t		reserved[16];		/**< Reserve some spaces for future extension. */
+	uint64_t		log_guest_addr;		/**< Physical address of used ring, for logging */
+	uint64_t		reserved[15];		/**< Reserve some spaces for future extension. */
 	struct buf_vector	buf_vec[BUF_VECTOR_MAX];	/**< for scatter RX. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index d485364..9f7b1f8 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -72,6 +72,13 @@ vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
 	}
 }
 
+static inline void __attribute__((always_inline))
+vhost_log_used_vring(struct virtio_net *dev, struct vhost_virtqueue *vq,
+		     uint64_t offset, uint64_t len)
+{
+	vhost_log_write(dev, vq->log_guest_addr + offset, len);
+}
+
 static bool
 is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t qp_nb)
 {
@@ -158,6 +165,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		uint32_t offset = 0, vb_offset = 0;
 		uint32_t pkt_len, len_to_cpy, data_len, total_copied = 0;
 		uint8_t hdr = 0, uncompleted_pkt = 0;
+		uint16_t idx;
 
 		/* Get descriptor from available ring */
 		desc = &vq->desc[head[packet_success]];
@@ -229,16 +237,18 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 		}
 
 		/* Update used ring with desc information */
-		vq->used->ring[res_cur_idx & (vq->size - 1)].id =
-							head[packet_success];
+		idx = res_cur_idx & (vq->size - 1);
+		vq->used->ring[idx].id = head[packet_success];
 
 		/* Drop the packet if it is uncompleted */
 		if (unlikely(uncompleted_pkt == 1))
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							vq->vhost_hlen;
+			vq->used->ring[idx].len = vq->vhost_hlen;
 		else
-			vq->used->ring[res_cur_idx & (vq->size - 1)].len =
-							pkt_len + vq->vhost_hlen;
+			vq->used->ring[idx].len = pkt_len + vq->vhost_hlen;
+
+		vhost_log_used_vring(dev, vq,
+			offsetof(struct vring_used, ring[idx]),
+			sizeof(vq->used->ring[idx]));
 
 		res_cur_idx++;
 		packet_success++;
@@ -265,6 +275,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 	*(volatile uint16_t *)&vq->used->idx += count;
 	vq->last_used_idx = res_end_idx;
+	vhost_log_used_vring(dev, vq,
+		offsetof(struct vring_used, idx),
+		sizeof(vq->used->idx));
 
 	/* flush used->idx update before we read avail->flags. */
 	rte_mb();
@@ -294,6 +307,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	uint32_t seg_avail;
 	uint32_t vb_avail;
 	uint32_t cpy_len, entry_len;
+	uint16_t idx;
 
 	if (pkt == NULL)
 		return 0;
@@ -331,16 +345,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 	entry_len = vq->vhost_hlen;
 
 	if (vb_avail == 0) {
-		uint32_t desc_idx =
-			vq->buf_vec[vec_idx].desc_idx;
+		uint32_t desc_idx = vq->buf_vec[vec_idx].desc_idx;
+
+		if ((vq->desc[desc_idx].flags & VRING_DESC_F_NEXT) == 0) {
+			idx = cur_idx & (vq->size - 1);
 
-		if ((vq->desc[desc_idx].flags
-			& VRING_DESC_F_NEXT) == 0) {
 			/* Update used ring with desc information */
-			vq->used->ring[cur_idx & (vq->size - 1)].id
-				= vq->buf_vec[vec_idx].desc_idx;
-			vq->used->ring[cur_idx & (vq->size - 1)].len
-				= entry_len;
+			vq->used->ring[idx].id = vq->buf_vec[vec_idx].desc_idx;
+			vq->used->ring[idx].len = entry_len;
+
+			vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 
 			entry_len = 0;
 			cur_idx++;
@@ -383,10 +399,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 			if ((vq->desc[vq->buf_vec[vec_idx].desc_idx].flags &
 				VRING_DESC_F_NEXT) == 0) {
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_len = 0;
 				cur_idx++;
 				entry_success++;
@@ -419,16 +438,18 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 					if ((vq->desc[desc_idx].flags &
 						VRING_DESC_F_NEXT) == 0) {
-						uint16_t wrapped_idx =
-							cur_idx & (vq->size - 1);
+						idx = cur_idx & (vq->size - 1);
 						/*
 						 * Update used ring with the
 						 * descriptor information
 						 */
-						vq->used->ring[wrapped_idx].id
+						vq->used->ring[idx].id
 							= desc_idx;
-						vq->used->ring[wrapped_idx].len
+						vq->used->ring[idx].len
 							= entry_len;
+						vhost_log_used_vring(dev, vq,
+							offsetof(struct vring_used, ring[idx]),
+							sizeof(vq->used->ring[idx]));
 						entry_success++;
 						entry_len = 0;
 						cur_idx++;
@@ -451,10 +472,13 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 				 * This whole packet completes.
 				 */
 				/* Update used ring with desc information */
-				vq->used->ring[cur_idx & (vq->size - 1)].id
+				idx = cur_idx & (vq->size - 1);
+				vq->used->ring[idx].id
 					= vq->buf_vec[vec_idx].desc_idx;
-				vq->used->ring[cur_idx & (vq->size - 1)].len
-					= entry_len;
+				vq->used->ring[idx].len = entry_len;
+				vhost_log_used_vring(dev, vq,
+					offsetof(struct vring_used, ring[idx]),
+					sizeof(vq->used->ring[idx]));
 				entry_success++;
 				break;
 			}
@@ -682,6 +706,9 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 		/* Update used index buffer information. */
 		vq->used->ring[used_idx].id = head[entry_success];
 		vq->used->ring[used_idx].len = 0;
+		vhost_log_used_vring(dev, vq,
+				offsetof(struct vring_used, ring[used_idx]),
+				sizeof(vq->used->ring[used_idx]));
 
 		/* Allocate an mbuf and populate the structure. */
 		m = rte_pktmbuf_alloc(mbuf_pool);
@@ -802,6 +829,8 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 
 	rte_compiler_barrier();
 	vq->used->idx += entry_success;
+	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
+			sizeof(vq->used->idx));
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index de78a0f..03044f6 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -666,12 +666,16 @@ set_vring_addr(struct vhost_device_ctx ctx, struct vhost_vring_addr *addr)
 		return -1;
 	}
 
+	vq->log_guest_addr = addr->log_guest_addr;
+
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address desc: %p\n",
 			dev->device_fh, vq->desc);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address avail: %p\n",
 			dev->device_fh, vq->avail);
 	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") mapped address used: %p\n",
 			dev->device_fh, vq->used);
+	LOG_DEBUG(VHOST_CONFIG, "(%"PRIu64") log_guest_addr: %"PRIx64"\n",
+			dev->device_fh, vq->log_guest_addr);
 
 	return 0;
 }
-- 
1.9.0
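
A minimal sketch of the address arithmetic behind the used-ring logging calls in this patch, under stated assumptions: the struct layouts mirror the virtio split-ring "used" layout, the helper names (`used_elem_log_addr`, `used_idx_log_addr`) are hypothetical, and the step of adding `log_guest_addr` to the offset is an assumption about what `vhost_log_used_vring()` does, since its body is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layouts assumed to match the virtio split-ring "used" structures. */
struct used_elem {
	uint32_t id;
	uint32_t len;
};

struct used_ring {
	uint16_t flags;
	uint16_t idx;
	struct used_elem ring[];
};

/*
 * Guest address of the used-ring entry selected by cur_idx.
 * ring_size must be a power of two, so masking wraps the index.
 */
uint64_t
used_elem_log_addr(uint64_t log_guest_addr, uint16_t cur_idx,
		   uint16_t ring_size)
{
	uint16_t slot;

	assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
	slot = (uint16_t)(cur_idx & (ring_size - 1));

	return log_guest_addr + offsetof(struct used_ring, ring) +
		(uint64_t)slot * sizeof(struct used_elem);
}

/* Guest address of the used->idx field itself. */
uint64_t
used_idx_log_addr(uint64_t log_guest_addr)
{
	return log_guest_addr + offsetof(struct used_ring, idx);
}
```

Each address, together with the size of the field written, is the range that would need to be marked dirty after the corresponding store.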

* [PATCH v3 4/8] vhost: log vring desc buffer changes
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (2 preceding siblings ...)
  2016-01-29  4:57     ` [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
@ 2016-01-29  4:57     ` Yuanhan Liu
  2016-01-29  4:58     ` [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
                       ` (5 subsequent siblings)
  9 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:57 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Every time we copy a buf to vring desc, we need to log it.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Victor Kaplansky <victork@redhat.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
 lib/librte_vhost/vhost_rxtx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 9f7b1f8..9c6cc00 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -97,7 +97,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mbuf **pkts, uint32_t count)
 {
 	struct vhost_virtqueue *vq;
-	struct vring_desc *desc;
+	struct vring_desc *desc, *hdr_desc;
 	struct rte_mbuf *buff;
 	/* The virtio_hdr is initialised to 0. */
 	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
@@ -179,6 +179,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		/* Copy virtio_hdr to packet and increment buffer address */
 		buff_hdr_addr = buff_addr;
+		hdr_desc = desc;
 
 		/*
 		 * If the descriptors are chained the header and data are
@@ -203,6 +204,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 			rte_memcpy((void *)(uintptr_t)(buff_addr + vb_offset),
 				rte_pktmbuf_mtod_offset(buff, const void *, offset),
 				len_to_cpy);
+			vhost_log_write(dev, desc->addr + vb_offset, len_to_cpy);
 			PRINT_PACKET(dev, (uintptr_t)(buff_addr + vb_offset),
 				len_to_cpy, 0);
 
@@ -258,6 +260,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
 
 		rte_memcpy((void *)(uintptr_t)buff_hdr_addr,
 			(const void *)&virtio_hdr, vq->vhost_hlen);
+		vhost_log_write(dev, hdr_desc->addr, vq->vhost_hlen);
 
 		PRINT_PACKET(dev, (uintptr_t)buff_hdr_addr, vq->vhost_hlen, 1);
 
@@ -335,6 +338,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 
 	rte_memcpy((void *)(uintptr_t)vb_hdr_addr,
 		(const void *)&virtio_hdr, vq->vhost_hlen);
+	vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr, vq->vhost_hlen);
 
 	PRINT_PACKET(dev, (uintptr_t)vb_hdr_addr, vq->vhost_hlen, 1);
 
@@ -379,6 +383,8 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
 		rte_memcpy((void *)(uintptr_t)(vb_addr + vb_offset),
 			rte_pktmbuf_mtod_offset(pkt, const void *, seg_offset),
 			cpy_len);
+		vhost_log_write(dev, vq->buf_vec[vec_idx].buf_addr + vb_offset,
+			cpy_len);
 
 		PRINT_PACKET(dev,
 			(uintptr_t)(vb_addr + vb_offset),
-- 
1.9.0

* [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (3 preceding siblings ...)
  2016-01-29  4:57     ` [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-03-11 12:39       ` Olivier MATZ
  2016-01-29  4:58     ` [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
                       ` (4 subsequent siblings)
  9 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

It's actually a feature already enabled in the Linux kernel (since v3.5).
All we need to do is claim that we support this feature, and nothing
else.

With that, the guest will send a gratuitous ARP message after live migration
to notify the switches about the new location of the migrated VM.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
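
A small sketch of how the announcement responsibility could be decided from the negotiated bits, assuming the guest announces itself when GUEST_ANNOUNCE was negotiated and the backend falls back to a RARP otherwise (the fallback is the subject of the SEND_RARP patch later in this series). The enum and function names are hypothetical; bit 21 is the VIRTIO_NET_F_GUEST_ANNOUNCE value from the virtio specification, and bit 2 is the VHOST_USER_PROTOCOL_F_RARP value used later in the series.

```c
#include <stdint.h>

#define F_GUEST_ANNOUNCE_BIT	21	/* VIRTIO_NET_F_GUEST_ANNOUNCE */
#define PROTOCOL_F_RARP_BIT	2	/* VHOST_USER_PROTOCOL_F_RARP */

enum announce_path {
	ANNOUNCE_BY_GUEST,	/* guest kernel sends the gratuitous ARP */
	ANNOUNCE_BY_BACKEND,	/* backend builds a RARP on the guest's behalf */
	ANNOUNCE_NONE,		/* nobody updates the switch tables */
};

/* Hypothetical helper: pick who announces the VM after migration. */
enum announce_path
pick_announce_path(uint64_t negotiated_features, uint64_t protocol_features)
{
	if (negotiated_features & (1ULL << F_GUEST_ANNOUNCE_BIT))
		return ANNOUNCE_BY_GUEST;
	if (protocol_features & (1ULL << PROTOCOL_F_RARP_BIT))
		return ANNOUNCE_BY_BACKEND;
	return ANNOUNCE_NONE;
}
```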
 lib/librte_vhost/virtio-net.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
index 03044f6..0ba5045 100644
--- a/lib/librte_vhost/virtio-net.c
+++ b/lib/librte_vhost/virtio-net.c
@@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
 #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
 				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
 				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
+				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
 				(VHOST_SUPPORTS_MQ)            | \
 				(1ULL << VIRTIO_F_VERSION_1)   | \
 				(1ULL << VHOST_F_LOG_ALL)      | \
-- 
1.9.0

* [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (4 preceding siblings ...)
  2016-01-29  4:58     ` [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-02-19  6:11       ` Tan, Jianfeng
  2016-01-29  4:58     ` [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
                       ` (3 subsequent siblings)
  9 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

In the previous patch we enabled the GUEST_ANNOUNCE feature, so that the
guest OS broadcasts a GARP message after migration to notify the
switch about the new location of the migrated VM. However,
GUEST_ANNOUNCE is only available since kernel v3.5. For older kernels,
the VHOST_USER_SEND_RARP request comes to the rescue.

The payload of this new request is the MAC address of the migrated VM;
with that, we can construct a RARP message and then broadcast it
to the host interfaces.

This is how the patch works:

- list all interfaces, with the help of SIOCGIFCONF ioctl command

- construct an RARP message and broadcast it

Cc: Thibaut Collet <thibaut.collet@6wind.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
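
As an illustration of the RARP frame described above, here is a byte-level sketch that does not depend on the system's ARP headers. It assumes the same field choices as the patch (broadcast destination, reverse-request opcode, the VM's MAC as both sender and target hardware address, zeroed IPv4 fields). The function name `build_rarp_frame` is hypothetical. Unlike a struct overlay on an uninitialized buffer, it zeroes the whole buffer first so that no stale stack bytes end up in the padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN		6
#define RARP_FRAME_LEN	42	/* 14-byte Ethernet header + 28-byte ARP body */

/* Store a 16-bit value in network (big-endian) byte order. */
static void
put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/* Build a broadcast RARP request; returns the frame length, or 0 if too small. */
size_t
build_rarp_frame(uint8_t *buf, size_t buf_len, const uint8_t mac[MAC_LEN])
{
	assert(buf != NULL && mac != NULL);
	if (buf_len < RARP_FRAME_LEN)
		return 0;

	memset(buf, 0, buf_len);		/* padding and IPv4 fields stay zero */

	memset(buf, 0xff, MAC_LEN);		/* destination: broadcast */
	memcpy(buf + 6, mac, MAC_LEN);		/* source: the VM's MAC */
	put_be16(buf + 12, 0x8035);		/* EtherType: RARP */

	put_be16(buf + 14, 1);			/* hardware type: Ethernet */
	put_be16(buf + 16, 0x0800);		/* protocol type: IPv4 */
	buf[18] = MAC_LEN;			/* hardware address length */
	buf[19] = 4;				/* protocol address length */
	put_be16(buf + 20, 3);			/* opcode: reverse request */
	memcpy(buf + 22, mac, MAC_LEN);		/* sender hardware address */
	memcpy(buf + 32, mac, MAC_LEN);		/* target hardware address */

	return RARP_FRAME_LEN;
}
```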

  Note that this patch did take effect in my test:

  - it indeed updated the target vswitch's MAC learning table (checked
    with the "ovs fdb/show bridge" command)

  - the ping request packets after migration indeed flowed
    to the target (not the source) host's vswitch (checked with the
    tcpdump command)

  However, I still saw ping loss. I asked Thibaut, the original author
  of the VHOST_USER_SEND_RARP request, for help; he suggested
  that it might be an issue with my network topology or OVS settings,
  which is likely, given what I observed above.

  Anyway, I'd like to send this out; hopefully someone knows what's
  wrong there, if anything.  In the meantime, I will do more debugging.
---
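
The interface enumeration step from the commit message can be sketched on its own as follows. This is a Linux-only illustration assuming the same SIOCGIFCONF approach with a buffer that doubles until the result fits; the names `for_each_interface` and `count_interface` are hypothetical. The sketch checks the realloc() result and frees the buffer on every exit path. Note that SIOCGIFCONF only reports interfaces that have an IPv4 address configured.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <net/if.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

typedef void (*ifname_cb)(const char *ifname, void *arg);

/* Example callback: count the interfaces seen. */
void
count_interface(const char *ifname, void *arg)
{
	(void)ifname;
	++*(int *)arg;
}

/* Call cb once per interface; returns the interface count, or -1 on error. */
int
for_each_interface(ifname_cb cb, void *arg)
{
	struct ifconf ifc = { 0 };
	char *buf = NULL;
	int nr = 16;
	int fd, n, i;

	assert(cb != NULL);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	for (;;) {
		int len = (int)sizeof(struct ifreq) * nr;
		char *tmp = realloc(buf, (size_t)len);

		if (tmp == NULL)
			goto fail;
		buf = tmp;
		ifc.ifc_len = len;
		ifc.ifc_buf = buf;
		if (ioctl(fd, SIOCGIFCONF, &ifc) < 0)
			goto fail;
		if (ifc.ifc_len < len)
			break;		/* everything fit */
		nr *= 2;		/* buffer may be truncated; retry larger */
	}

	n = ifc.ifc_len / (int)sizeof(struct ifreq);
	for (i = 0; i < n; i++)
		cb(ifc.ifc_req[i].ifr_name, arg);

	free(buf);
	close(fd);
	return n;

fail:
	free(buf);
	close(fd);
	return -1;
}
```

A caller would pass a callback that sends the prepared frame on each named interface; `count_interface` above only counts them.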
 lib/librte_vhost/vhost_user/vhost-net-user.c  |   4 +
 lib/librte_vhost/vhost_user/vhost-net-user.h  |   1 +
 lib/librte_vhost/vhost_user/virtio-net-user.c | 125 ++++++++++++++++++++++++++
 lib/librte_vhost/vhost_user/virtio-net-user.h |   5 +-
 4 files changed, 134 insertions(+), 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index 32ad6f6..cb18396 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -100,6 +100,7 @@ static const char *vhost_message_str[VHOST_USER_MAX] = {
 	[VHOST_USER_SET_PROTOCOL_FEATURES]  = "VHOST_USER_SET_PROTOCOL_FEATURES",
 	[VHOST_USER_GET_QUEUE_NUM]  = "VHOST_USER_GET_QUEUE_NUM",
 	[VHOST_USER_SET_VRING_ENABLE]  = "VHOST_USER_SET_VRING_ENABLE",
+	[VHOST_USER_SEND_RARP]  = "VHOST_USER_SEND_RARP",
 };
 
 /**
@@ -437,6 +438,9 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 	case VHOST_USER_SET_VRING_ENABLE:
 		user_set_vring_enable(ctx, &msg.payload.state);
 		break;
+	case VHOST_USER_SEND_RARP:
+		user_send_rarp(&msg);
+		break;
 
 	default:
 		break;
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.h b/lib/librte_vhost/vhost_user/vhost-net-user.h
index 6d252a3..e3bb413 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.h
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.h
@@ -67,6 +67,7 @@ typedef enum VhostUserRequest {
 	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
 	VHOST_USER_GET_QUEUE_NUM = 17,
 	VHOST_USER_SET_VRING_ENABLE = 18,
+	VHOST_USER_SEND_RARP = 19,
 	VHOST_USER_MAX
 } VhostUserRequest;
 
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 0f3b163..cda330d 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -34,11 +34,18 @@
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/ethernet.h>
+#include <netinet/in.h>
+#include <netinet/if_ether.h>
+#include <linux/if_packet.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -413,3 +420,121 @@ user_set_log_base(struct vhost_device_ctx ctx,
 
 	return 0;
 }
+
+#define RARP_BUF_SIZE	64
+
+static void
+make_rarp_packet(uint8_t *buf, uint8_t *mac)
+{
+	struct ether_header *eth_hdr;
+	struct ether_arp *rarp;
+
+	/* Ethernet header. */
+	eth_hdr = (struct ether_header *)buf;
+	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
+	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
+	eth_hdr->ether_type = htons(ETH_P_RARP);
+
+	/* RARP header. */
+	rarp = (struct ether_arp *)(eth_hdr + 1);
+	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
+	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
+	rarp->ea_hdr.ar_hln = ETH_ALEN;
+	rarp->ea_hdr.ar_pln = 4;
+	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
+
+	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
+	memset(&rarp->arp_spa, 0x00, 4);
+	memcpy(&rarp->arp_tha, mac, 6);
+	memset(&rarp->arp_tpa, 0x00, 4);
+}
+
+
+static void
+send_rarp(const char *ifname, uint8_t *rarp)
+{
+	int fd;
+	struct ifreq ifr;
+	struct sockaddr_ll addr;
+
+	fd = socket(AF_PACKET, SOCK_RAW, 0);
+	if (fd < 0) {
+		perror("socket failed");
+		return;
+	}
+
+	memset(&ifr, 0, sizeof(struct ifreq));
+	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
+	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
+		perror("failed to get interface index");
+		close(fd);
+		return;
+	}
+
+	addr.sll_ifindex = ifr.ifr_ifindex;
+	addr.sll_halen   = ETH_ALEN;
+
+	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
+		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
+		perror("send rarp packet failed");
+	}
+}
+
+
+/*
+ * Broadcast a RARP message to all interfaces, to update
+ * switch's mac table
+ */
+int
+user_send_rarp(struct VhostUserMsg *msg)
+{
+	uint8_t *mac = (uint8_t *)&msg->payload.u64;
+	uint8_t rarp[RARP_BUF_SIZE];
+	struct ifconf ifc = {0, };
+	struct ifreq *ifr;
+	int nr = 16;
+	int fd;
+	uint32_t i;
+
+	RTE_LOG(DEBUG, VHOST_CONFIG,
+		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
+		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+
+	make_rarp_packet(rarp, mac);
+
+	/*
+	 * Get all interfaces
+	 */
+	fd = socket(AF_INET, SOCK_DGRAM, 0);
+	if (fd < 0) {
+		perror("failed to create AF_INET socket");
+		return -1;
+	}
+
+again:
+	ifc.ifc_len = sizeof(*ifr) * nr;
+	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
+
+	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
+		perror("failed at SIOCGIFCONF");
+		close(fd);
+		return -1;
+	}
+
+	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
+		/*
+		 * current ifc_buf is not big enough to hold
+		 * all interfaces; double it and try again.
+		 */
+		nr *= 2;
+		goto again;
+	}
+
+	ifr = (struct ifreq *)ifc.ifc_buf;
+	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
+		send_rarp(ifr[i].ifr_name, rarp);
+
+	close(fd);
+
+	return 0;
+}
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 013cf38..1e9ff9a 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,8 +38,10 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_RARP	2
 
-#define VHOST_USER_PROTOCOL_FEATURES	(1ULL << VHOST_USER_PROTOCOL_F_MQ)
+#define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_RARP))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
 
@@ -50,6 +52,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
 int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
+int user_send_rarp(struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0

* [PATCH v3 7/8] vhost: enable log_shmfd protocol feature
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (5 preceding siblings ...)
  2016-01-29  4:58     ` [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-01-29  4:58     ` [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
                       ` (2 subsequent siblings)
  9 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

To claim that we support vhost-user live migration:
the SET_LOG_BASE request will be sent only when this feature flag
is set.

Besides this flag, we actually need another feature flag set
to make vhost-user live migration work: VHOST_F_LOG_ALL,
which, however, was enabled a long time ago.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Tested-by: Pavel Fedin <p.fedin@samsung.com>
---
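
To make the two-flag requirement concrete, here is a minimal sketch of the check, assuming bit 26 for VHOST_F_LOG_ALL (its value in the Linux vhost header) and bit 1 for VHOST_USER_PROTOCOL_F_LOG_SHMFD (the value defined in this patch). The function name `dirty_logging_possible` is hypothetical.

```c
#include <stdint.h>

#define F_LOG_ALL_BIT			26	/* VHOST_F_LOG_ALL */
#define PROTOCOL_F_LOG_SHMFD_BIT	1	/* VHOST_USER_PROTOCOL_F_LOG_SHMFD */

/*
 * Dirty-page logging needs both the virtio-level feature bit and the
 * vhost-user protocol feature bit to have been negotiated.
 */
int
dirty_logging_possible(uint64_t features, uint64_t protocol_features)
{
	return (features & (1ULL << F_LOG_ALL_BIT)) != 0 &&
	       (protocol_features & (1ULL << PROTOCOL_F_LOG_SHMFD_BIT)) != 0;
}
```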
 doc/guides/rel_notes/release_2_3.rst          | 2 ++
 lib/librte_vhost/vhost_user/virtio-net-user.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/doc/guides/rel_notes/release_2_3.rst b/doc/guides/rel_notes/release_2_3.rst
index 99de186..f2c9e41 100644
--- a/doc/guides/rel_notes/release_2_3.rst
+++ b/doc/guides/rel_notes/release_2_3.rst
@@ -4,6 +4,8 @@ DPDK Release 2.3
 New Features
 ------------
 
+* **Added vhost-user live migration support.**
+
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 1e9ff9a..28213f3 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -38,9 +38,11 @@
 #include "vhost-net-user.h"
 
 #define VHOST_USER_PROTOCOL_F_MQ	0
+#define VHOST_USER_PROTOCOL_F_LOG_SHMFD	1
 #define VHOST_USER_PROTOCOL_F_RARP	2
 
 #define VHOST_USER_PROTOCOL_FEATURES	((1ULL << VHOST_USER_PROTOCOL_F_MQ) | \
+					 (1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) |\
 					 (1ULL << VHOST_USER_PROTOCOL_F_RARP))
 
 int user_set_mem_table(struct vhost_device_ctx, struct VhostUserMsg *);
-- 
1.9.0

* [PATCH v3 8/8] vhost: remove duplicate header include
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (6 preceding siblings ...)
  2016-01-29  4:58     ` [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
@ 2016-01-29  4:58     ` Yuanhan Liu
  2016-02-01 15:54     ` [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
  2016-02-19 15:01     ` Thomas Monjalon
  9 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-01-29  4:58 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

unistd.h has been included twice; remove one.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/vhost_user/virtio-net-user.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index cda330d..4270c98 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -39,7 +39,6 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
-#include <unistd.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
 #include <net/ethernet.h>
-- 
1.9.0

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (7 preceding siblings ...)
  2016-01-29  4:58     ` [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
@ 2016-02-01 15:54     ` Iremonger, Bernard
  2016-02-02  1:53       ` Yuanhan Liu
  2016-02-19 15:01     ` Thomas Monjalon
  9 siblings, 1 reply; 98+ messages in thread
From: Iremonger, Bernard @ 2016-02-01 15:54 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Hi Yuanhan,

<snip>

> A simple test guide (on same host)
> ==================================
> 
> The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> DPDK):
> 
>     [0]: http://wiki.qemu.org/Features/vhost-user-ovs-dpdk
> 
> Here is the rough test guide:
> 
> 1. start ovs-vswitchd
> 
> 2. Add two ovs vhost-user port, say vhost0 and vhost1
> 
> 3. Start a VM1 to connect to vhost0. Here is my example:
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost0  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3333,server,nowait -curses
> 
> 4. run "ping $host" inside VM1
> 
> 5. Start VM2 to connect to vhost0, and marking it as the target
>    of live migration (by adding -incoming tcp:0:4444 option)
> 
>    $ $QEMU -enable-kvm -m 1024 -smp 4 \
>        -chardev socket,id=char0,path=/var/run/openvswitch/vhost1  \
>        -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \
>        -device virtio-net-pci,netdev=mynet1,mac=52:54:00:12:34:58 \
>        -object memory-backend-file,id=mem,size=1024M,mem-
> path=$HOME/hugetlbfs,share=on \
>        -numa node,memdev=mem -mem-prealloc \
>        -kernel $HOME/iso/vmlinuz -append "root=/dev/sda1" \
>        -hda fc-19-i386.img \
>        -monitor telnet::3334,server,nowait -curses \
>        -incoming tcp:0:4444
> 
> 6. connect to VM1 monitor, and start migration:
> 
>    > migrate tcp:0:4444
> 
> 7. After a while, you will find that VM1 has been migrated to VM2,
>    and the "ping" command continues running, perfectly.
> 

The above test guide should probably be added to the DPDK doc files.
It could be added to the sample app guide or the programmers guide.
There is already a Vhost Library section in the programmers guide and
a Vhost Sample Application section in the sample app guide.
The Vhost Sample Application section might be the best place to add it.

It would be useful to add some documentation on the Vhost Logging feature itself.

Regards,

Bernard.

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-02-01 15:54     ` [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
@ 2016-02-02  1:53       ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-02  1:53 UTC (permalink / raw)
  To: Iremonger, Bernard; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

On Mon, Feb 01, 2016 at 03:54:05PM +0000, Iremonger, Bernard wrote:
> Hi Yuanhan,
> 
> <snip>
> 
> > A simple test guide (on same host)
> > ==================================
> > 
> > The following test is based on OVS + DPDK (check [0] for how to setup OVS +
> > DPDK):
...
> 
> The above test guide should probably be added to the DPDK doc files.
> It could be added to the sample app guide or the programmers guide.
> There is already a Vhost Library section in the programmers guide and
> a Vhost Sample Application section in the sample app guide.

Hi Iremonger,

You made the same request for my last version, and I was aware of that while
preparing this version. The reason I didn't include it is the same
as I replied last time: it's a very rough test guide; the
formal test guide would use 2 hosts, and you would also have to establish
a connection to the VM before migration and make sure it still works
right after the migration.

So, I will consider adding such a formal test plan after the validation
team has finished validation.

> The Vhost Sample Application section might be the best place to add it.

I don't think that's the right place: we can't do the live migration
test with the vhost-switch example, since it relies on some hardware
features, such as VLAN. That information would be lost after migration.

And that's the reason I chose OVS.

> 
> It would be useful to add some documentation on the Vhost Logging feature Itself. 

Good suggestion; I will think about what I can do for it.

	--yliu

* Re: [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-01-29  4:58     ` [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
@ 2016-02-19  6:11       ` Tan, Jianfeng
  2016-02-19  7:03         ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Tan, Jianfeng @ 2016-02-19  6:11 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

Hi Yuanhan,

On 1/29/2016 12:58 PM, Yuanhan Liu wrote:
> While in former patch we enabled GUEST_ANNOUNCE feature, so that the
> guest OS will broadcast a GARP message after migration to notify the
> switch about the new location of migrated VM, the thing is that
> GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
> VHOST_USER_SEND_RARP request comes to rescue.
>
> The payload of this new request is the mac address of the migrated VM,
> with that, we could construct a RARP message, and then broadcast it
> to host interfaces.
>
> That's how this patch works:
>
> - list all interfaces, with the help of SIOCGIFCONF ioctl command
>
> - construct an RARP message and broadcast it
>
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
...
> +
> +/*
> + * Broadcast a RARP message to all interfaces, to update
> + * switch's mac table
> + */
> +int
> +user_send_rarp(struct VhostUserMsg *msg)
> +{
> +	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> +	uint8_t rarp[RARP_BUF_SIZE];
> +	struct ifconf ifc = {0, };
> +	struct ifreq *ifr;
> +	int nr = 16;
> +	int fd;
> +	uint32_t i;
> +
> +	RTE_LOG(DEBUG, VHOST_CONFIG,
> +		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
> +		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> +
> +	make_rarp_packet(rarp, mac);
> +
> +	/*
> +	 * Get all interfaces
> +	 */
> +	fd = socket(AF_INET, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("failed to create AF_INET socket");
> +		return -1;
> +	}
> +
> +again:
> +	ifc.ifc_len = sizeof(*ifr) * nr;
> +	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> +
> +	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> +		perror("failed at SIOCGIFCONF");
> +		close(fd);
> +		return -1;
> +	}
> +
> +	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> +		/*
> +		 * current ifc_buf is not big enough to hold
> +		 * all interfaces; double it and try again.
> +		 */
> +		nr *= 2;
> +		goto again;
> +	}
> +
> +	ifr = (struct ifreq *)ifc.ifc_buf;
> +	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> +		send_rarp(ifr[i].ifr_name, rarp);
> +
> +	close(fd);
> +
> +	return 0;
> +}

From how you implement user_send_rarp(), if I understand it correctly,
it broadcasts this ARP packet to all host interfaces, which I don't
think is appropriate. This ARP packet should be sent only within its own
L2 network. You should not assume that all interfaces
maintained in the kernel are in the same L2 network. Even worse, this
could cause problems in overlay networking, where two VMs in
two different overlay networks can have the same MAC address.

What I suggest here is to move user_send_rarp() into
rte_vhost_dequeue_burst(), controlled by a flag, so that this ARP
packet is broadcast in its own L2 network.

Thanks,
Jianfeng

* Re: [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-02-19  6:11       ` Tan, Jianfeng
@ 2016-02-19  7:03         ` Yuanhan Liu
  2016-02-19  8:55           ` Yuanhan Liu
  2016-02-22 14:36           ` [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  0 siblings, 2 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-19  7:03 UTC (permalink / raw)
  To: Tan, Jianfeng; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 02:11:36PM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
> 
> On 1/29/2016 12:58 PM, Yuanhan Liu wrote:
> >While in former patch we enabled GUEST_ANNOUNCE feature, so that the
> >guest OS will broadcast a GARP message after migration to notify the
> >switch about the new location of migrated VM, the thing is that
> >GUEST_ANNOUNCE is enabled since kernel v3.5 only. For older kernel,
> >VHOST_USER_SEND_RARP request comes to rescue.
> >
> >The payload of this new request is the mac address of the migrated VM,
> >with that, we could construct a RARP message, and then broadcast it
> >to host interfaces.
> >
> >That's how this patch works:
> >
> >- list all interfaces, with the help of SIOCGIFCONF ioctl command
> >
> >- construct an RARP message and broadcast it
> >
> >Cc: Thibaut Collet <thibaut.collet@6wind.com>
> >Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >---
> ...
> >+
> >+/*
> >+ * Broadcast a RARP message to all interfaces, to update
> >+ * switch's mac table
> >+ */
> >+int
> >+user_send_rarp(struct VhostUserMsg *msg)
> >+{
> >+	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> >+	uint8_t rarp[RARP_BUF_SIZE];
> >+	struct ifconf ifc = {0, };
> >+	struct ifreq *ifr;
> >+	int nr = 16;
> >+	int fd;
> >+	uint32_t i;
> >+
> >+	RTE_LOG(DEBUG, VHOST_CONFIG,
> >+		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
> >+		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> >+
> >+	make_rarp_packet(rarp, mac);
> >+
> >+	/*
> >+	 * Get all interfaces
> >+	 */
> >+	fd = socket(AF_INET, SOCK_DGRAM, 0);
> >+	if (fd < 0) {
> >+		perror("failed to create AF_INET socket");
> >+		return -1;
> >+	}
> >+
> >+again:
> >+	ifc.ifc_len = sizeof(*ifr) * nr;
> >+	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> >+
> >+	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> >+		perror("failed at SIOCGIFCONF");
> >+		close(fd);
> >+		return -1;
> >+	}
> >+
> >+	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> >+		/*
> >+		 * current ifc_buf is not big enough to hold
> >+		 * all interfaces; double it and try again.
> >+		 */
> >+		nr *= 2;
> >+		goto again;
> >+	}
> >+
> >+	ifr = (struct ifreq *)ifc.ifc_buf;
> >+	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> >+		send_rarp(ifr[i].ifr_name, rarp);
> >+
> >+	close(fd);
> >+
> >+	return 0;
> >+}
> 
> From how you implement user_send_rarp(), if I understand it correctly, it
> broadcasts this ARP packets to all host interfaces, which I don't think it's
> appropriate. This ARP packets should be sent to it's own L2 networking. You
> should not make the hypothesis that all interfaces maintained in the kernel
> are in the same L2 networking. Even worse, this could bring problems when
> used in overlay networking, in which two VM in two different overlay
> networking, can have same MAC address.
> 
> What I suggest here is to move user_send_rarp() to rte_vhost_dequeue_burst()
> using a flag to control, so that this arp packet can be broadcasted in its
> own L2 network.

I have thought of that, too. I gave it up because the SEND_RARP request is
handled in a different thread from rte_vhost_dequeue_burst(), which means
the RARP packet will not be broadcast immediately after migration
is done: it will be broadcast only when rte_vhost_dequeue_burst() is invoked.

I was thinking the delay might be a problem. On second thought, it
doesn't look like one: the GUEST_ANNOUNCE packet is also sent out through
rte_vhost_dequeue_burst(), though it is enqueued by the guest kernel. And
given that we are a poll mode driver, it won't be an issue.

So, thanks. I will give it a quick try; it should work.
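
A rough sketch of the flag-based hand-off discussed above, assuming C11 atomics and a single pending request per device. The struct and function names are hypothetical. The control thread that receives SEND_RARP posts the request; the polling data path takes it at most once on its next dequeue call and can then inject the RARP packet into the returned burst.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct rarp_request {
	uint8_t mac[6];		/* written before the flag is raised */
	atomic_bool pending;	/* raised by control thread, cleared by data path */
};

/* Control thread: record the MAC, then publish the request. */
void
rarp_request_post(struct rarp_request *req, const uint8_t mac[6])
{
	assert(req != NULL && mac != NULL);
	memcpy(req->mac, mac, 6);
	atomic_store(&req->pending, true);
}

/* Data path: returns true exactly once per posted request. */
bool
rarp_request_take(struct rarp_request *req, uint8_t mac_out[6])
{
	assert(req != NULL && mac_out != NULL);
	if (!atomic_exchange(&req->pending, false))
		return false;
	memcpy(mac_out, req->mac, 6);
	return true;
}
```

The exchange makes sure the packet is injected only once even if the data path polls many times between two migrations.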

	--yliu

* Re: [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request
  2016-02-19  7:03         ` Yuanhan Liu
@ 2016-02-19  8:55           ` Yuanhan Liu
  2016-02-22 14:36           ` [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  1 sibling, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-19  8:55 UTC (permalink / raw)
  To: Tan, Jianfeng; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 03:03:26PM +0800, Yuanhan Liu wrote:
> On Fri, Feb 19, 2016 at 02:11:36PM +0800, Tan, Jianfeng wrote:
> > What I suggest here is to move user_send_rarp() to rte_vhost_dequeue_burst()
> > using a flag to control, so that this arp packet can be broadcasted in its
> > own L2 network.
> 
> I have thought of that, too. It was given up because SEND_RARP request was
> handled in different thread from rte_vhost_dequeue_burst(), leading to the
> fact that the RARP packet will not be broadcasted immediately after migration
> is done: it will be broadcasted only when rte_vhost_dequeue_burst() is invoked.
> 
> I was thinking the delay might be a problem. While thinking it twice, it
> doesn't look like one then. As GUEST_ANNOUNCE is also broadcasted by
> rte_vhost_dequeue_burst(); it's enqueued by guest kernel though. And
> judging that we are polling mode driver, it won't be an issue then.
> 
> So, thanks. I will give it a quick try; it should work.

It worked like a charm :) Thanks.

	--yliu

* Re: [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-01-29  4:57     ` [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
@ 2016-02-19 14:26       ` Thomas Monjalon
  2016-02-22  6:59         ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2016-02-19 14:26 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-01-29 12:57, Yuanhan Liu:
> Introduce vhost_log_write() helper function to log the dirty pages we
> touched. Page size is harded code to 4096 (VHOST_LOG_PAGE), and each
> log is presented by 1 bit.
> 
> Therefore, vhost_log_write() simply finds the right bit for related
> page we are gonna change, and set it to 1. dev->log_base denotes the
> start of the dirty page bitmap.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Signed-off-by: Victor Kaplansky <victork@redhat.com>
> Tested-by: Pavel Fedin <p.fedin@samsung.com>
[...]
> +static inline void __attribute__((always_inline))
> +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)

lib/librte_vhost/vhost_rxtx.c:59:1: error: unused function 'vhost_log_write'

I think it's better to squash with the next patch.
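
As a side note for readers, the bitmap arithmetic described in the
quoted commit message (4096-byte pages, one bit per page, bitmap
starting at log_base) can be sketched as below. This is a minimal
sketch under stated assumptions, not the function from the patch:
log_page(), log_write(), LOG_PAGE_SIZE and the explicit log_size
argument are hypothetical names chosen for the example, and the real
helper takes a struct virtio_net * instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_PAGE_SIZE 4096ULL	/* page size assumed by the dirty log */

/* Mark one page dirty: one bit per page, eight pages per byte. */
static void
log_page(uint8_t *log_base, uint64_t page)
{
	log_base[page / 8] |= (uint8_t)(1u << (page % 8));
}

/*
 * Mark every page overlapped by [addr, addr + len) as dirty.
 * Writes that would fall outside the bitmap are ignored.
 */
static void
log_write(uint8_t *log_base, uint64_t log_size, uint64_t addr, uint64_t len)
{
	uint64_t page, last;

	assert(log_base != NULL);
	if (len == 0)
		return;

	last = (addr + len - 1) / LOG_PAGE_SIZE;
	if (last / 8 >= log_size)
		return;

	for (page = addr / LOG_PAGE_SIZE; page <= last; page++)
		log_page(log_base, page);
}
```

For instance, a 2-byte write at address 4095 straddles pages 0 and 1,
so it sets the two lowest bits of the first bitmap byte.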

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
                       ` (8 preceding siblings ...)
  2016-02-01 15:54     ` [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
@ 2016-02-19 15:01     ` Thomas Monjalon
  2016-02-22  7:08       ` Yuanhan Liu
  9 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2016-02-19 15:01 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-01-29 12:57, Yuanhan Liu:
> This patch set adds the vhost-user live migration support.
> 
> The major task behind that is to log pages we touched during
> live migration, including used vring and desc buffer. So, this
> patch set is basically about adding vhost log support, and
> using it.
> 
> Another important thing is that you need to notify the switches
> about the VM location change after migration is done. The GUEST_ANNOUNCE
> feature is for that; it sends a GARP message after migration.
> For older kernels (<= v3.4) without GUEST_ANNOUNCE support,
> we construct and broadcast a RARP message, with the MAC address
> from the VHOST_USER_SEND_RARP payload.
> 
> Patchset
> ========
> - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
>   the dirty memory bitmap is.
>     
> - Patch 2 introduces a vhost_log_write() helper function to log
>   pages we are gonna change.
> 
> - Patch 3 logs changes we made to used vring.
> 
> - Patch 4 logs changes we made to vring desc buffer.
> 
> - Patch 5 and 7 add some feature bits related to live migration.
> 
> - patch 6 does the RARP construction and broadcast job.

Patches 2 and 3 have been merged to avoid a compilation error.
Applied, thanks

* Re: [PATCH v3 2/8] vhost: introduce vhost_log_write
  2016-02-19 14:26       ` Thomas Monjalon
@ 2016-02-22  6:59         ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-22  6:59 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 03:26:36PM +0100, Thomas Monjalon wrote:
> 2016-01-29 12:57, Yuanhan Liu:
> > Introduce vhost_log_write() helper function to log the dirty pages we
> > touched. Page size is hard-coded to 4096 (VHOST_LOG_PAGE), and each
> > page is represented by 1 bit.
> > 
> > Therefore, vhost_log_write() simply finds the right bit for the
> > page we are going to change, and sets it to 1. dev->log_base denotes the
> > start of the dirty page bitmap.
> > 
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > Signed-off-by: Victor Kaplansky <victork@redhat.com>
> > Tested-by: Pavel Fedin <p.fedin@samsung.com>
> [...]
> > +static inline void __attribute__((always_inline))
> > +vhost_log_write(struct virtio_net *dev, uint64_t addr, uint64_t len)
> 
> lib/librte_vhost/vhost_rxtx.c:59:1: error: unused function 'vhost_log_write'

Oops. This function was defined in a header file. I then moved it to
vhost_rxtx.c, but I forgot to do the compile test :(

> I think it's better to squash with the next patch.

Yes, and thanks!

	--yliu

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-02-19 15:01     ` Thomas Monjalon
@ 2016-02-22  7:08       ` Yuanhan Liu
  2016-02-22  9:56         ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-22  7:08 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Fri, Feb 19, 2016 at 04:01:25PM +0100, Thomas Monjalon wrote:
> 2016-01-29 12:57, Yuanhan Liu:
> > This patch set adds the vhost-user live migration support.
> > 
> > The major task behind that is to log pages we touched during
> > live migration, including used vring and desc buffer. So, this
> > patch set is basically about adding vhost log support, and
> > using it.
> > 
> > Another important thing is that you need to notify the switches
> > about the VM location change after migration is done. The GUEST_ANNOUNCE
> > feature is for that; it sends a GARP message after migration.
> > For older kernels (<= v3.4) without GUEST_ANNOUNCE support,
> > we construct and broadcast a RARP message, with the MAC address
> > from the VHOST_USER_SEND_RARP payload.
> > 
> > Patchset
> > ========
> > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> >   the dirty memory bitmap is.
> >     
> > - Patch 2 introduces a vhost_log_write() helper function to log
> >   pages we are gonna change.
> > 
> > - Patch 3 logs changes we made to used vring.
> > 
> > - Patch 4 logs changes we made to vring desc buffer.
> > 
> > - Patch 5 and 7 add some feature bits related to live migration.
> > 
> > - patch 6 does the RARP construction and broadcast job.
> 
> Patches 2 and 3 have been merged to avoid a compilation error.
> Applied, thanks

Actually, there was an ongoing discussion about patch 6, the handling
of the VHOST_USER_SEND_RARP request:

  http://dpdk.org/ml/archives/dev/2016-February/033539.html

Maybe you have seen that and I didn't make it clear; my bad. Since you
have already applied it, I will make a standalone patch, and try to
send it out today.

	--yliu

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-02-22  7:08       ` Yuanhan Liu
@ 2016-02-22  9:56         ` Thomas Monjalon
  2016-02-22 14:24           ` Yuanhan Liu
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2016-02-22  9:56 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

2016-02-22 15:08, Yuanhan Liu:
> On Fri, Feb 19, 2016 at 04:01:25PM +0100, Thomas Monjalon wrote:
> > 2016-01-29 12:57, Yuanhan Liu:
> > > This patch set adds the vhost-user live migration support.
> > > 
> > > The major task behind that is to log pages we touched during
> > > live migration, including used vring and desc buffer. So, this
> > > patch set is basically about adding vhost log support, and
> > > using it.
> > > 
> > > Another important thing is that you need to notify the switches
> > > about the VM location change after migration is done. The GUEST_ANNOUNCE
> > > feature is for that; it sends a GARP message after migration.
> > > For older kernels (<= v3.4) without GUEST_ANNOUNCE support,
> > > we construct and broadcast a RARP message, with the MAC address
> > > from the VHOST_USER_SEND_RARP payload.
> > > 
> > > Patchset
> > > ========
> > > - Patch 1 handles VHOST_USER_SET_LOG_BASE, which tells us where
> > >   the dirty memory bitmap is.
> > >     
> > > - Patch 2 introduces a vhost_log_write() helper function to log
> > >   pages we are gonna change.
> > > 
> > > - Patch 3 logs changes we made to used vring.
> > > 
> > > - Patch 4 logs changes we made to vring desc buffer.
> > > 
> > > - Patch 5 and 7 add some feature bits related to live migration.
> > > 
> > > - patch 6 does the RARP construction and broadcast job.
> > 
> > Patches 2 and 3 have been merged to avoid a compilation error.
> > Applied, thanks
> 
> Actually, there was an ongoing discussion about patch 6, the handling
> of the VHOST_USER_SEND_RARP request:
> 
>   http://dpdk.org/ml/archives/dev/2016-February/033539.html
> 
> Maybe you have seen that and I didn't make it clear; my bad. Since you
> have already applied it, I will make a standalone patch, and try to
> send it out today.

Yes, I wrongly understood that there was no problem.
The series would not have been applied if you had said that a new version
was needed or if you had set the patchwork status to "Changes Requested".
Sorry, we'll do better next time ;)

* Re: [PATCH v3 0/8] vhost-user live migration support
  2016-02-22  9:56         ` Thomas Monjalon
@ 2016-02-22 14:24           ` Yuanhan Liu
  0 siblings, 0 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-22 14:24 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Mon, Feb 22, 2016 at 10:56:42AM +0100, Thomas Monjalon wrote:
> > > Patches 2 and 3 have been merged to avoid a compilation error.
> > > Applied, thanks
> > 
> > Actually, there was an ongoing discussion about patch 6, the handling
> > of the VHOST_USER_SEND_RARP request:
> > 
> >   http://dpdk.org/ml/archives/dev/2016-February/033539.html
> > 
> > Maybe you have seen that and I didn't make it clear; my bad. Since you
> > have already applied it, I will make a standalone patch, and try to
> > send it out today.
> 
> Yes, I wrongly understood that there was no problem.
> The series would not have been applied if you had said that a new version
> was needed or if you had set the patchwork status to "Changes Requested".

Yeah, that's my bad. I meant to send it out soon, say, last Friday
night. However, I didn't make it. As for the patchwork stuff, I'm
still getting used to it. So, thanks for the tip. I will bear that
in mind next time.

> Sorry, we'll do better next time ;)

Yeah :)

A new patch will be sent out soon. Please consider applying it. Or
maybe we can hold it for a while, to see if anyone else has comments.
I will ping you to apply it next week if there are none.

Thanks.

	--yliu

* [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-19  7:03         ` Yuanhan Liu
  2016-02-19  8:55           ` Yuanhan Liu
@ 2016-02-22 14:36           ` Yuanhan Liu
  2016-02-24  8:15             ` Qiu, Michael
  2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 2 replies; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-22 14:36 UTC (permalink / raw)
  To: dev; +Cc: Michael S. Tsirkin, Victor Kaplansky

Broadcast the RARP packet by injecting it into the receiving mbuf array
at rte_vhost_dequeue_burst().

Commit 33226236a35e ("vhost: handle request to send RARP") iterates over
all host interfaces and broadcasts the packet on each of them. That did
notify the switches about the new location of the migrated VM; however,
the MAC learning table on the target host ends up wrong (at least in my
test with OVS):

    $ ovs-appctl fdb/show ovsbr0
     port  VLAN  MAC                Age
        1     0  b6:3c:72:71:cd:4d   10
    LOCAL     0  b6:3c:72:71:cd:4e   10
    LOCAL     0  52:54:00:12:34:68    9
        1     0  56:f6:64:2c:bc:c0    1

Here 52:54:00:12:34:68 is the MAC of the VM. As you can see from the
above, the port learned is "LOCAL", which is the "ovsbr0" port. That
is expected, since we did send the packet through the "ovsbr0" interface.

With the wrong MAC table, all packets destined for the VM go to "ovsbr0"
in the end and are lost, until the guest sends an ARP request (or reply)
that refreshes the MAC learning table.

Jianfeng then came up with a solution that I had thought of first but
NAKed myself, out of concern that it had potential issues [0]. The
solution is as the title states: broadcast the RARP packet by injecting
it into the receiving mbuf array at rte_vhost_dequeue_burst(). Having the
idea brought up again made me think twice; the concern now looks
unfounded to me. I did a rough verification, and it worked as expected.

[0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html

Another note: while preparing this version, I found that DPDK already
defines some ARP-related structures and macros, so use them here instead
of the ones from the standard header files.

Cc: Thibaut Collet <thibaut.collet@6wind.com>
Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 lib/librte_vhost/rte_virtio_net.h             |   5 +-
 lib/librte_vhost/vhost_rxtx.c                 |  80 +++++++++++++++-
 lib/librte_vhost/vhost_user/vhost-net-user.c  |   2 +-
 lib/librte_vhost/vhost_user/virtio-net-user.c | 128 ++++----------------------
 lib/librte_vhost/vhost_user/virtio-net-user.h |   2 +-
 5 files changed, 104 insertions(+), 113 deletions(-)
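
As a reading aid for make_rarp_packet() in the diff below, the on-wire
layout it fills in can be written out byte by byte. This is a
standalone sketch, not part of the patch: build_rarp_frame(),
put_be16() and RARP_FRAME_LEN are hypothetical names, and a plain byte
buffer is used in place of an mbuf. The constants are the standard
ones (EtherType 0x8035 for RARP, ARP hardware type 1 for Ethernet,
opcode 3 for a reverse request).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RARP_FRAME_LEN 64	/* same size the patch uses (RARP_PKT_SIZE) */

/* Store a 16-bit value in network (big-endian) byte order. */
static void
put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/*
 * Fill "buf" with a broadcast RARP request announcing "mac".
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int
build_rarp_frame(uint8_t *buf, size_t buf_len, const uint8_t mac[6])
{
	assert(buf != NULL && mac != NULL);
	if (buf_len < RARP_FRAME_LEN)
		return -1;

	memset(buf, 0, RARP_FRAME_LEN);	/* zero IPs and padding */

	/* Ethernet header: 14 bytes. */
	memset(buf, 0xff, 6);		/* destination: broadcast */
	memcpy(buf + 6, mac, 6);	/* source: the VM's MAC */
	put_be16(buf + 12, 0x8035);	/* EtherType: RARP */

	/* RARP body: 28 bytes, starting at offset 14. */
	put_be16(buf + 14, 1);		/* hardware type: Ethernet */
	put_be16(buf + 16, 0x0800);	/* protocol type: IPv4 */
	buf[18] = 6;			/* hardware address length */
	buf[19] = 4;			/* protocol address length */
	put_be16(buf + 20, 3);		/* opcode: reverse request */
	memcpy(buf + 22, mac, 6);	/* sender hardware address */
	/* bytes 28..31: sender IP, left as zero */
	memcpy(buf + 32, mac, 6);	/* target hardware address */
	/* bytes 38..41: target IP, left as zero */

	return 0;
}
```

The headers take 42 bytes; the rest of the 64-byte frame is zero
padding.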

diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
index 4a2303a..7d1fde2 100644
--- a/lib/librte_vhost/rte_virtio_net.h
+++ b/lib/librte_vhost/rte_virtio_net.h
@@ -49,6 +49,7 @@
 
 #include <rte_memory.h>
 #include <rte_mempool.h>
+#include <rte_ether.h>
 
 struct rte_mbuf;
 
@@ -133,7 +134,9 @@ struct virtio_net {
 	void			*priv;		/**< private context */
 	uint64_t		log_size;	/**< Size of log area */
 	uint64_t		log_base;	/**< Where dirty pages are logged */
-	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
+	struct ether_addr	mac;		/**< MAC address */
+	rte_atomic16_t		broadcast_rarp;	/**< A flag to tell if we need to broadcast a RARP packet */
+	uint64_t		reserved[61];	/**< Reserve some spaces for future extension. */
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
 } __rte_cache_aligned;
 
diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
index 12ce0cc..9d23eb1 100644
--- a/lib/librte_vhost/vhost_rxtx.c
+++ b/lib/librte_vhost/vhost_rxtx.c
@@ -43,6 +43,7 @@
 #include <rte_tcp.h>
 #include <rte_udp.h>
 #include <rte_sctp.h>
+#include <rte_arp.h>
 
 #include "vhost-net.h"
 
@@ -761,11 +762,50 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
 	}
 }
 
+#define RARP_PKT_SIZE	64
+
+static int
+make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
+{
+	struct ether_hdr *eth_hdr;
+	struct arp_hdr  *rarp;
+
+	if (rarp_mbuf->buf_len < RARP_PKT_SIZE) {
+		RTE_LOG(WARNING, VHOST_DATA,
+			"failed to make RARP; mbuf size too small %u (< %d)\n",
+			rarp_mbuf->buf_len, RARP_PKT_SIZE);
+		return -1;
+	}
+
+	/* Ethernet header. */
+	eth_hdr = rte_pktmbuf_mtod_offset(rarp_mbuf, struct ether_hdr *, 0);
+	memset(eth_hdr->d_addr.addr_bytes, 0xff, ETHER_ADDR_LEN);
+	ether_addr_copy(mac, &eth_hdr->s_addr);
+	eth_hdr->ether_type = htons(ETHER_TYPE_RARP);
+
+	/* RARP header. */
+	rarp = (struct arp_hdr *)(eth_hdr + 1);
+	rarp->arp_hrd = htons(ARP_HRD_ETHER);
+	rarp->arp_pro = htons(ETHER_TYPE_IPv4);
+	rarp->arp_hln = ETHER_ADDR_LEN;
+	rarp->arp_pln = 4;
+	rarp->arp_op  = htons(ARP_OP_REVREQUEST);
+
+	ether_addr_copy(mac, &rarp->arp_data.arp_sha);
+	ether_addr_copy(mac, &rarp->arp_data.arp_tha);
+	memset(&rarp->arp_data.arp_sip, 0x00, 4);
+	memset(&rarp->arp_data.arp_tip, 0x00, 4);
+
+	rarp_mbuf->pkt_len  = rarp_mbuf->data_len = RARP_PKT_SIZE;
+
+	return 0;
+}
+
 uint16_t
 rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
 {
-	struct rte_mbuf *m, *prev;
+	struct rte_mbuf *m, *prev, *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
 	struct vring_desc *desc;
 	uint64_t vb_addr = 0;
@@ -788,11 +828,34 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	if (unlikely(vq->enabled == 0))
 		return 0;
 
+	/*
+	 * Construct a RARP broadcast packet, and inject it into the "pkts"
+	 * array, to make it look like the guest actually sent such a packet.
+	 *
+	 * Check user_send_rarp() for more information.
+	 */
+	if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
+					 &dev->broadcast_rarp.cnt, 1, 0))) {
+		rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
+		if (rarp_mbuf == NULL) {
+			RTE_LOG(ERR, VHOST_DATA,
+				"Failed to allocate memory for mbuf.\n");
+			return 0;
+		}
+
+		if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
+			rte_pktmbuf_free(rarp_mbuf);
+			rarp_mbuf = NULL;
+		} else {
+			count -= 1;
+		}
+	}
+
 	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
 
 	/* If there are no available buffers then return. */
 	if (vq->last_used_idx == avail_idx)
-		return 0;
+		goto out;
 
 	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
 		dev->device_fh);
@@ -983,8 +1046,21 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
 	vq->used->idx += entry_success;
 	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
 			sizeof(vq->used->idx));
+
 	/* Kick guest if required. */
 	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
 		eventfd_write(vq->callfd, (eventfd_t)1);
+
+out:
+	if (unlikely(rarp_mbuf != NULL)) {
+		/*
+		 * Inject it to the head of "pkts" array, so that switch's mac
+		 * learning table will get updated first.
+		 */
+		memmove(&pkts[1], pkts, entry_success * sizeof(m));
+		pkts[0] = rarp_mbuf;
+		entry_success += 1;
+	}
+
 	return entry_success;
 }
diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
index de7eecb..df2bd64 100644
--- a/lib/librte_vhost/vhost_user/vhost-net-user.c
+++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
@@ -437,7 +437,7 @@ vserver_message_handler(int connfd, void *dat, int *remove)
 		user_set_vring_enable(ctx, &msg.payload.state);
 		break;
 	case VHOST_USER_SEND_RARP:
-		user_send_rarp(&msg);
+		user_send_rarp(ctx, &msg);
 		break;
 
 	default:
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
index 68b24f4..65b5652 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.c
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
@@ -39,12 +39,6 @@
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
-#include <sys/ioctl.h>
-#include <sys/socket.h>
-#include <net/ethernet.h>
-#include <netinet/in.h>
-#include <netinet/if_ether.h>
-#include <linux/if_packet.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
@@ -415,120 +409,38 @@ user_set_log_base(struct vhost_device_ctx ctx,
 	return 0;
 }
 
-#define RARP_BUF_SIZE	64
-
-static void
-make_rarp_packet(uint8_t *buf, uint8_t *mac)
-{
-	struct ether_header *eth_hdr;
-	struct ether_arp *rarp;
-
-	/* Ethernet header. */
-	eth_hdr = (struct ether_header *)buf;
-	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
-	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
-	eth_hdr->ether_type = htons(ETH_P_RARP);
-
-	/* RARP header. */
-	rarp = (struct ether_arp *)(eth_hdr + 1);
-	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
-	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
-	rarp->ea_hdr.ar_hln = ETH_ALEN;
-	rarp->ea_hdr.ar_pln = 4;
-	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
-
-	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
-	memset(&rarp->arp_spa, 0x00, 4);
-	memcpy(&rarp->arp_tha, mac, 6);
-	memset(&rarp->arp_tpa, 0x00, 4);
-}
-
-
-static void
-send_rarp(const char *ifname, uint8_t *rarp)
-{
-	int fd;
-	struct ifreq ifr;
-	struct sockaddr_ll addr;
-
-	fd = socket(AF_PACKET, SOCK_RAW, 0);
-	if (fd < 0) {
-		perror("socket failed");
-		return;
-	}
-
-	memset(&ifr, 0, sizeof(struct ifreq));
-	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
-	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
-		perror("failed to get interface index");
-		close(fd);
-		return;
-	}
-
-	addr.sll_ifindex = ifr.ifr_ifindex;
-	addr.sll_halen   = ETH_ALEN;
-
-	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
-		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
-		perror("send rarp packet failed");
-	}
-}
-
-
 /*
- * Broadcast a RARP message to all interfaces, to update
- * switch's mac table
+ * A RARP packet is constructed and broadcast to notify switches about
+ * the new location of the migrated VM, so that packets from outside will
+ * not be lost after migration.
+ *
+ * However, we don't actually "send" a RARP packet here; instead, we set
+ * a flag 'broadcast_rarp' to let rte_vhost_dequeue_burst() inject it.
  */
 int
-user_send_rarp(struct VhostUserMsg *msg)
+user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *msg)
 {
+	struct virtio_net *dev;
 	uint8_t *mac = (uint8_t *)&msg->payload.u64;
-	uint8_t rarp[RARP_BUF_SIZE];
-	struct ifconf ifc = {0, };
-	struct ifreq *ifr;
-	int nr = 16;
-	int fd;
-	uint32_t i;
+
+	dev = get_device(ctx);
+	if (!dev)
+		return -1;
 
 	RTE_LOG(DEBUG, VHOST_CONFIG,
 		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
 		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
-
-	make_rarp_packet(rarp, mac);
+	memcpy(dev->mac.addr_bytes, mac, 6);
 
 	/*
-	 * Get all interfaces
+	 * Set the flag to inject a RARP broadcast packet at
+	 * rte_vhost_dequeue_burst().
+	 *
+	 * rte_smp_wmb() is for making sure the mac is copied
+	 * before the flag is set.
 	 */
-	fd = socket(AF_INET, SOCK_DGRAM, 0);
-	if (fd < 0) {
-		perror("failed to create AF_INET socket");
-		return -1;
-	}
-
-again:
-	ifc.ifc_len = sizeof(*ifr) * nr;
-	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
-
-	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
-		perror("failed at SIOCGIFCONF");
-		close(fd);
-		return -1;
-	}
-
-	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
-		/*
-		 * current ifc_buf is not big enough to hold
-		 * all interfaces; double it and try again.
-		 */
-		nr *= 2;
-		goto again;
-	}
-
-	ifr = (struct ifreq *)ifc.ifc_buf;
-	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
-		send_rarp(ifr[i].ifr_name, rarp);
-
-	close(fd);
+	rte_smp_wmb();
+	rte_atomic16_set(&dev->broadcast_rarp, 1);
 
 	return 0;
 }
diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
index 559bb46..cefec16 100644
--- a/lib/librte_vhost/vhost_user/virtio-net-user.h
+++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
@@ -54,7 +54,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
 void user_set_protocol_features(struct vhost_device_ctx ctx,
 				uint64_t protocol_features);
 int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
-int user_send_rarp(struct VhostUserMsg *);
+int user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *);
 
 int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
 
-- 
1.9.0
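
For reference, the injection step at the end of
rte_vhost_dequeue_burst() in the diff above boils down to shifting the
dequeued entries up by one slot and placing the extra packet first. A
minimal sketch, assuming a plain array of pointers rather than mbufs
(inject_at_head() is a hypothetical name, not a DPDK function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Put "extra" in slot 0 of "pkts", shifting the "n" entries already
 * there up by one. The caller must have left room for n + 1 entries.
 * Returns the new entry count.
 */
static uint16_t
inject_at_head(void **pkts, uint16_t n, void *extra)
{
	assert(pkts != NULL);
	/* memmove, not memcpy: source and destination overlap. */
	memmove(&pkts[1], &pkts[0], n * sizeof(pkts[0]));
	pkts[0] = extra;
	return (uint16_t)(n + 1);
}
```

The array needs one spare slot for this to be safe, which is why the
patch decrements "count" before dequeuing when a RARP packet is
pending.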

* Re: [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-22 14:36           ` [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
@ 2016-02-24  8:15             ` Qiu, Michael
  2016-02-24  8:28               ` Yuanhan Liu
  2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 1 reply; 98+ messages in thread
From: Qiu, Michael @ 2016-02-24  8:15 UTC (permalink / raw)
  To: Yuanhan Liu, dev; +Cc: Victor Kaplansky, Michael S. Tsirkin

On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
> Broadcast the RARP packet by injecting it into the receiving mbuf array
> at rte_vhost_dequeue_burst().
>
> Commit 33226236a35e ("vhost: handle request to send RARP") iterates over
> all host interfaces and broadcasts the packet on each of them. That did
> notify the switches about the new location of the migrated VM; however,
> the MAC learning table on the target host ends up wrong (at least in my
> test with OVS):
>
>     $ ovs-appctl fdb/show ovsbr0
>      port  VLAN  MAC                Age
>         1     0  b6:3c:72:71:cd:4d   10
>     LOCAL     0  b6:3c:72:71:cd:4e   10
>     LOCAL     0  52:54:00:12:34:68    9
>         1     0  56:f6:64:2c:bc:c0    1
>
> Here 52:54:00:12:34:68 is the MAC of the VM. As you can see from the
> above, the port learned is "LOCAL", which is the "ovsbr0" port. That
> is expected, since we did send the packet through the "ovsbr0" interface.
>
> With the wrong MAC table, all packets destined for the VM go to "ovsbr0"
> in the end and are lost, until the guest sends an ARP request (or reply)
> that refreshes the MAC learning table.
>
> Jianfeng then came up with a solution I have thought of firstly but NAKed

Is it suitable to mention someone in the commit log?

Thanks,
Michael
> by myself, out of concern that it had potential issues [0]. The solution
> is as the title states: broadcast the RARP packet by injecting it into
> the receiving mbuf array at rte_vhost_dequeue_burst(). Having the idea
> brought up again made me think twice; the concern now looks unfounded to
> me. I did a rough verification, and it worked as expected.
>
> [0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
>
> Another note: while preparing this version, I found that DPDK already
> defines some ARP-related structures and macros, so use them here instead
> of the ones from the standard header files.
>
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  lib/librte_vhost/rte_virtio_net.h             |   5 +-
>  lib/librte_vhost/vhost_rxtx.c                 |  80 +++++++++++++++-
>  lib/librte_vhost/vhost_user/vhost-net-user.c  |   2 +-
>  lib/librte_vhost/vhost_user/virtio-net-user.c | 128 ++++----------------------
>  lib/librte_vhost/vhost_user/virtio-net-user.h |   2 +-
>  5 files changed, 104 insertions(+), 113 deletions(-)
>
> diff --git a/lib/librte_vhost/rte_virtio_net.h b/lib/librte_vhost/rte_virtio_net.h
> index 4a2303a..7d1fde2 100644
> --- a/lib/librte_vhost/rte_virtio_net.h
> +++ b/lib/librte_vhost/rte_virtio_net.h
> @@ -49,6 +49,7 @@
>  
>  #include <rte_memory.h>
>  #include <rte_mempool.h>
> +#include <rte_ether.h>
>  
>  struct rte_mbuf;
>  
> @@ -133,7 +134,9 @@ struct virtio_net {
>  	void			*priv;		/**< private context */
>  	uint64_t		log_size;	/**< Size of log area */
>  	uint64_t		log_base;	/**< Where dirty pages are logged */
> -	uint64_t		reserved[62];	/**< Reserve some spaces for future extension. */
> +	struct ether_addr	mac;		/**< MAC address */
> +	rte_atomic16_t		broadcast_rarp;	/**< A flag to tell if we need broadcast rarp packet */
> +	uint64_t		reserved[61];	/**< Reserve some spaces for future extension. */
>  	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];	/**< Contains all virtqueue information. */
>  } __rte_cache_aligned;
>  
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 12ce0cc..9d23eb1 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -43,6 +43,7 @@
>  #include <rte_tcp.h>
>  #include <rte_udp.h>
>  #include <rte_sctp.h>
> +#include <rte_arp.h>
>  
>  #include "vhost-net.h"
>  
> @@ -761,11 +762,50 @@ vhost_dequeue_offload(struct virtio_net_hdr *hdr, struct rte_mbuf *m)
>  	}
>  }
>  
> +#define RARP_PKT_SIZE	64
> +
> +static int
> +make_rarp_packet(struct rte_mbuf *rarp_mbuf, const struct ether_addr *mac)
> +{
> +	struct ether_hdr *eth_hdr;
> +	struct arp_hdr  *rarp;
> +
> +	if (rarp_mbuf->buf_len < RARP_PKT_SIZE) {
> +		RTE_LOG(WARNING, VHOST_DATA,
> +			"failed to make RARP; mbuf size too small %u (< %d)\n",
> +			rarp_mbuf->buf_len, RARP_PKT_SIZE);
> +		return -1;
> +	}
> +
> +	/* Ethernet header. */
> +	eth_hdr = rte_pktmbuf_mtod_offset(rarp_mbuf, struct ether_hdr *, 0);
> +	memset(eth_hdr->d_addr.addr_bytes, 0xff, ETHER_ADDR_LEN);
> +	ether_addr_copy(mac, &eth_hdr->s_addr);
> +	eth_hdr->ether_type = htons(ETHER_TYPE_RARP);
> +
> +	/* RARP header. */
> +	rarp = (struct arp_hdr *)(eth_hdr + 1);
> +	rarp->arp_hrd = htons(ARP_HRD_ETHER);
> +	rarp->arp_pro = htons(ETHER_TYPE_IPv4);
> +	rarp->arp_hln = ETHER_ADDR_LEN;
> +	rarp->arp_pln = 4;
> +	rarp->arp_op  = htons(ARP_OP_REVREQUEST);
> +
> +	ether_addr_copy(mac, &rarp->arp_data.arp_sha);
> +	ether_addr_copy(mac, &rarp->arp_data.arp_tha);
> +	memset(&rarp->arp_data.arp_sip, 0x00, 4);
> +	memset(&rarp->arp_data.arp_tip, 0x00, 4);
> +
> +	rarp_mbuf->pkt_len  = rarp_mbuf->data_len = RARP_PKT_SIZE;
> +
> +	return 0;
> +}
> +
>  uint16_t
>  rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
>  {
> -	struct rte_mbuf *m, *prev;
> +	struct rte_mbuf *m, *prev, *rarp_mbuf = NULL;
>  	struct vhost_virtqueue *vq;
>  	struct vring_desc *desc;
>  	uint64_t vb_addr = 0;
> @@ -788,11 +828,34 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	if (unlikely(vq->enabled == 0))
>  		return 0;
>  
> +	/*
> +	 * Construct a RARP broadcast packet, and inject it into the "pkts"
> +	 * array, to make it look like the guest actually sent such a packet.
> +	 *
> +	 * Check user_send_rarp() for more information.
> +	 */
> +	if (unlikely(rte_atomic16_cmpset((volatile uint16_t *)
> +					 &dev->broadcast_rarp.cnt, 1, 0))) {
> +		rarp_mbuf = rte_pktmbuf_alloc(mbuf_pool);
> +		if (rarp_mbuf == NULL) {
> +			RTE_LOG(ERR, VHOST_DATA,
> +				"Failed to allocate memory for mbuf.\n");
> +			return 0;
> +		}
> +
> +		if (make_rarp_packet(rarp_mbuf, &dev->mac)) {
> +			rte_pktmbuf_free(rarp_mbuf);
> +			rarp_mbuf = NULL;
> +		} else {
> +			count -= 1;
> +		}
> +	}
> +
>  	avail_idx =  *((volatile uint16_t *)&vq->avail->idx);
>  
>  	/* If there are no available buffers then return. */
>  	if (vq->last_used_idx == avail_idx)
> -		return 0;
> +		goto out;
>  
>  	LOG_DEBUG(VHOST_DATA, "%s (%"PRIu64")\n", __func__,
>  		dev->device_fh);
> @@ -983,8 +1046,21 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, uint16_t queue_id,
>  	vq->used->idx += entry_success;
>  	vhost_log_used_vring(dev, vq, offsetof(struct vring_used, idx),
>  			sizeof(vq->used->idx));
> +
>  	/* Kick guest if required. */
>  	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT))
>  		eventfd_write(vq->callfd, (eventfd_t)1);
> +
> +out:
> +	if (unlikely(rarp_mbuf != NULL)) {
> +		/*
> +		 * Inject it to the head of "pkts" array, so that switch's mac
> +		 * learning table will get updated first.
> +		 */
> +		memmove(&pkts[1], pkts, entry_success * sizeof(m));
> +		pkts[0] = rarp_mbuf;
> +		entry_success += 1;
> +	}
> +
>  	return entry_success;
>  }
> diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c b/lib/librte_vhost/vhost_user/vhost-net-user.c
> index de7eecb..df2bd64 100644
> --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> @@ -437,7 +437,7 @@ vserver_message_handler(int connfd, void *dat, int *remove)
>  		user_set_vring_enable(ctx, &msg.payload.state);
>  		break;
>  	case VHOST_USER_SEND_RARP:
> -		user_send_rarp(&msg);
> +		user_send_rarp(ctx, &msg);
>  		break;
>  
>  	default:
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.c b/lib/librte_vhost/vhost_user/virtio-net-user.c
> index 68b24f4..65b5652 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.c
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.c
> @@ -39,12 +39,6 @@
>  #include <sys/mman.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> -#include <sys/ioctl.h>
> -#include <sys/socket.h>
> -#include <net/ethernet.h>
> -#include <netinet/in.h>
> -#include <netinet/if_ether.h>
> -#include <linux/if_packet.h>
>  
>  #include <rte_common.h>
>  #include <rte_log.h>
> @@ -415,120 +409,38 @@ user_set_log_base(struct vhost_device_ctx ctx,
>  	return 0;
>  }
>  
> -#define RARP_BUF_SIZE	64
> -
> -static void
> -make_rarp_packet(uint8_t *buf, uint8_t *mac)
> -{
> -	struct ether_header *eth_hdr;
> -	struct ether_arp *rarp;
> -
> -	/* Ethernet header. */
> -	eth_hdr = (struct ether_header *)buf;
> -	memset(&eth_hdr->ether_dhost, 0xff, ETH_ALEN);
> -	memcpy(&eth_hdr->ether_shost, mac,  ETH_ALEN);
> -	eth_hdr->ether_type = htons(ETH_P_RARP);
> -
> -	/* RARP header. */
> -	rarp = (struct ether_arp *)(eth_hdr + 1);
> -	rarp->ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
> -	rarp->ea_hdr.ar_pro = htons(ETHERTYPE_IP);
> -	rarp->ea_hdr.ar_hln = ETH_ALEN;
> -	rarp->ea_hdr.ar_pln = 4;
> -	rarp->ea_hdr.ar_op  = htons(ARPOP_RREQUEST);
> -
> -	memcpy(&rarp->arp_sha, mac, ETH_ALEN);
> -	memset(&rarp->arp_spa, 0x00, 4);
> -	memcpy(&rarp->arp_tha, mac, 6);
> -	memset(&rarp->arp_tpa, 0x00, 4);
> -}
> -
> -
> -static void
> -send_rarp(const char *ifname, uint8_t *rarp)
> -{
> -	int fd;
> -	struct ifreq ifr;
> -	struct sockaddr_ll addr;
> -
> -	fd = socket(AF_PACKET, SOCK_RAW, 0);
> -	if (fd < 0) {
> -		perror("socket failed");
> -		return;
> -	}
> -
> -	memset(&ifr, 0, sizeof(struct ifreq));
> -	strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
> -	if (ioctl(fd, SIOCGIFINDEX, &ifr) < 0) {
> -		perror("failed to get interface index");
> -		close(fd);
> -		return;
> -	}
> -
> -	addr.sll_ifindex = ifr.ifr_ifindex;
> -	addr.sll_halen   = ETH_ALEN;
> -
> -	if (sendto(fd, rarp, RARP_BUF_SIZE, 0,
> -		   (const struct sockaddr*)&addr, sizeof(addr)) < 0) {
> -		perror("send rarp packet failed");
> -	}
> -}
> -
> -
>  /*
> - * Broadcast a RARP message to all interfaces, to update
> - * switch's mac table
> + * An rarp packet is constructed and broadcasted to notify switches about
> + * the new location of the migrated VM, so that packets from outside will
> + * not be lost after migration.
> + *
> + * However, we don't actually "send" a rarp packet here, instead, we set
> + * a flag 'broadcast_rarp' to let rte_vhost_dequeue_burst() inject it.
>   */
>  int
> -user_send_rarp(struct VhostUserMsg *msg)
> +user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *msg)
>  {
> +	struct virtio_net *dev;
>  	uint8_t *mac = (uint8_t *)&msg->payload.u64;
> -	uint8_t rarp[RARP_BUF_SIZE];
> -	struct ifconf ifc = {0, };
> -	struct ifreq *ifr;
> -	int nr = 16;
> -	int fd;
> -	uint32_t i;
> +
> +	dev = get_device(ctx);
> +	if (!dev)
> +		return -1;
>  
>  	RTE_LOG(DEBUG, VHOST_CONFIG,
>  		":: mac: %02x:%02x:%02x:%02x:%02x:%02x\n",
>  		mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
> -
> -	make_rarp_packet(rarp, mac);
> +	memcpy(dev->mac.addr_bytes, mac, 6);
>  
>  	/*
> -	 * Get all interfaces
> +	 * Set the flag to inject a RARP broadcast packet at
> +	 * rte_vhost_dequeue_burst().
> +	 *
> +	 * rte_smp_wmb() is for making sure the mac is copied
> +	 * before the flag is set.
>  	 */
> -	fd = socket(AF_INET, SOCK_DGRAM, 0);
> -	if (fd < 0) {
> -		perror("failed to create AF_INET socket");
> -		return -1;
> -	}
> -
> -again:
> -	ifc.ifc_len = sizeof(*ifr) * nr;
> -	ifc.ifc_buf = realloc(ifc.ifc_buf, ifc.ifc_len);
> -
> -	if (ioctl(fd, SIOCGIFCONF, &ifc) < 0) {
> -		perror("failed at SIOCGIFCONF");
> -		close(fd);
> -		return -1;
> -	}
> -
> -	if (ifc.ifc_len == (int)sizeof(struct ifreq) * nr) {
> -		/*
> -		 * current ifc_buf is not big enough to hold
> -		 * all interfaces; double it and try again.
> -		 */
> -		nr *= 2;
> -		goto again;
> -	}
> -
> -	ifr = (struct ifreq *)ifc.ifc_buf;
> -	for (i = 0; i < ifc.ifc_len / sizeof(struct ifreq); i++)
> -		send_rarp(ifr[i].ifr_name, rarp);
> -
> -	close(fd);
> +	rte_smp_wmb();
> +	rte_atomic16_set(&dev->broadcast_rarp, 1);
>  
>  	return 0;
>  }
> diff --git a/lib/librte_vhost/vhost_user/virtio-net-user.h b/lib/librte_vhost/vhost_user/virtio-net-user.h
> index 559bb46..cefec16 100644
> --- a/lib/librte_vhost/vhost_user/virtio-net-user.h
> +++ b/lib/librte_vhost/vhost_user/virtio-net-user.h
> @@ -54,7 +54,7 @@ void user_set_vring_kick(struct vhost_device_ctx, struct VhostUserMsg *);
>  void user_set_protocol_features(struct vhost_device_ctx ctx,
>  				uint64_t protocol_features);
>  int user_set_log_base(struct vhost_device_ctx ctx, struct VhostUserMsg *);
> -int user_send_rarp(struct VhostUserMsg *);
> +int user_send_rarp(struct vhost_device_ctx ctx, struct VhostUserMsg *);
>  
>  int user_get_vring_base(struct vhost_device_ctx, struct vhost_vring_state *);
>  
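
The handoff in user_send_rarp() above (copy the MAC, issue a write barrier,
then set 'broadcast_rarp' so that rte_vhost_dequeue_burst() can pick it up)
can be sketched in isolation roughly as follows. This is only an illustration,
not the librte_vhost code: it uses C11 atomics in place of rte_smp_wmb() and
rte_atomic16_set(), and 'struct demo_dev', demo_request_rarp() and
demo_take_rarp_request() are hypothetical names. How the dequeue side consumes
the flag is not shown in this diff, so the compare-and-swap below is an
assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for per-device state; not the real struct virtio_net. */
struct demo_dev {
	uint8_t mac[6];
	atomic_int broadcast_rarp;
};

/* Control path: record the MAC first, then publish the request flag. */
static void
demo_request_rarp(struct demo_dev *dev, const uint8_t mac[6])
{
	memcpy(dev->mac, mac, 6);
	/*
	 * The release store plays the role of rte_smp_wmb() followed by
	 * the flag store: the MAC is visible before the flag is.
	 */
	atomic_store_explicit(&dev->broadcast_rarp, 1, memory_order_release);
}

/*
 * Data path: returns true at most once per request and copies the MAC out.
 * The compare-and-swap clears the flag so only one caller injects the packet.
 */
static bool
demo_take_rarp_request(struct demo_dev *dev, uint8_t mac_out[6])
{
	int expected = 1;

	if (!atomic_compare_exchange_strong_explicit(&dev->broadcast_rarp,
			&expected, 0,
			memory_order_acquire, memory_order_relaxed))
		return false;

	memcpy(mac_out, dev->mac, 6);
	return true;
}
```

A polling loop would call demo_take_rarp_request() once per burst and, when it
returns true, build the RARP packet and place it in the mbuf array it returns.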



* Re: [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-24  8:15             ` Qiu, Michael
@ 2016-02-24  8:28               ` Yuanhan Liu
  2016-02-25  7:55                 ` Qiu, Michael
  0 siblings, 1 reply; 98+ messages in thread
From: Yuanhan Liu @ 2016-02-24  8:28 UTC (permalink / raw)
  To: Qiu, Michael; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On Wed, Feb 24, 2016 at 08:15:36AM +0000, Qiu, Michael wrote:
> On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
> > Broadcast RARP packet by injecting it to receiving mbuf array at
> > rte_vhost_dequeue_burst().
> >
> > Commit 33226236a35e ("vhost: handle request to send RARP") iterates
> > all host interfaces and then broadcast it by all of them.  It did
> > notify the switches about the new location of the migrated VM, however,
> > the mac learning table in the target host is wrong (at least in my
> > test with OVS):
> >
> >     $ ovs-appctl fdb/show ovsbr0
> >      port  VLAN  MAC                Age
> >         1     0  b6:3c:72:71:cd:4d   10
> >     LOCAL     0  b6:3c:72:71:cd:4e   10
> >     LOCAL     0  52:54:00:12:34:68    9
> >         1     0  56:f6:64:2c:bc:c0    1
> >
> > Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
> > above, the port learned is "LOCAL", which is the "ovsbr0" port. That
> > is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
> >
> > The wrong mac table leads all packets destined for the VM to go to "ovsbr0"
> > in the end, so all packets are lost until the guest sends an ARP request
> > (or reply) to refresh the mac learning table.
> >
> > Jianfeng then came up with a solution I have thought of firstly but NAKed
> 
> Is it suitable to mention someone in the commit log?

Why not? It's not a secret name or anything like that, after all :)

On the other hand, it's a way of thanking Jianfeng for his contribution to
this patch.

	--yliu


* Re: [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-24  8:28               ` Yuanhan Liu
@ 2016-02-25  7:55                 ` Qiu, Michael
  0 siblings, 0 replies; 98+ messages in thread
From: Qiu, Michael @ 2016-02-25  7:55 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: dev, Victor Kaplansky, Michael S. Tsirkin

On 2/24/2016 4:27 PM, Yuanhan Liu wrote:
> On Wed, Feb 24, 2016 at 08:15:36AM +0000, Qiu, Michael wrote:
>> On 2/22/2016 10:35 PM, Yuanhan Liu wrote:
>>> Broadcast RARP packet by injecting it to receiving mbuf array at
>>> rte_vhost_dequeue_burst().
>>>
>>> Commit 33226236a35e ("vhost: handle request to send RARP") iterates
>>> all host interfaces and then broadcast it by all of them.  It did
>>> notify the switches about the new location of the migrated VM, however,
>>> the mac learning table in the target host is wrong (at least in my
>>> test with OVS):
>>>
>>>     $ ovs-appctl fdb/show ovsbr0
>>>      port  VLAN  MAC                Age
>>>         1     0  b6:3c:72:71:cd:4d   10
>>>     LOCAL     0  b6:3c:72:71:cd:4e   10
>>>     LOCAL     0  52:54:00:12:34:68    9
>>>         1     0  56:f6:64:2c:bc:c0    1
>>>
>>> Where 52:54:00:12:34:68 is the mac of the VM. As you can see from the
>>> above, the port learned is "LOCAL", which is the "ovsbr0" port. That
>>> is reasonable, since we indeed send the pkt by the "ovsbr0" interface.
>>>
>>> The wrong mac table leads all packets destined for the VM to go to "ovsbr0"
>>> in the end, so all packets are lost until the guest sends an ARP request
>>> (or reply) to refresh the mac learning table.
>>>
>>> Jianfeng then came up with a solution I have thought of firstly but NAKed
>> Is it suitable to mention someone in the commit log?
> Why not? It's not a secret name or anything like that, after all :)
>
> On the other hand, it's a way of thanking Jianfeng for his contribution to
> this patch.

OK, I've never seen this done before, forgive me.

Thanks,
Michael
>
> 	--yliu
>



* Re: [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array
  2016-02-22 14:36           ` [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
  2016-02-24  8:15             ` Qiu, Michael
@ 2016-02-29 15:56             ` Thomas Monjalon
  1 sibling, 0 replies; 98+ messages in thread
From: Thomas Monjalon @ 2016-02-29 15:56 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Michael S. Tsirkin, dev, Victor Kaplansky

2016-02-22 22:36, Yuanhan Liu:
> The wrong mac table leads all packets destined for the VM to go to "ovsbr0"
> in the end, so all packets are lost until the guest sends an ARP request
> (or reply) to refresh the mac learning table.
> 
> Jianfeng then came up with a solution that I had thought of first but NAKed
> myself, out of concern that it had potential issues [0]. The solution is as
> the title states: broadcast the RARP packet by injecting it into the receiving
> mbuf arrays at rte_vhost_dequeue_burst(). The revival of that idea made me
> think twice; the concern then looked false to me. I did a rough
> verification, and it worked as expected.
> 
> [0]: http://dpdk.org/ml/archives/dev/2016-February/033527.html
> 
> Another note is that while preparing this version, I found that DPDK has
> some ARP-related structures and macros defined. So, use them instead of
> the ones from the standard header files here.
> 
> Cc: Thibaut Collet <thibaut.collet@6wind.com>
> Suggested-by: Jianfeng Tan <jianfeng.tan@intel.com>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Applied, thanks
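
For readers following along, the RARP frame that gets injected can be sketched
as below. This is a minimal, self-contained illustration that mirrors the field
values of the removed make_rarp_packet(); it writes the bytes by hand rather
than using the DPDK ARP structures the commit message refers to, and the names
demo_make_rarp(), demo_put_be16() and DEMO_RARP_LEN are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAC_LEN	6
#define DEMO_RARP_LEN	42	/* 14-byte Ethernet header + 28-byte ARP body */

/* Store a 16-bit value in network (big-endian) byte order. */
static void
demo_put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/*
 * Build a broadcast RARP request announcing 'mac'.
 * Returns the frame length, or 0 if the buffer is too small.
 */
static size_t
demo_make_rarp(uint8_t *buf, size_t buf_len, const uint8_t mac[DEMO_MAC_LEN])
{
	uint8_t *arp = buf + 14;

	if (buf_len < DEMO_RARP_LEN)
		return 0;
	memset(buf, 0, DEMO_RARP_LEN);

	/* Ethernet header: broadcast destination, VM MAC as source. */
	memset(buf, 0xff, DEMO_MAC_LEN);
	memcpy(buf + 6, mac, DEMO_MAC_LEN);
	demo_put_be16(buf + 12, 0x8035);	/* EtherType: RARP */

	/* ARP body. */
	demo_put_be16(arp, 1);			/* hardware type: Ethernet */
	demo_put_be16(arp + 2, 0x0800);		/* protocol type: IPv4 */
	arp[4] = DEMO_MAC_LEN;			/* hardware address length */
	arp[5] = 4;				/* protocol address length */
	demo_put_be16(arp + 6, 3);		/* opcode: RARP request */
	memcpy(arp + 8, mac, DEMO_MAC_LEN);	/* sender hardware address */
	/* sender protocol address (arp + 14) stays 0.0.0.0 */
	memcpy(arp + 18, mac, DEMO_MAC_LEN);	/* target hardware address */
	/* target protocol address (arp + 24) stays 0.0.0.0 */

	return DEMO_RARP_LEN;
}
```

Because the source MAC of the frame is the VM's MAC and the frame enters the
switch through the vhost-user port, the switch learns the MAC on that port
rather than on "LOCAL".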


* Re: [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-01-29  4:58     ` [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
@ 2016-03-11 12:39       ` Olivier MATZ
  2016-03-11 13:16         ` Thomas Monjalon
  0 siblings, 1 reply; 98+ messages in thread
From: Olivier MATZ @ 2016-03-11 12:39 UTC (permalink / raw)
  To: Yuanhan Liu, dev

Hi Yuanhan,

On 01/29/2016 05:58 AM, Yuanhan Liu wrote:
> It's actually a feature already enabled in Linux kernel (since v3.5).
> What we need to do is simply to claim that we support such feature,
> and nothing else.
> 
> With that, the guest will send an ARP message after live migration
> to notify the switches about the new location of migrated VM.
> 
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Tested-by: Pavel Fedin <p.fedin@samsung.com>
> ---
>  lib/librte_vhost/virtio-net.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index 03044f6..0ba5045 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -74,6 +74,7 @@ static struct virtio_net_config_ll *ll_root;
>  #define VHOST_SUPPORTED_FEATURES ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>  				(1ULL << VIRTIO_NET_F_CTRL_RX) | \
> +				(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE) | \
>  				(VHOST_SUPPORTS_MQ)            | \
>  				(1ULL << VIRTIO_F_VERSION_1)   | \
>  				(1ULL << VHOST_F_LOG_ALL)      | \
> 

I'm trying to compile DPDK on Debian 7, and it fails due to this
patch. Indeed, the VIRTIO_NET_F_GUEST_ANNOUNCE define is not
present in /usr/include/linux/virtio_net.h on this distribution.

I'm wondering whether librte_vhost should embed its own version
of virtio_net.h instead of relying on the one from the distribution.
It seems this has been done for the virtio guest PMD in
dpdk.org/drivers/net/virtio/virtio_pci.h.

What do you think?


Thanks,
Olivier
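
One common way to cope with an old distribution header, short of embedding a
full copy of virtio_net.h, is a guarded fallback define. The sketch below
assumes the bit number 21 that the virtio specification assigns to
VIRTIO_NET_F_GUEST_ANNOUNCE; demo_supported_features() and demo_has_feature()
are hypothetical helpers, and this is not necessarily how the issue was
resolved in DPDK.

```c
#include <stdint.h>

/*
 * In real code this would follow #include <linux/virtio_net.h>, so the
 * fallback only takes effect when the system header lacks the define.
 */
#ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
#define VIRTIO_NET_F_GUEST_ANNOUNCE 21	/* bit number per the virtio spec */
#endif

/* Hypothetical, reduced feature mask containing only the bit in question. */
static uint64_t
demo_supported_features(void)
{
	return 1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE;
}

/* Return 1 if 'bit' is set in 'features', 0 otherwise. */
static int
demo_has_feature(uint64_t features, unsigned int bit)
{
	return (int)((features >> bit) & 1);
}
```

The trade-off is that each missing define needs its own guard, whereas an
embedded header keeps all definitions in one place.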


* Re: [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-03-11 12:39       ` Olivier MATZ
@ 2016-03-11 13:16         ` Thomas Monjalon
  2016-03-11 13:22           ` Olivier MATZ
  0 siblings, 1 reply; 98+ messages in thread
From: Thomas Monjalon @ 2016-03-11 13:16 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

2016-03-11 13:39, Olivier MATZ:
> I'm trying to compile DPDK on Debian 7, and it fails due to this
> patch. Indeed, the VIRTIO_NET_F_GUEST_ANNOUNCE define is not
> present in /usr/include/linux/virtio_net.h on this distribution.

It will be fixed by this patch:
http://dpdk.org/dev/patchwork/patch/11195/

> I'm wondering whether librte_vhost should embed its own version
> of virtio_net.h instead of relying on the one from the distribution.
> It seems this has been done for the virtio guest PMD in
> dpdk.org/drivers/net/virtio/virtio_pci.h.
> 
> What do you think?


* Re: [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature
  2016-03-11 13:16         ` Thomas Monjalon
@ 2016-03-11 13:22           ` Olivier MATZ
  0 siblings, 0 replies; 98+ messages in thread
From: Olivier MATZ @ 2016-03-11 13:22 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev



On 03/11/2016 02:16 PM, Thomas Monjalon wrote:
> 2016-03-11 13:39, Olivier MATZ:
>> I'm trying to compile DPDK on Debian 7, and it fails due to this
>> patch. Indeed, the VIRTIO_NET_F_GUEST_ANNOUNCE define is not
>> present in /usr/include/linux/virtio_net.h on this distribution.
> 
> It will be fixed by this patch:
> http://dpdk.org/dev/patchwork/patch/11195/

OK, thanks Thomas


Thread overview: 98+ messages
2015-12-02  3:43 [PATCH 0/4 for 2.3] vhost-user live migration support Yuanhan Liu
2015-12-02  3:43 ` [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2015-12-02 13:53   ` Panu Matilainen
2015-12-02 14:31     ` Yuanhan Liu
2015-12-02 14:48       ` Panu Matilainen
2015-12-02 15:09         ` Yuanhan Liu
2015-12-02 16:58           ` Panu Matilainen
2015-12-02 17:24             ` Michael S. Tsirkin
2015-12-02 16:38       ` Thomas Monjalon
2015-12-03  1:49         ` Yuanhan Liu
2015-12-06 23:07     ` Thomas Monjalon
2015-12-07  2:00       ` Yuanhan Liu
2015-12-07  2:03         ` Thomas Monjalon
2015-12-07  2:18           ` Yuanhan Liu
2015-12-07  2:49             ` Thomas Monjalon
2015-12-07  6:29       ` Panu Matilainen
2015-12-07 11:28         ` Thomas Monjalon
2015-12-07 11:41           ` Panu Matilainen
2015-12-07 13:55             ` Thomas Monjalon
2015-12-07 16:48               ` Panu Matilainen
2015-12-07 17:47                 ` Thomas Monjalon
2015-12-08  5:57   ` Xie, Huawei
2015-12-08  7:25     ` Yuanhan Liu
2015-12-02  3:43 ` [PATCH 2/4] vhost: introduce vhost_log_write Yuanhan Liu
2015-12-02 13:53   ` Victor Kaplansky
2015-12-02 14:39     ` Yuanhan Liu
2015-12-09  3:33     ` Xie, Huawei
2015-12-09  3:42       ` Yuanhan Liu
2015-12-09  5:44         ` Xie, Huawei
2015-12-09  8:41           ` Yuanhan Liu
2015-12-02  3:43 ` [PATCH 3/4] vhost: log vring changes Yuanhan Liu
2015-12-02 14:07   ` Victor Kaplansky
2015-12-02 14:38     ` Yuanhan Liu
2015-12-02 15:58       ` Victor Kaplansky
2015-12-02 16:26         ` Michael S. Tsirkin
2015-12-03  2:31           ` Yuanhan Liu
2015-12-09  2:45     ` Xie, Huawei
2015-12-02  3:43 ` [PATCH 4/4] vhost: enable log_shmfd protocol feature Yuanhan Liu
2015-12-02 14:10 ` [PATCH 0/4 for 2.3] vhost-user live migration support Victor Kaplansky
2015-12-02 14:33   ` Yuanhan Liu
2015-12-09  3:41 ` Xie, Huawei
2015-12-17  3:11 ` [PATCH v2 0/6] " Yuanhan Liu
2015-12-17  3:11   ` [PATCH v2 1/6] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2015-12-21 15:32     ` Xie, Huawei
2015-12-22  2:25       ` Yuanhan Liu
2015-12-22  2:41         ` Xie, Huawei
2015-12-22  2:55           ` Yuanhan Liu
2015-12-17  3:11   ` [PATCH v2 2/6] vhost: introduce vhost_log_write Yuanhan Liu
2015-12-21 15:06     ` Xie, Huawei
2015-12-22  2:40       ` Yuanhan Liu
2015-12-22  2:45         ` Xie, Huawei
2015-12-22  3:04           ` Yuanhan Liu
2015-12-22  7:02             ` Xie, Huawei
2015-12-22  5:11     ` Peter Xu
2015-12-22  6:09       ` Yuanhan Liu
2015-12-17  3:11   ` [PATCH v2 3/6] vhost: log used vring changes Yuanhan Liu
2015-12-22  6:55     ` Peter Xu
2015-12-22  7:07       ` Xie, Huawei
2015-12-22  7:59         ` Peter Xu
2015-12-22  7:13       ` Yuanhan Liu
2015-12-22  8:01         ` Peter Xu
2015-12-17  3:11   ` [PATCH v2 4/6] vhost: log vring desc buffer changes Yuanhan Liu
2015-12-17  3:12   ` [PATCH v2 5/6] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
2015-12-22  8:11     ` Peter Xu
2015-12-22  8:21       ` Yuanhan Liu
2015-12-22  8:24       ` Pavel Fedin
2015-12-17  3:12   ` [PATCH v2 6/6] vhost: enable log_shmfd protocol feature Yuanhan Liu
2015-12-17 12:08   ` [PATCH v2 0/6] vhost-user live migration support Iremonger, Bernard
2015-12-17 12:45     ` Yuanhan Liu
2015-12-21  8:17   ` Pavel Fedin
2016-01-29  4:57   ` [PATCH v3 0/8] " Yuanhan Liu
2016-01-29  4:57     ` [PATCH v3 1/8] vhost: handle VHOST_USER_SET_LOG_BASE request Yuanhan Liu
2016-01-29  4:57     ` [PATCH v3 2/8] vhost: introduce vhost_log_write Yuanhan Liu
2016-02-19 14:26       ` Thomas Monjalon
2016-02-22  6:59         ` Yuanhan Liu
2016-01-29  4:57     ` [PATCH v3 3/8] vhost: log used vring changes Yuanhan Liu
2016-01-29  4:57     ` [PATCH v3 4/8] vhost: log vring desc buffer changes Yuanhan Liu
2016-01-29  4:58     ` [PATCH v3 5/8] vhost: claim that we support GUEST_ANNOUNCE feature Yuanhan Liu
2016-03-11 12:39       ` Olivier MATZ
2016-03-11 13:16         ` Thomas Monjalon
2016-03-11 13:22           ` Olivier MATZ
2016-01-29  4:58     ` [PATCH v3 6/8] vhost: handle VHOST_USER_SEND_RARP request Yuanhan Liu
2016-02-19  6:11       ` Tan, Jianfeng
2016-02-19  7:03         ` Yuanhan Liu
2016-02-19  8:55           ` Yuanhan Liu
2016-02-22 14:36           ` [PATCH] vhost: broadcast RARP pkt by injecting it to receiving mbuf array Yuanhan Liu
2016-02-24  8:15             ` Qiu, Michael
2016-02-24  8:28               ` Yuanhan Liu
2016-02-25  7:55                 ` Qiu, Michael
2016-02-29 15:56             ` Thomas Monjalon
2016-01-29  4:58     ` [PATCH v3 7/8] vhost: enable log_shmfd protocol feature Yuanhan Liu
2016-01-29  4:58     ` [PATCH v3 8/8] vhost: remove duplicate header include Yuanhan Liu
2016-02-01 15:54     ` [PATCH v3 0/8] vhost-user live migration support Iremonger, Bernard
2016-02-02  1:53       ` Yuanhan Liu
2016-02-19 15:01     ` Thomas Monjalon
2016-02-22  7:08       ` Yuanhan Liu
2016-02-22  9:56         ` Thomas Monjalon
2016-02-22 14:24           ` Yuanhan Liu
