netdev.vger.kernel.org archive mirror
* [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
@ 2012-05-17  9:20 Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model Liu Ping Fan
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: Krishna Kumar, Shirley Ma, Tom Lendacky, Michael S. Tsirkin,
	qemu-devel, Rusty Russell, Srivatsa Vaddagiri, linux-kernel,
	Ryan Harper, Avi Kivity, Anthony Liguori

Currently, the guest cannot see the NUMA placement of its vcpus, which
results in a performance penalty.

This was discovered and measured by
        Shirley Ma <xma@us.ibm.com>
        Krishna Kumar <krkumar2@in.ibm.com>
        Tom Lendacky <toml@us.ibm.com>
Refer to http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html,
which shows the large performance gap between the NUMA-aware and -unaware cases.

Inspired by their findings, I think we can go further -- that is, export the
host's NUMA info to the guest.

So here comes the idea:
1. Export host NUMA info through the guest's sched domains to its scheduler.
  Export each vcpu's NUMA info to the guest scheduler (I think the memory NUMA
  problem is already handled by the host), so the guest's load balancer will
  consider the cost.  I am still working on this; my original idea is to export
  this info to the guest through
  "static struct sched_domain_topology_level *sched_domain_topology".

2. Do a better emulation of the virtual machine exported to the guest.
  In the real world, devices are constrained for various reasons from having
  NUMA properties. But in Qemu, a device is emulated by threads, which inherit
  NUMA attributes naturally.  We can implement the device as a set of logic
  units, each backed by a thread on a different host node.
  Currently I want to start this work on vhost, but I think the iothread in
  Qemu could gain such an attribute in the future as well; a minimal sketch of
  the per-node worker idea follows.


Forgive me: with the limited time, I do not yet have a good enough
understanding of the vhost/virtio_net drivers. These patches are just a draft,
_FAR_, _FAR_ from working. I will do more detailed work on them in the future.

To ease review, the following is a summary of the 2nd point of the idea.
The 1st point of the idea is not reflected in the patches.

--Spread/shrink the vhost_workers over the host nodes as demanded by Qemu.
  Each vhost_worker can then be considered an independent net logic device
  embedded in the physical device "vhost_net".  Meanwhile, we spread the vcpu
  threads over the host nodes.
  The vrings in the guest are each allocated separately and PAGE_SIZE-aligned,
  so they can be mapped into different host nodes, and the vhost_worker on the
  same node can access them at the least cost. The same goes for the guest vqs.

--The virtio_net driver changes to talk with the logic devices. Which logic
  device it talks to is determined by the vcpu on which it is scheduled.

--The binding of vcpus and vhost_workers is implemented as follows (see the
  sketch after this list):
  For the call direction, vq-a on node-A has a dedicated irq-a, and we set
  irq-a's affinity to the vcpus on node-A.
  For the kick direction, kick register-b triggers a different eventfd-b,
  which wakes up vhost_worker-b.
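
A minimal sketch of the call-direction binding (irq_set_affinity() is the
existing kernel API; the mask of vcpus belonging to one node is assumed to
come from the virtio-side patches):

#include <linux/interrupt.h>
#include <linux/cpumask.h>
#include <linux/slab.h>

/* Pin a vq's call interrupt to the online vcpus of one node, so completions
 * are handled where the corresponding vhost_worker's data lives. */
static int bind_call_irq_to_node(unsigned int irq,
				 const struct cpumask *node_vcpus)
{
	cpumask_var_t mask;
	int ret;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;
	cpumask_and(mask, node_vcpus, cpu_online_mask);
	ret = irq_set_affinity(irq, mask);
	free_cpumask_var(mask);
	return ret;
}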


Please give some comments and suggestions.

Thanks and regards,
pingfan


* [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model.
  2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
@ 2012-05-17  9:20 ` Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 2/2] [kvm/vhost-net]: make vhost net own NUMA attribute Liu Ping Fan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Make vhost allocate vhost_virtqueue on different host nodes as required.
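
The core mechanism, sketched minimally (illustrative helper; the real changes
are in the diff below), is per-node allocation with kmalloc_node():

#include <linux/slab.h>

/* Place a virtqueue's bookkeeping on the host node whose vhost worker and
 * vcpus will touch it. */
static void *alloc_vq_on_node(size_t vq_size, int node)
{
	return kmalloc_node(vq_size, GFP_KERNEL, node);
}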

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/vhost/vhost.c |  380 +++++++++++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.h |   41 ++++--
 include/linux/vhost.h |    2 +-
 3 files changed, 304 insertions(+), 119 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 51e4c1e..b0d2855 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/file.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
 
@@ -37,12 +38,11 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static unsigned vhost_zcopy_mask __read_mostly;
 
 #define vhost_used_event(vq) ((u16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((u16 __user *)&vq->used->ring[vq->num])
 
-static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
 	struct vhost_poll *poll;
@@ -75,12 +75,12 @@ static void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_sub_dev *dev)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->subdev = dev;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -103,7 +103,7 @@ void vhost_poll_stop(struct vhost_poll *poll)
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
 
-static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
+static bool vhost_work_seq_done(struct vhost_sub_dev *dev, struct vhost_work *work,
 				unsigned seq)
 {
 	int left;
@@ -114,19 +114,19 @@ static bool vhost_work_seq_done(struct vhost_dev *dev, struct vhost_work *work,
 	return left <= 0;
 }
 
-static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_flush(struct vhost_sub_dev *sub, struct vhost_work *work)
 {
 	unsigned seq;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(&sub->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&dev->work_lock);
-	wait_event(work->done, vhost_work_seq_done(dev, work, seq));
-	spin_lock_irq(&dev->work_lock);
+	spin_unlock_irq(&sub->work_lock);
+	wait_event(work->done, vhost_work_seq_done(sub, work, seq));
+	spin_lock_irq(&sub->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(&sub->work_lock);
 	BUG_ON(flushing < 0);
 }
 
@@ -134,26 +134,26 @@ static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_flush(poll->subdev, &poll->work);
 }
 
-static inline void vhost_work_queue(struct vhost_dev *dev,
+static inline void vhost_work_queue(struct vhost_sub_dev *sub,
 				    struct vhost_work *work)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(&sub->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, &sub->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(sub->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(&sub->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_work_queue(poll->subdev, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -188,7 +188,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_sub_dev *sub = data;
+	struct vhost_dev *dev = sub->owner;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -198,7 +199,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(&sub->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -206,18 +207,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(&sub->work_lock);
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(&sub->work_list)) {
+			work = list_first_entry(&sub->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(&sub->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -244,54 +245,189 @@ static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
 	vq->ubuf_info = NULL;
 }
 
-void vhost_enable_zcopy(int vq)
+void vhost_enable_zcopy(struct vhost_dev *dev, int rx)
 {
-	vhost_zcopy_mask |= 0x1 << vq;
+	int i;
+	if (rx == 0)
+		for (i = 0; i < dev->node_cnt; i++)
+			dev->zcopy_mask |= 0x1<<(2*i+1);
 }
 
-/* Helper to allocate iovec buffers for all vqs. */
-static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
+/* Needed for dynamic vq allocation, which is important for migrating among NUMA nodes */
+static int vhost_vq_alloc_iovecs(struct vhost_virtqueue *vq)
 {
-	int i;
 	bool zcopy;
+	int i;
+	struct vhost_dev *dev = vq->dev;
+	int node = vq->node_id;
+	vq->indirect = kmalloc_node(sizeof *vq->indirect  *
+					   UIO_MAXIOV, GFP_KERNEL, node);
+	vq->log = kmalloc_node(sizeof *vq->log * UIO_MAXIOV,
+				  GFP_KERNEL, node);
+	vq->heads = kmalloc_node(sizeof *vq->heads *
+					UIO_MAXIOV, GFP_KERNEL, node);
+	for (i = 0; i < dev->node_cnt*2; i++) {
+		if (dev->vqs[i] == vq) {
+			zcopy = dev->zcopy_mask & (0x1 << i);
+			break;
+		}
+	}
+	if (zcopy)
+		vq->ubuf_info =
+			kmalloc_node(sizeof *vq->ubuf_info *
+				UIO_MAXIOV, GFP_KERNEL, node);
+	if (!vq->indirect || !vq->log || !vq->heads ||
+		(zcopy && !vq->ubuf_info)) {
+		kfree(vq->indirect);
+		kfree(vq->log);
+		kfree(vq->heads);
+		kfree(vq->ubuf_info);
 
-	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].indirect = kmalloc(sizeof *dev->vqs[i].indirect *
-					       UIO_MAXIOV, GFP_KERNEL);
-		dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
-					  GFP_KERNEL);
-		dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
-					    UIO_MAXIOV, GFP_KERNEL);
-		zcopy = vhost_zcopy_mask & (0x1 << i);
-		if (zcopy)
-			dev->vqs[i].ubuf_info =
-				kmalloc(sizeof *dev->vqs[i].ubuf_info *
-					UIO_MAXIOV, GFP_KERNEL);
-		if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
-			!dev->vqs[i].heads ||
-			(zcopy && !dev->vqs[i].ubuf_info))
+		return -ENOMEM;
+	} else
+		return 0;
+}
+
+/* Helper to allocate iovec buffers for all vqs. */
+static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
+{
+	int i, ret;
+	for (i = 0; i < dev->nvqs; i++) {
+		ret = vhost_vq_alloc_iovecs(dev->vqs[i]);
+		if (ret < 0) {
+			i -= 1;
 			goto err_nomem;
+		}
 	}
 	return 0;
-
 err_nomem:
 	for (; i >= 0; --i)
-		vhost_vq_free_iovecs(&dev->vqs[i]);
+		vhost_vq_free_iovecs(dev->vqs[i]);
 	return -ENOMEM;
 }
 
 static void vhost_dev_free_iovecs(struct vhost_dev *dev)
 {
 	int i;
-
 	for (i = 0; i < dev->nvqs; ++i)
-		vhost_vq_free_iovecs(&dev->vqs[i]);
+		vhost_vq_free_iovecs(dev->vqs[i]);
 }
 
-long vhost_dev_init(struct vhost_dev *dev,
-		    struct vhost_virtqueue *vqs, int nvqs)
+int vhost_dev_alloc_subdevs(struct vhost_dev *dev, unsigned long *numa_map,
+	int sz)
+{
+	int i, j = 0;
+	int cur, prev = 0;
+	struct vhost_sub_dev *sub;
+	/* Todo: replace allow_map with a dynamically allocated map */
+	dev->allow_map = *numa_map;
+	dev->sub_devs = kmalloc(dev->node_cnt*sizeof(void *), GFP_KERNEL);
+
+	while (1) {
+		cur = find_next_bit(numa_map, sz, prev);
+		if (cur >= sz)
+			break;
+		prev = cur;
+		sub =  kmalloc_node(sizeof(struct vhost_sub_dev), GFP_KERNEL, cur);
+		if (sub == NULL)
+			goto err;
+		j++;
+		sub->node_id = cur;
+		sub->owner = dev;
+		spin_lock_init(&sub->work_lock);
+		INIT_LIST_HEAD(&sub->work_list);
+		dev->sub_devs[i] = sub;
+	}
+
+	dev->node_cnt = j;
+	return 0;
+err:
+	for (i = 0; i < j; i++) {
+		kfree(dev->sub_devs[i]);
+		dev->sub_devs[i] = NULL;
+	}
+	return -ENOMEM;
+
+}
+
+void vhost_dev_free_subdevs(struct vhost_dev *dev)
 {
 	int i;
+	for (i = 0; i < dev->node_cnt; i++)
+		kfree(dev->sub_devs[i]);
+	return;
+}
+
+static int check_numa(int *vqs_map, int sz)
+{
+	int i, node;
+
+	for (i = 0; i < sz; i++) {
+		for_each_online_node(node)
+			if (vqs_map[i] == node)
+				break;
+		if (vqs_map[i] != node)
+			return -1;
+	}
+	return 0;
+}
+
+int check_numa_bmp(unsigned long *numa_bmp, int sz)
+{
+	int i, node, cur, prev = 0;
+
+	for (i = 0; i < sz; i++) {
+		cur = find_next_bit(numa_bmp, sz, prev);
+		prev = cur;
+		if (cur >= sz)
+			return 0;
+		for_each_online_node(node)
+			if (cur == node)
+				break;
+		if (cur != node)
+			return -1;
+	}
+	return 0;
+}
+
+/* allocate vqs in node according to request map */
+int vhost_dev_alloc_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int cnt,
+	int *vqs_map, int sz, vhost_work_fn_t *handle_kick)
+{
+	int r, i, j = 0;
+	r = check_numa(vqs_map, sz);
+	if (r < 0)
+		return -EINVAL;
+	for (i = 0; i < cnt ; i++) {
+		vqs[i] = kmalloc_node(sizeof(struct vhost_virtqueue),
+			GFP_KERNEL, vqs_map[i]);
+		if (vqs[i] == NULL)
+			goto err;
+		vqs[i]->handle_kick = handle_kick[i];
+		j = i;
+	}
+	return 0;
+err:
+	for (i = 0; i < j; i++)
+		kfree(vqs[i]);
+	return -ENOMEM;
+
+}
+
+void vhost_dev_free_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
+	int cnt)
+{
+	int i;
+	for (i = 0; i < cnt ; i++)
+		kfree(vqs[i]);
+	return;
+}
+
+long vhost_dev_init(struct vhost_dev *dev, struct vhost_virtqueue **vqs, int nvqs)
+{
+	int i, j, ret = 0;
+	struct vhost_sub_dev *subdev;
+	struct vhost_virtqueue *vq;
 
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -300,24 +436,32 @@ long vhost_dev_init(struct vhost_dev *dev,
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].log = NULL;
-		dev->vqs[i].indirect = NULL;
-		dev->vqs[i].heads = NULL;
-		dev->vqs[i].ubuf_info = NULL;
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
-		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
-	}
+		vq = dev->vqs[i];
+		/* for each numa node, in-vq/out-vq */
+		vq->log = NULL;
+		vq->indirect = NULL;
+		vq->heads = NULL;
+		vq->ubuf_info = NULL;
+		vq->dev = dev;
+		mutex_init(&vq->mutex);
+		vhost_vq_reset(dev, vq);
+
+		if (vq->handle_kick) {
+			for (j = 0; j < i; j++) {
+				subdev =  dev->sub_devs[j];
+				if (vq->node_id == subdev->node_id)
+					vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN, subdev);
+				else {
+					vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN, dev->sub_devs[0]);
+					ret = 1;
+				}
+			}
+		}
 
-	return 0;
+	}
+	return ret;
 }
 
 /* Caller should have device mutex */
@@ -344,19 +488,26 @@ static void vhost_attach_cgroups_work(struct vhost_work *work)
 static int vhost_attach_cgroups(struct vhost_dev *dev)
 {
 	struct vhost_attach_cgroups_struct attach;
-
+	int i, ret = 0;
+	struct vhost_sub_dev *sub;
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_work_flush(dev, &attach.work);
-	return attach.ret;
+	for (i = 0; i < dev->node_cnt; i++) {
+		sub = dev->sub_devs[i];
+		vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+		vhost_work_queue(sub, &attach.work);
+		vhost_work_flush(sub, &attach.work);
+		ret |= attach.ret;
+	}
+	return ret;
 }
 
 /* Caller should have device mutex */
 static long vhost_dev_set_owner(struct vhost_dev *dev)
 {
 	struct task_struct *worker;
-	int err;
+	int err, i, j, cur, prev = 0;
+	int sz = sizeof(unsigned long);
+	const struct cpumask *mask;
 
 	/* Is there an owner already? */
 	if (dev->mm) {
@@ -366,14 +517,19 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
+
+	for (i = 0, j = 0; i < dev->node_cnt; i++, j++) {
+		cur = find_next_bit(&dev->allow_map, sz, prev);
+		dev->sub_devs[i]->worker = kthread_create_on_node(vhost_worker,
+			dev->sub_devs[i], cur, "vhost-%d-node-%d", current->pid, cur);
+		if (dev->sub_devs[i]->worker == NULL)
+			goto err_cgroup;
+		mask = cpumask_of_node(cur);
+		do_set_cpus_allowed(worker, mask);
 	}
 
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
+	for (i = 0; i < dev->node_cnt; i++)
+		wake_up_process(dev->sub_devs[i]->worker);
 
 	err = vhost_attach_cgroups(dev);
 	if (err)
@@ -385,9 +541,12 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	return 0;
 err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
-err_worker:
+
+	for (i = 0; i < j; i++) {
+		kthread_stop(dev->sub_devs[i]->worker);
+		dev->sub_devs[i]->worker = NULL;
+	}
+
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -442,28 +601,28 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 	int i;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
-			vhost_poll_stop(&dev->vqs[i].poll);
-			vhost_poll_flush(&dev->vqs[i].poll);
+		if (dev->vqs[i]->kick && dev->vqs[i]->handle_kick) {
+			vhost_poll_stop(&dev->vqs[i]->poll);
+			vhost_poll_flush(&dev->vqs[i]->poll);
 		}
 		/* Wait for all lower device DMAs done. */
-		if (dev->vqs[i].ubufs)
-			vhost_ubuf_put_and_wait(dev->vqs[i].ubufs);
+		if (dev->vqs[i]->ubufs)
+			vhost_ubuf_put_and_wait(dev->vqs[i]->ubufs);
 
 		/* Signal guest as appropriate. */
-		vhost_zerocopy_signal_used(&dev->vqs[i]);
-
-		if (dev->vqs[i].error_ctx)
-			eventfd_ctx_put(dev->vqs[i].error_ctx);
-		if (dev->vqs[i].error)
-			fput(dev->vqs[i].error);
-		if (dev->vqs[i].kick)
-			fput(dev->vqs[i].kick);
-		if (dev->vqs[i].call_ctx)
-			eventfd_ctx_put(dev->vqs[i].call_ctx);
-		if (dev->vqs[i].call)
-			fput(dev->vqs[i].call);
-		vhost_vq_reset(dev, dev->vqs + i);
+		vhost_zerocopy_signal_used(dev->vqs[i]);
+
+		if (dev->vqs[i]->error_ctx)
+			eventfd_ctx_put(dev->vqs[i]->error_ctx);
+		if (dev->vqs[i]->error)
+			fput(dev->vqs[i]->error);
+		if (dev->vqs[i]->kick)
+			fput(dev->vqs[i]->kick);
+		if (dev->vqs[i]->call_ctx)
+			eventfd_ctx_put(dev->vqs[i]->call_ctx);
+		if (dev->vqs[i]->call)
+			fput(dev->vqs[i]->call);
+		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
 	if (dev->log_ctx)
@@ -477,11 +636,15 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 					locked ==
 						lockdep_is_held(&dev->mutex)));
 	RCU_INIT_POINTER(dev->memory, NULL);
+
+	/* fixme: it will be considered and fixed in the next version */
 	WARN_ON(!list_empty(&dev->work_list));
 	if (dev->worker) {
 		kthread_stop(dev->worker);
 		dev->worker = NULL;
 	}
+	/* end*/
+
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -534,14 +697,14 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_memory *mem,
 
 	for (i = 0; i < d->nvqs; ++i) {
 		int ok;
-		mutex_lock(&d->vqs[i].mutex);
+		mutex_lock(&d->vqs[i]->mutex);
 		/* If ring is inactive, will check when it's enabled. */
-		if (d->vqs[i].private_data)
-			ok = vq_memory_access_ok(d->vqs[i].log_base, mem,
+		if (d->vqs[i]->private_data)
+			ok = vq_memory_access_ok(d->vqs[i]->log_base, mem,
 						 log_all);
 		else
 			ok = 1;
-		mutex_unlock(&d->vqs[i].mutex);
+		mutex_unlock(&d->vqs[i]->mutex);
 		if (!ok)
 			return 0;
 	}
@@ -650,8 +813,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		return r;
 	if (idx >= d->nvqs)
 		return -ENOBUFS;
-
-	vq = d->vqs + idx;
+	vq = d->vqs[idx];
 
 	mutex_lock(&vq->mutex);
 
@@ -750,6 +912,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		vq->log_addr = a.log_guest_addr;
 		vq->used = (void __user *)(unsigned long)a.used_user_addr;
 		break;
+
 	case VHOST_SET_VRING_KICK:
 		if (copy_from_user(&f, argp, sizeof f)) {
 			r = -EFAULT;
@@ -766,6 +929,7 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		} else
 			filep = eventfp;
 		break;
+
 	case VHOST_SET_VRING_CALL:
 		if (copy_from_user(&f, argp, sizeof f)) {
 			r = -EFAULT;
@@ -863,7 +1027,7 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		for (i = 0; i < d->nvqs; ++i) {
 			struct vhost_virtqueue *vq;
 			void __user *base = (void __user *)(unsigned long)p;
-			vq = d->vqs + i;
+			vq = d->vqs[i];
 			mutex_lock(&vq->mutex);
 			/* If ring is inactive, will check when it's enabled. */
 			if (vq->private_data && !vq_log_access_ok(d, vq, base))
@@ -890,9 +1054,9 @@ long vhost_dev_ioctl(struct vhost_dev *d, unsigned int ioctl, unsigned long arg)
 		} else
 			filep = eventfp;
 		for (i = 0; i < d->nvqs; ++i) {
-			mutex_lock(&d->vqs[i].mutex);
-			d->vqs[i].log_ctx = d->log_ctx;
-			mutex_unlock(&d->vqs[i].mutex);
+			mutex_lock(&d->vqs[i]->mutex);
+			d->vqs[i]->log_ctx = d->log_ctx;
+			mutex_unlock(&d->vqs[i]->mutex);
 		}
 		if (ctx)
 			eventfd_ctx_put(ctx);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8de1fd5..12d4237 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,12 +13,13 @@
 #include <linux/virtio_ring.h>
 #include <linux/atomic.h>
 
+#define VHOST_NUMA
 /* This is for zerocopy, used buffer len is set to 1 when lower device DMA
  * done */
 #define VHOST_DMA_DONE_LEN	1
 #define VHOST_DMA_CLEAR_LEN	0
 
-struct vhost_device;
+struct vhost_dev;
 
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -32,6 +33,8 @@ struct vhost_work {
 	unsigned		  done_seq;
 };
 
+struct vhost_sub_dev;
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
@@ -40,11 +43,13 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_sub_dev *subdev;
 };
 
+void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
+			    poll_table *pt);
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_sub_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -70,7 +75,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
-
+	int node_id;
 	/* The actual ring of buffers. */
 	struct mutex mutex;
 	unsigned int num;
@@ -143,6 +148,14 @@ struct vhost_virtqueue {
 	struct vhost_ubuf_ref *ubufs;
 };
 
+struct vhost_sub_dev {
+	struct vhost_dev *owner;
+	int node_id;
+	spinlock_t work_lock;
+	struct list_head work_list;
+	struct task_struct *worker;
+};
+
 struct vhost_dev {
 	/* Readers use RCU to access memory table pointer
 	 * log base pointer and features.
@@ -151,16 +164,24 @@ struct vhost_dev {
 	struct mm_struct *mm;
 	struct mutex mutex;
 	unsigned acked_features;
-	struct vhost_virtqueue *vqs;
+	struct vhost_virtqueue **vqs;
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	/* todo, change it to bitmap */
+	unsigned long allow_map;
+	unsigned long node_cnt;
+	unsigned long zcopy_mask;
+	struct vhost_sub_dev **sub_devs;
 };
 
-long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+int check_numa_bmp(unsigned long *numa_bmp, int sz);
+int vhost_dev_alloc_subdevs(struct vhost_dev *dev, unsigned long *numa_map,
+	int sz);
+void vhost_dev_free_subdevs(struct vhost_dev *dev);
+int vhost_dev_alloc_vqs(struct vhost_dev *dev, struct vhost_virtqueue **vqs,
+	int cnt, int *vqs_map, int sz, vhost_work_fn_t *handle_kick);
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *, bool locked);
@@ -216,6 +237,6 @@ static inline int vhost_has_feature(struct vhost_dev *dev, int bit)
 	return acked_features & (1 << bit);
 }
 
-void vhost_enable_zcopy(int vq);
+void vhost_enable_zcopy(struct vhost_dev *dev, int rx);
 
 #endif
diff --git a/include/linux/vhost.h b/include/linux/vhost.h
index e847f1e..d8c76f1 100644
--- a/include/linux/vhost.h
+++ b/include/linux/vhost.h
@@ -120,7 +120,7 @@ struct vhost_memory {
  * used for transmit.  Pass fd -1 to unbind from the socket and the transmit
  * device.  This can be used to stop the ring (e.g. for migration). */
 #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
-
+#define VHOST_NET_SET_NUMA  _IOW(VHOST_VIRTIO, 0x31, unsigned long)
 /* Feature bits */
 /* Log all write descriptors. Can be changed while device is active. */
 #define VHOST_F_LOG_ALL 26
-- 
1.7.4.4


* [PATCH 2/2] [kvm/vhost-net]: make vhost net own NUMA attribute
  2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model Liu Ping Fan
@ 2012-05-17  9:20 ` Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr Liu Ping Fan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: Krishna Kumar, Shirley Ma, Tom Lendacky, Michael S. Tsirkin,
	qemu-devel, Rusty Russell, Srivatsa Vaddagiri, linux-kernel,
	Ryan Harper, Avi Kivity, Anthony Liguori

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Make vhost net support spreading over the host nodes as commanded.
Consider the whole vhost_net as composed of several logic net units:
for each node there is one unit, which includes a vhost_worker thread
and an rx/tx vhost_virtqueue pair.
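
For reference, a hypothetical userspace sequence (e.g. what Qemu might issue)
could look like the sketch below. VHOST_NET_SET_NUMA is the ioctl added by
this patch and has to be issued before the other vhost-net ioctls; everything
else here is illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

#ifndef VHOST_NET_SET_NUMA
#define VHOST_NET_SET_NUMA _IOW(VHOST_VIRTIO, 0x31, unsigned long)
#endif

int main(void)
{
	unsigned long node_map = 0x3;	/* bit i set => host node i allowed */
	int fd = open("/dev/vhost-net", O_RDWR);

	if (fd < 0) {
		perror("open /dev/vhost-net");
		return 1;
	}
	if (ioctl(fd, VHOST_NET_SET_NUMA, &node_map) < 0) {
		perror("VHOST_NET_SET_NUMA");
		return 1;
	}
	/* ... then VHOST_SET_OWNER, VHOST_SET_MEM_TABLE, vring setup ... */
	return 0;
}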

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/vhost/net.c |  388 ++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 258 insertions(+), 130 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1f21d2a..770933e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -55,8 +55,19 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
+	int numa_init;
+	int vqcnt;
+	struct vhost_virtqueue **vqs;
+	/* one for tx, one for rx */
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	int token[VHOST_NET_VQ_MAX];
+	/* fix me: although tun.socket.sock can be accessed in parallel, _maybe_ we need
+	 * to record wmem_alloc independently for each subdev.
+	 */
+	struct mutex mutex;
+	struct socket __rcu *tx_sock;
+	struct socket __rcu *rx_sock;
+
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -112,7 +123,9 @@ static void tx_poll_stop(struct vhost_net *net)
 {
 	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
 		return;
+
 	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
+
 	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
 }
 
@@ -121,15 +134,15 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 {
 	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
 		return;
+
 	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_net *net, struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -148,15 +161,15 @@ static void handle_tx(struct vhost_net *net)
 	bool zcopy;
 
 	/* TODO: check that we are running from vhost_worker? */
-	sock = rcu_dereference_check(vq->private_data, 1);
+	sock = rcu_dereference_check(net->tx_sock, 1);
 	if (!sock)
 		return;
 
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
-		mutex_lock(&vq->mutex);
+		mutex_lock(&net->mutex);
 		tx_poll_start(net, sock);
-		mutex_unlock(&vq->mutex);
+		mutex_unlock(&net->mutex);
 		return;
 	}
 
@@ -165,6 +178,7 @@ static void handle_tx(struct vhost_net *net)
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
+
 	hdr_size = vq->vhost_hlen;
 	zcopy = vhost_sock_zcopy(sock);
 
@@ -186,8 +200,10 @@ static void handle_tx(struct vhost_net *net)
 
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+				mutex_lock(&net->mutex);
 				tx_poll_start(net, sock);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				mutex_unlock(&net->mutex);
 				break;
 			}
 			/* If more outstanding DMAs, queue the work.
@@ -197,8 +213,10 @@ static void handle_tx(struct vhost_net *net)
 				    (vq->upend_idx - vq->done_idx) :
 				    (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
 			if (unlikely(num_pends > VHOST_MAX_PEND)) {
+				mutex_lock(&net->mutex);
 				tx_poll_start(net, sock);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				mutex_unlock(&net->mutex);
 				break;
 			}
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
@@ -353,9 +371,8 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_net *net, struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -375,11 +392,10 @@ static void handle_rx(struct vhost_net *net)
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
 	/* TODO: check that we are running from vhost_worker? */
-	struct socket *sock = rcu_dereference_check(vq->private_data, 1);
+	struct socket *sock = rcu_dereference_check(net->tx_sock, 1);
 
 	if (!sock)
 		return;
-
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(&net->dev, vq);
 	vhost_hlen = vq->vhost_hlen;
@@ -465,8 +481,7 @@ static void handle_tx_kick(struct vhost_work *work)
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
-
-	handle_tx(net);
+	handle_tx(net, vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
@@ -475,103 +490,115 @@ static void handle_rx_kick(struct vhost_work *work)
 						  poll.work);
 	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(net, vq);
 }
 
-static void handle_tx_net(struct vhost_work *work)
+/* Get sock->file event, then pick up a vhost_worker to wake up.
+ * Currently we round-robin; maybe in the future we will know which
+ * numa node the skb from the tap device wants to go to.
+ */
+static int deliver_worker(struct vhost_net *net, int rx)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	int i = rx ? VHOST_NET_VQ_RX : VHOST_NET_VQ_TX;
+	int idx = ((net->token[i]++<<1)+i)%net->vqcnt;
+	vhost_poll_queue(&net->vqs[idx]->poll);
+	return 0;
 }
 
-static void handle_rx_net(struct vhost_work *work)
+static int net_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
+			     void *key)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_poll *poll = container_of(wait, struct vhost_poll, wait);
+	struct vhost_poll *head = (poll->mask == POLLIN) ? poll : poll-1;
+	struct vhost_net *net = container_of(head, struct vhost_net, poll[0]);
+
+	if (!((unsigned long)key & poll->mask))
+		return 0;
+
+	if (poll->mask == POLLIN)
+		deliver_worker(net, 1);
+	else
+		deliver_worker(net, 0);
+	return 0;
+}
+
+static void net_poll_init(struct vhost_poll *poll, unsigned long mask)
+{
+	init_waitqueue_func_entry(&poll->wait, net_poll_wakeup);
+	init_poll_funcptr(&poll->table, vhost_poll_func);
+	poll->mask = mask;
+	poll->subdev = NULL;
 }
 
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
-
 	if (!n)
 		return -ENOMEM;
-
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
-	}
-
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
-
 	f->private_data = n;
-
 	return 0;
 }
 
-static void vhost_net_disable_vq(struct vhost_net *n,
-				 struct vhost_virtqueue *vq)
+static void vhost_net_disable_xmit(struct vhost_net *n, int rx)
 {
-	if (!vq->private_data)
-		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
+	if (rx  == 0) {
 		tx_poll_stop(n);
 		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(n->poll+VHOST_NET_VQ_RX);
 }
 
-static void vhost_net_enable_vq(struct vhost_net *n,
-				struct vhost_virtqueue *vq)
+static void vhost_net_enable_xmit(struct vhost_net *n, int rx)
 {
 	struct socket *sock;
 
-	sock = rcu_dereference_protected(vq->private_data,
-					 lockdep_is_held(&vq->mutex));
-	if (!sock)
-		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
+	if (rx == 0) {
+		sock = rcu_dereference_protected(n->tx_sock,
+					 lockdep_is_held(&n->mutex));
+		if (!sock)
+			return;
 		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
 		tx_poll_start(n, sock);
-	} else
+	} else {
+		sock = rcu_dereference_protected(n->rx_sock,
+					 lockdep_is_held(&n->mutex));
+		if (!sock)
+			return;
 		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+	}
 }
 
-static struct socket *vhost_net_stop_vq(struct vhost_net *n,
-					struct vhost_virtqueue *vq)
+static int vhost_net_stop_xmit(struct vhost_net *n, int rx)
 {
-	struct socket *sock;
-
-	mutex_lock(&vq->mutex);
-	sock = rcu_dereference_protected(vq->private_data,
-					 lockdep_is_held(&vq->mutex));
-	vhost_net_disable_vq(n, vq);
-	rcu_assign_pointer(vq->private_data, NULL);
-	mutex_unlock(&vq->mutex);
-	return sock;
+	mutex_lock(&n->mutex);
+	vhost_net_disable_xmit(n, rx);
+	mutex_unlock(&n->mutex);
+	return 0;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	vhost_net_stop_xmit(n, 0);
+	vhost_net_stop_xmit(n, 1);
 }
 
-static void vhost_net_flush_vq(struct vhost_net *n, int index)
+/* We wait for vhost_work on all vqs to finish gp. And n->poll[] 
+ * are not vhost_work any longer
+ */
+static void vhost_net_flush_vq(struct vhost_net *n, int rx)
 {
-	vhost_poll_flush(n->poll + index);
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	int i, idx;
+	if (rx == 0) {
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			idx = (i<<1) + VHOST_NET_VQ_TX;
+			vhost_poll_flush(&n->dev.vqs[idx]->poll);
+		}
+	} else {
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			idx = (i<<1) + VHOST_NET_VQ_RX;
+			vhost_poll_flush(&n->dev.vqs[idx]->poll);
+		}
+	}
 }
 
 static void vhost_net_flush(struct vhost_net *n)
@@ -583,16 +610,16 @@ static void vhost_net_flush(struct vhost_net *n)
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	vhost_dev_cleanup(&n->dev, false);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	if (n->tx_sock)
+		fput(n->tx_sock->file);
+	if (n->rx_sock)
+		fput(n->rx_sock->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
@@ -665,30 +692,27 @@ static struct socket *get_socket(int fd)
 	return ERR_PTR(-ENOTSOCK);
 }
 
-static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
+static long vhost_net_set_backend(struct vhost_net *n, unsigned rx, int fd)
 {
 	struct socket *sock, *oldsock;
 	struct vhost_virtqueue *vq;
-	struct vhost_ubuf_ref *ubufs, *oldubufs = NULL;
-	int r;
+	struct vhost_ubuf_ref *ubufs, *old, **oldubufs = NULL;
+	int r, i;
+	struct vhost_poll *poll;
+	struct socket **target;
 
+	oldubufs = kmalloc(sizeof(void *)*n->dev.node_cnt, GFP_KERNEL);
+	if (oldubufs == NULL)
+		return -ENOMEM;
 	mutex_lock(&n->dev.mutex);
 	r = vhost_dev_check_owner(&n->dev);
 	if (r)
 		goto err;
+	if (rx)
+		target = &n->rx_sock;
+	else
+		target = &n->tx_sock;
 
-	if (index >= VHOST_NET_VQ_MAX) {
-		r = -ENOBUFS;
-		goto err;
-	}
-	vq = n->vqs + index;
-	mutex_lock(&vq->mutex);
-
-	/* Verify that ring has been setup correctly. */
-	if (!vhost_vq_access_ok(vq)) {
-		r = -EFAULT;
-		goto err_vq;
-	}
 	sock = get_socket(fd);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
@@ -696,70 +720,106 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	}
 
 	/* start polling new socket */
-	oldsock = rcu_dereference_protected(vq->private_data,
-					    lockdep_is_held(&vq->mutex));
+	if (rx == 1)
+		/* todo, consider about protection, hold net->mutex? */
+		oldsock = rcu_dereference_protected(n->rx_sock, 1);
+	else
+		oldsock = rcu_dereference_protected(n->tx_sock, 1);
+
 	if (sock != oldsock) {
-		ubufs = vhost_ubuf_alloc(vq, sock && vhost_sock_zcopy(sock));
-		if (IS_ERR(ubufs)) {
-			r = PTR_ERR(ubufs);
-			goto err_ubufs;
+		if (rx == 1)
+			poll = &n->poll[0];
+		else
+			poll = &n->poll[1];
+
+		/* todo, consider about protection, hold net->mutex? */
+		vhost_poll_stop(poll);
+
+		for (i = 0; i < n->dev.node_cnt; i++) {
+			if (rx == 0)
+				vq = n->vqs[(i<<1)+VHOST_NET_VQ_TX];
+			else
+				vq = n->vqs[(i<<1)+VHOST_NET_VQ_RX];
+
+			mutex_lock(&vq->mutex);
+			ubufs = vhost_ubuf_alloc(vq, sock && vhost_sock_zcopy(sock));
+			if (IS_ERR(ubufs)) {
+				r = PTR_ERR(ubufs);
+				mutex_unlock(&vq->mutex);
+				goto err_ubufs;
+			}
+			oldubufs[i] = vq->ubufs;
+			vq->ubufs = ubufs;
+			r = vhost_init_used(vq);
+			mutex_unlock(&vq->mutex);
+			if (r)
+				goto err_vq;
 		}
-		oldubufs = vq->ubufs;
-		vq->ubufs = ubufs;
-		vhost_net_disable_vq(n, vq);
-		rcu_assign_pointer(vq->private_data, sock);
-		vhost_net_enable_vq(n, vq);
-
-		r = vhost_init_used(vq);
-		if (r)
-			goto err_vq;
+
+		mutex_lock(&n->mutex);
+		vhost_net_disable_xmit(n, rx);
+		if (rx == 1)
+			rcu_assign_pointer(n->rx_sock, sock);
+		else
+			rcu_assign_pointer(n->tx_sock, sock);
+		vhost_net_enable_xmit(n, rx);
+		mutex_unlock(&n->mutex);
+
+		/* todo, consider about protection, hold net->mutex? */
+		vhost_poll_start(poll, sock->file);
 	}
 
-	mutex_unlock(&vq->mutex);
+	for (i = 0; i < n->dev.node_cnt; i++) {
+		old = oldubufs[i];
+		if (rx == 0)
+			vq = n->vqs[(i<<1)+VHOST_NET_VQ_TX];
+		else
+			vq = n->vqs[(i<<1)+VHOST_NET_VQ_RX];
 
-	if (oldubufs) {
-		vhost_ubuf_put_and_wait(oldubufs);
-		mutex_lock(&vq->mutex);
-		vhost_zerocopy_signal_used(vq);
-		mutex_unlock(&vq->mutex);
+		if (old) {
+			vhost_ubuf_put_and_wait(old);
+			mutex_lock(&vq->mutex);
+			vhost_zerocopy_signal_used(vq);
+			mutex_unlock(&vq->mutex);
+		}
 	}
 
 	if (oldsock) {
-		vhost_net_flush_vq(n, index);
+		vhost_net_flush_vq(n, rx);
 		fput(oldsock->file);
 	}
 
 	mutex_unlock(&n->dev.mutex);
+	kfree(oldubufs);
 	return 0;
 
 err_ubufs:
 	fput(sock->file);
 err_vq:
-	mutex_unlock(&vq->mutex);
+	mutex_unlock(&n->mutex);
 err:
 	mutex_unlock(&n->dev.mutex);
+	kfree(oldubufs);
 	return r;
 }
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
 	long err;
 
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
 	if (err)
 		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
 done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	if (n->tx_sock)
+		fput(n->tx_sock->file);
+	if (n->rx_sock)
+		fput(n->rx_sock->file);
 	return err;
 }
 
@@ -788,17 +848,72 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
-		mutex_lock(&n->vqs[i].mutex);
-		n->vqs[i].vhost_hlen = vhost_hlen;
-		n->vqs[i].sock_hlen = sock_hlen;
-		mutex_unlock(&n->vqs[i].mutex);
+	for (i = 0; i < n->vqcnt; ++i) {
+		mutex_lock(&n->vqs[i]->mutex);
+		n->vqs[i]->vhost_hlen = vhost_hlen;
+		n->vqs[i]->sock_hlen = sock_hlen;
+		mutex_unlock(&n->vqs[i]->mutex);
 	}
 	vhost_net_flush(n);
 	mutex_unlock(&n->dev.mutex);
 	return 0;
 }
 
+static int vhost_netdev_init(struct vhost_net *n)
+{
+	struct vhost_dev *dev;
+	vhost_work_fn_t *handle_kicks;
+	int r, i;
+	int cur, prev = 0;
+	int sz = 64;
+	int vqcnt;
+	int *vqs_map;
+	dev = &n->dev;
+	vqcnt = dev->node_cnt * 2;
+	n->vqs =  kmalloc(vqcnt*sizeof(void *), GFP_KERNEL);
+	handle_kicks = kmalloc(vqcnt*sizeof(void *), GFP_KERNEL);
+	vqs_map = kmalloc(vqcnt*sizeof(int), GFP_KERNEL);
+	for (i = 0; i < vqcnt;) {
+		cur = find_next_bit(&n->dev.allow_map, sz, prev);
+		prev = cur;
+		handle_kicks[i++] = handle_rx_kick;
+		vqs_map[i] = cur;
+		handle_kicks[i++] = handle_tx_kick;
+		vqs_map[i] = cur;
+
+	}
+
+	r = vhost_dev_alloc_subdevs(dev, &n->dev.allow_map, sz);
+	if (r < 0) {
+		/* todo, err handling */
+		return r;
+	}
+	r = vhost_dev_alloc_vqs(dev, n->vqs, vqcnt, vqs_map, vqcnt, handle_kicks);
+	if (r < 0) {
+		/* todo, err handling */
+		return r;
+	}
+	r = vhost_dev_init(dev, n->vqs, vqcnt);
+	if (r < 0)
+		goto exit;
+	if (experimental_zcopytx)
+		vhost_enable_zcopy(dev, 0);
+
+	net_poll_init(n->poll+VHOST_NET_VQ_TX, POLLOUT);
+	net_poll_init(n->poll+VHOST_NET_VQ_RX, POLLIN);
+	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->numa_init = 1;
+	r = 0;
+exit:
+	kfree(handle_kicks);
+	kfree(vqs_map);
+	if (r == 0)
+		return 0;
+	kfree(n->vqs);
+	kfree(n);
+	return r;
+}
+
 static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 			    unsigned long arg)
 {
@@ -808,8 +923,23 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
 	struct vhost_vring_file backend;
 	u64 features;
 	int r;
+	/* todo, dynamic allocated */
+	unsigned long bmp, sz = 64;
+
+	if (!n->numa_init && ioctl != VHOST_NET_SET_NUMA)
+		return -EOPNOTSUPP;
 
 	switch (ioctl) {
+	case VHOST_NET_SET_NUMA:
+		/* 4 must be extended. */
+		if (copy_from_user(&bmp, argp, 4))
+			return -EFAULT;
+		r = check_numa_bmp(&bmp, sz);
+		if (r < 0)
+			return -EINVAL;
+		n->dev.allow_map = bmp;
+		r = vhost_netdev_init(n);
+		return r;
 	case VHOST_NET_SET_BACKEND:
 		if (copy_from_user(&backend, argp, sizeof backend))
 			return -EFAULT;
@@ -863,8 +993,6 @@ static struct miscdevice vhost_net_misc = {
 
 static int vhost_net_init(void)
 {
-	if (experimental_zcopytx)
-		vhost_enable_zcopy(VHOST_NET_VQ_TX);
 	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);
-- 
1.7.4.4


* [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr
  2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 2/2] [kvm/vhost-net]: make vhost net own NUMA attribute Liu Ping Fan
@ 2012-05-17  9:20 ` Liu Ping Fan
  2012-05-17  9:20 ` [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info Liu Ping Fan
  2012-05-18 16:14 ` [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Shirley Ma
  4 siblings, 0 replies; 11+ messages in thread
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

For each NUMA node reported by vhost, we allocate a pair of in/out vqs,
assign them MSI-X IRQs, and set the IRQ affinity to the set of vcpus on
the same node.
We also allocate the vqs PAGE_SIZE-aligned, so the host can place them on
different nodes when the page faults happen there.
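
A minimal sketch of the alignment idea (the helper name is illustrative; it
mirrors the vring_new_virtqueue() change below):

#include <linux/slab.h>
#include <linux/mm.h>

/* Round the ring allocation up to whole pages so each vq sits on its own
 * pages and the host can back them from the node that faults them in. */
static void *alloc_vq_page_aligned(size_t bytes)
{
	return kmalloc(PAGE_ALIGN(bytes), GFP_KERNEL);
}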

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c       |    2 +-
 drivers/virtio/virtio_pci.c   |   35 +++++++++++++++++++++++++++++++++--
 drivers/virtio/virtio_ring.c  |    9 ++++++---
 include/linux/virtio.h        |    9 +++++++++
 include/linux/virtio_config.h |    1 +
 include/linux/virtio_pci.h    |    9 +++++++++
 6 files changed, 59 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 984c501..79e873f 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -136,7 +136,7 @@ static int virtio_dev_probe(struct device *_d)
 			set_bit(i, dev->features);
 
 	dev->config->finalize_features(dev);
-
+	dev->config->get_numa_map(dev);
 	err = drv->probe(dev);
 	if (err)
 		add_status(dev, VIRTIO_CONFIG_S_FAILED);
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..5bb8a97 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -129,6 +129,24 @@ static void vp_finalize_features(struct virtio_device *vdev)
 	iowrite32(vdev->features[0], vp_dev->ioaddr+VIRTIO_PCI_GUEST_FEATURES);
 }
 
+static void vp_get_numa_map(struct virtio_device *vdev)
+{
+	int i, cnt,  sz = 32;
+	int cur, prev = 0;
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+
+	/* We only support 32 numa bits. */
+	vdev->allow_map = ioread32(vp_dev->ioaddr+VIRTIO_PCI_NUMA_MAP);
+	for (i = 0; i < sz; i++) {
+		cur = find_next_bit(&vdev->allow_map, sz, prev);
+		prev = cur;
+		if (cur >= sz)
+			break;
+		cnt++;
+	}
+	vdev->node_cnt = cnt;
+}
+
 /* virtio config->get() implementation */
 static void vp_get(struct virtio_device *vdev, unsigned offset,
 		   void *buf, unsigned len)
@@ -516,6 +534,8 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
 	u16 msix_vec;
 	int i, err, nvectors, allocated_vectors;
+	int irq, next, prev = 0;
+	struct cpumask *mask;
 
 	if (!use_msix) {
 		/* Old style: one normal interrupt for change and all vqs. */
@@ -562,14 +582,24 @@ static int vp_try_to_find_vqs(struct virtio_device *vdev, unsigned nvqs,
 			 sizeof *vp_dev->msix_names,
 			 "%s-%s",
 			 dev_name(&vp_dev->vdev.dev), names[i]);
-		err = request_irq(vp_dev->msix_entries[msix_vec].vector,
-				  vring_interrupt, 0,
+		irq = vp_dev->msix_entries[msix_vec].vector;
+		err = request_irq(irq, vring_interrupt, 0,
 				  vp_dev->msix_names[msix_vec],
 				  vqs[i]);
 		if (err) {
 			vp_del_vq(vqs[i]);
 			goto error_find;
 		}
+		if (i == vdev->node_cnt)
+			prev = 0;
+		/* fix me the @size */
+		next = find_next_bit(vdev->allow_map, 64, prev);
+		prev = next;
+		if (next < 64) {
+			mask = vnode_to_vcpumask(next);
+			mask = cpumask_and(mask, cpu_online_mask, mask);
+			irq_set_affinity(irq, mask);
+		}
 	}
 	return 0;
 
@@ -619,6 +649,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.del_vqs	= vp_del_vqs,
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
+	.get_numa_map = vp_get_numa_map,
 	.bus_name	= vp_bus_name,
 };
 
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..5baa949 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -626,15 +626,18 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
 				      const char *name)
 {
 	struct vring_virtqueue *vq;
-	unsigned int i;
+	unsigned int i, size, max;
 
 	/* We assume num is a power of 2. */
 	if (num & (num - 1)) {
 		dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num);
 		return NULL;
 	}
-
-	vq = kmalloc(sizeof(*vq) + sizeof(void *)*num, GFP_KERNEL);
+	size = PAGE_ALIGN (sizeof(*vq) + sizeof(void *)*num);
+	/* Allocate on PAGE boundary, so host can locate them at proper
+	 * node
+	 */
+	vq = kmalloc(size, GFP_KERNEL);
 	if (!vq)
 		return NULL;
 
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..ec992c9 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -9,6 +9,12 @@
 #include <linux/mod_devicetable.h>
 #include <linux/gfp.h>
 
+struct virtio_node {
+	int node_id;
+	struct virtqueue *rvq;
+	struct virtqueue *svq;
+};
+
 /**
  * virtqueue - a queue to register buffers for sending or receiving.
  * @list: the chain of virtqueues for this device
@@ -22,6 +28,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	struct virtio_node *node;
 	void *priv;
 };
 
@@ -66,6 +73,8 @@ struct virtio_device {
 	struct virtio_device_id id;
 	struct virtio_config_ops *config;
 	struct list_head vqs;
+	int node_cnt;
+	unsigned long allow_map;
 	/* Note that this is a Linux set_bit-style bitmap. */
 	unsigned long features[1];
 	void *priv;
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 7323a33..5e2fd77 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -124,6 +124,7 @@ struct virtio_config_ops {
 	void (*del_vqs)(struct virtio_device *);
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
+	void (*get_numa_map)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
 };
 
diff --git a/include/linux/virtio_pci.h b/include/linux/virtio_pci.h
index ea66f3f..1426717 100644
--- a/include/linux/virtio_pci.h
+++ b/include/linux/virtio_pci.h
@@ -78,9 +78,18 @@
 /* Vector value used to disable MSI for queue */
 #define VIRTIO_MSI_NO_VECTOR            0xffff
 
+#ifdef VIRTIO_NUMA
+/* 32bits to show allowed numa */
+#define VIRTIO_PCI_NUMA_MAP         24
+
+/* The remaining space is defined by each driver as the per-driver
+ * configuration space */
+#define VIRTIO_PCI_CONFIG(dev)		28
+#else
 /* The remaining space is defined by each driver as the per-driver
  * configuration space */
 #define VIRTIO_PCI_CONFIG(dev)		((dev)->msix_enabled ? 24 : 20)
+#endif
 
 /* Virtio ABI version, this must match exactly */
 #define VIRTIO_PCI_ABI_VERSION		0
-- 
1.7.4.4


* [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info
  2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
                   ` (2 preceding siblings ...)
  2012-05-17  9:20 ` [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr Liu Ping Fan
@ 2012-05-17  9:20 ` Liu Ping Fan
  2012-05-18 16:14 ` [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Shirley Ma
  4 siblings, 0 replies; 11+ messages in thread
From: Liu Ping Fan @ 2012-05-17  9:20 UTC (permalink / raw)
  To: kvm, netdev
  Cc: linux-kernel, qemu-devel, Avi Kivity, Michael S. Tsirkin,
	Srivatsa Vaddagiri, Rusty Russell, Anthony Liguori, Ryan Harper,
	Shirley Ma, Krishna Kumar, Tom Lendacky

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Vhost net uses a separate transfer logic unit on each node.
Virtio net must determine which logic unit it will talk with,
so we can improve performance.
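
A minimal sketch of the queue selection (struct and function names are
illustrative; the current vcpu's node is assumed to come from the
vcpu_to_vnode_map introduced in this patch):

#include <linux/virtio.h>

/* One rx/tx virtqueue pair per host node, tagged with the node it belongs to. */
struct vnet_unit {
	int node_id;
	struct virtqueue *rvq;
	struct virtqueue *svq;
};

/* Pick the tx queue of the unit matching the node the current vcpu runs on;
 * return NULL if there is no local unit so the caller can fall back. */
static struct virtqueue *pick_tx_vq(struct vnet_unit **units, int cnt,
				    int cur_node)
{
	int i;

	for (i = 0; i < cnt; i++)
		if (units[i]->node_id == cur_node)
			return units[i]->svq;
	return NULL;
}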

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/net/virtio_net.c |  425 ++++++++++++++++++++++++++++++++++------------
 1 files changed, 314 insertions(+), 111 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index af8acc8..31abafa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -50,16 +50,32 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
+struct napi_info {
+	struct napi_struct napi;
+	struct work_struct enable_napi;
+};
+
+struct vnet_virtio_node {
+	struct virtio_node vnode;
+	int demo_cpu;
+	struct napi_info info;
+	struct delayed_work refill;
+	struct virtnet_info *owner;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	/* we want to scatter in different host nodes */
+	struct virtqueue **vqs, **rvqs, **svqs;
+	struct virtqueue *cvq;
+	/* we want to scatter in different host nodes */
+	struct vnet_virtio_node **vnet_nodes;
 	struct net_device *dev;
-	struct napi_struct napi;
+
 	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
-
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -69,9 +85,6 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -136,7 +149,6 @@ static void skb_xmit_done(struct virtqueue *svq)
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
-
 	/* We were probably waiting for more output buffers. */
 	netif_wake_queue(vi->dev);
 }
@@ -220,7 +232,8 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb,
+	struct virtqueue *rvq)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -234,7 +247,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rvq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -252,7 +265,8 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct net_device *dev, void *buf, unsigned int len,
+	struct virtqueue *rvq)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
@@ -283,7 +297,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(vi, skb, rvq)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -353,7 +367,67 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+/* todo: this will be redesigned as part of exporting host numa info to
+ * the guest scheduler */
+/* fix me, host numa node id directly exposed to guest? */
+
+/* fill in by host */
+static s16 __vapicid_to_vnode[MAX_LOCAL_APIC];
+/* fix me, HOST_NUMNODES is defined by host */
+#define  HOST_NUMNODES  128
+static struct cpumask vnode_to_vcpumask_map[HOST_NUMNODES];
+DECLARE_PER_CPU(int, vcpu_to_vnode_map);
+
+void init_vnode_map(void)
+{
+	int cpu, apicid, vnode;
+	for_each_possible_cpu(cpu) {
+		apicid = cpu_physical_id(cpu);
+		vnode = __vapicid_to_vnode[apicid];
+		per_cpu(vcpu_to_vnode_map, cpu) = vnode;
+	}
+}
+
+struct cpumask *vnode_to_vcpumask(int virtio_node)
+{
+	struct cpumask *msk = &vnode_to_vcpumask_map[virtio_node];
+	return msk;
+}
+
+static int first_vcpu_on_virtio_node(int virtio_node)
+{
+	 struct cpumask *msk = vnode_to_vcpumask(virtio_node);
+	 return cpumask_first(msk);
+}
+
+static int vcpu_to_virtio_node(void)
+{
+	int vnode = __get_cpu_var(vcpu_to_vnode_map);
+	return vnode;
+}
+/* end of todo */
+
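+/* Pick the rx (rx != 0) or tx virtqueue that belongs to the virtio node
+ * of the vcpu we are currently running on. */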
+static int virtqueue_pickup(struct virtnet_info *vi, struct virtqueue **vq, int rx)
+{
+	int node;
+	int i;
+	struct vnet_virtio_node *vnnode;
+	node = vcpu_to_virtio_node();
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		if (vnnode->vnode.node_id == node) {
+			if (rx == 0)
+				*vq = vnnode->vnode.svq;
+			else
+				*vq = vnnode->vnode.rvq;
+			return 0;
+		}
+	}
+	*vq = NULL;
+	return -1;
+}
+
+static int add_recvbuf_small(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
@@ -369,15 +443,14 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
@@ -415,7 +488,8 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
@@ -423,7 +497,7 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct virtnet_info *vi, struct virtqueue *vq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
@@ -433,8 +507,7 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
 		return -ENOMEM;
 
 	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
-
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(vq, vi->rx_sg, 0, 1, page, gfp);
 	if (err < 0)
 		give_pages(vi, page);
 
@@ -448,18 +521,17 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct virtnet_info *vi, struct virtqueue *rvq, gfp_t gfp)
 {
 	int err;
 	bool oom;
-
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(vi, rvq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(vi, rvq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(vi, rvq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
@@ -468,31 +540,79 @@ static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
 	} while (err > 0);
 	if (unlikely(vi->num > vi->max))
 		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+
+	virtqueue_kick(rvq);
 	return !oom;
 }
 
+static void try_fill_all_recv(struct virtnet_info *vi, gfp_t gfp)
+{
+	int i, cpu;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		/* if oom, let refill_work retry on a vcpu of this node */
+		if (!try_fill_recv(vi, vnnode->vnode.rvq, gfp)) {
+			cpu = first_vcpu_on_virtio_node(vnnode->vnode.node_id);
+			queue_delayed_work_on(cpu, system_nrt_wq, &vnnode->refill, 0);
+		}
+	}
+}
+
 static void skb_recv_done(struct virtqueue *rvq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct vnet_virtio_node *vnet_node = container_of(rvq->node, struct vnet_virtio_node, vnode);
+	struct napi_struct *napi = &vnet_node->info.napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
+	if (napi_schedule_prep(napi)) {
 		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct napi_struct *napi, struct virtqueue *rvq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(rvq);
+		__napi_schedule(napi);
+	}
+}
+
+static void virtnet_napis_disable(struct virtnet_info *vi)
+{
+	int i;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		napi_disable(&vnnode->info.napi);
+	}
+}
+
+static void napi_enable_worker(struct work_struct *work)
+{
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	virtnet_napi_enable(&vnnode->info.napi, rvq);
+}
+
+static void virtnet_napis_enable(struct virtnet_info *vi)
+{
+	int i;
+	struct work_struct *work;
+	struct vnet_virtio_node *vnnode;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = vi->vnet_nodes[i];
+		work = &vnnode->info.enable_napi;
+		queue_work_on(vnnode->demo_cpu, system_wq, work);
 	}
 }
 
@@ -500,43 +620,52 @@ static void refill_work(struct work_struct *work)
 {
 	struct virtnet_info *vi;
 	bool still_empty;
+	struct napi_struct *napi;
+	struct virtqueue *rvq;
+	struct vnet_virtio_node *vnnode = container_of(work,
+		struct vnet_virtio_node, refill.work);
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	vi = vnnode->owner;
+	napi = &vnnode->info.napi;
+	rvq = vnnode->vnode.rvq;
+	napi_disable(napi);
+
+	still_empty = !try_fill_recv(vi, rvq, GFP_KERNEL);
+	virtnet_napi_enable(napi, rvq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work_on(vnnode->demo_cpu, system_nrt_wq, &vnnode->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct virtnet_info *vi;
 	void *buf;
 	unsigned int len, received = 0;
-
+	struct vnet_virtio_node *vnnode = container_of(napi, struct vnet_virtio_node, info.napi);
+	struct virtqueue *rvq = vnnode->vnode.rvq;
+	vi = vnnode->owner;
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
+	       (buf = virtqueue_get_buf(rvq, &len)) != NULL) {
+		receive_buf(vi->dev, buf, len, rvq);
 		--vi->num;
 		received++;
 	}
 
 	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+		if (!try_fill_recv(vi, rvq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &vnnode->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rvq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rvq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -545,13 +674,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi, struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -565,7 +694,7 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct virtqueue *svq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -608,7 +737,8 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
 	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+
+	return virtqueue_add_buf(svq, vi->tx_sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
@@ -616,12 +746,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	struct virtqueue *svq;
+
+	/* pick the tx queue of the virtio node this vcpu runs on */
+	if (virtqueue_pickup(vi, &svq, 0) < 0) {
+		kfree_skb(skb);
+		return NETDEV_TX_OK;
+	}
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, svq, skb);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
@@ -640,7 +772,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -650,12 +782,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
 		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		if (unlikely(!virtqueue_enable_cb_delayed(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
 				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -718,20 +850,15 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	napi_schedule(&vi->napi);
+	int i;
+
+	for (i = 0; i < vi->vdev->node_cnt; i++)
+		napi_schedule(&vi->vnet_nodes[i]->info.napi);
 }
 #endif
 
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
-	virtnet_napi_enable(vi);
+	try_fill_all_recv(vi, GFP_KERNEL);
+	virtnet_napis_enable(vi);
 	return 0;
 }
 
@@ -783,11 +910,10 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-
-	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
-
+	int i;
+	for (i = 0; i < vi->vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
+	virtnet_napis_disable(vi);
 	return 0;
 }
 
@@ -897,9 +1023,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 				struct ethtool_ringparam *ring)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct vnet_virtio_node *vnnode =  vi->vnet_nodes[0];
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vnnode->vnode.rvq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vnnode->vnode.svq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
 
@@ -986,29 +1113,61 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 
 static int init_vqs(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
+	struct virtqueue **vqs;
 	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
-
+	const char **name_array;
+	vq_callback_t **callbacks;
+	int node_cnt, nvqs, err = -ENOMEM;
+	int i;
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	/* We expect one receive and one send virtqueue per virtio node,
+	 * and optionally a control virtqueue. */
+	node_cnt = vi->vdev->node_cnt;
+	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ?
+		node_cnt*2 + 1 : node_cnt*2;
+	callbacks = kzalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if (callbacks == NULL)
+		return -ENOMEM;
+	for (i = 0; i < node_cnt; i++)
+		callbacks[i] = skb_recv_done;
+	for (; i < node_cnt*2; i++)
+		callbacks[i] = skb_xmit_done;
+
+	name_array = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if (name_array == NULL)
+		goto free_callbacks;
+
+	for (i = 0; i < node_cnt; i++)
+		name_array[i] = names[0];
+	for (; i < node_cnt*2; i++)
+		name_array[i] = names[1];
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
+		name_array[i] = names[2];
+
+	vqs = kmalloc(sizeof(void *)*nvqs, GFP_KERNEL);
+	if (vqs == NULL)
+		goto free_name;
 
 	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
 	if (err)
-		return err;
+		goto free_vqs;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	vi->vqs = vqs;
+	vi->rvqs = vi->vqs;
+	vi->svqs = vi->vqs + vi->vdev->node_cnt;
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
+		vi->cvq = vi->vqs[vi->vdev->node_cnt*2];
 
 		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
 			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
 	}
-	return 0;
+	err = 0;
+free_vqs:
+	if (err)
+		kfree(vqs);
+free_name:
+	kfree(name_array);
+free_callbacks:
+	kfree(callbacks);
+	return err;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
@@ -1016,6 +1175,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	int err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	int i, size, cur, prev = 0;
+	struct vnet_virtio_node *vnnode;
 
 	/* Allocate ourselves a network device with room for our info */
 	dev = alloc_etherdev(sizeof(struct virtnet_info));
@@ -1064,7 +1225,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
+
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
@@ -1074,7 +1235,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->stats == NULL)
 		goto free;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
 	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
@@ -1086,19 +1246,46 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
-
 	err = init_vqs(vi);
 	if (err)
 		goto free_stats;
 
+	/* Which host node each napi_struct ends up on is determined by the
+	 * page faults handled by KVM, so allocate them separately!
+	 */
+	vi->vnet_nodes = kmalloc(sizeof(void *) * vi->vdev->node_cnt, GFP_KERNEL);
+	if (vi->vnet_nodes == NULL) {
+		err = -ENOMEM;
+		goto free_vqs;
+	}
+	size = PAGE_ALIGN(sizeof(struct vnet_virtio_node));
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		vnnode = kmalloc(size, GFP_KERNEL);
+		if (vnnode == NULL) {
+			err = -ENOMEM;
+			goto free_napi;
+		}
+		cur = find_next_bit(&vi->vdev->allow_map, 64, prev);
+		prev = cur + 1;	/* advance past the node we just took */
+		vnnode->vnode.node_id = cur;
+		vnnode->owner = vi;
+		vnnode->vnode.rvq = vi->rvqs[i];
+		vnnode->vnode.svq = vi->svqs[i];
+		vnnode->demo_cpu = first_vcpu_on_virtio_node(cur);
+
+		vi->rvqs[i]->node = &vnnode->vnode;
+		vi->svqs[i]->node = &vnnode->vnode;
+
+		INIT_WORK(&vnnode->info.enable_napi, napi_enable_worker);
+		netif_napi_add(dev, &vnnode->info.napi, virtnet_poll, napi_weight);
+		INIT_DELAYED_WORK(&vnnode->refill, refill_work);
+		vi->vnet_nodes[i] = vnnode;
+	}
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
 		goto free_vqs;
 	}
 
-	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
+	try_fill_all_recv(vi, GFP_KERNEL);
+
 
 	/* If we didn't even get one input buffer, we're useless. */
 	if (vi->num == 0) {
@@ -1121,6 +1308,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 unregister:
 	unregister_netdev(dev);
+free_napi:
+	while (--i >= 0) {
+		vnnode = vi->vnet_nodes[i];
+		netif_napi_del(&vnnode->info.napi);
+		kfree(vnnode);
+	}
+	kfree(vi->vnet_nodes);
 free_vqs:
 	vdev->config->del_vqs(vdev);
 free_stats:
@@ -1133,32 +1326,39 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
+	int i;
+	struct virtqueue *svq, *rvq;
+	for (i = 0; i < vi->vdev->node_cnt; i++) {
+		svq = vi->svqs[i];
+		rvq = vi->rvqs[i];
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
 			dev_kfree_skb(buf);
-		--vi->num;
+		}
+		while (1) {
+			buf = virtqueue_detach_unused_buf(rvq);
+			if (!buf)
+				break;
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi, buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->num;
+		}
 	}
 	BUG_ON(vi->num != 0);
 }
 
+
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
 
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
-
 	vi->vdev->config->del_vqs(vi->vdev);
 
 	while (vi->pages)
@@ -1172,7 +1372,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-
+	kfree(vi->vqs);
+	kfree(vi->vnet_nodes);
 	free_percpu(vi->stats);
 	free_netdev(vi->dev);
 }
@@ -1181,17 +1382,22 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
-	virtqueue_disable_cb(vi->rvq);
-	virtqueue_disable_cb(vi->svq);
+	for (i = 0; i < vdev->node_cnt; i++) {
+		virtqueue_disable_cb(vi->rvqs[i]);
+		virtqueue_disable_cb(vi->svqs[i]);
+	}
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ))
 		virtqueue_disable_cb(vi->cvq);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+
+	for (i = 0; i < vdev->node_cnt; i++)
+		cancel_delayed_work_sync(&vi->vnet_nodes[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		virtnet_napis_disable(vi);
 
 	remove_vq_common(vi);
 
@@ -1208,13 +1414,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		virtnet_napis_enable(vi);
 
 	netif_device_attach(vi->dev);
-
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
-
+	try_fill_all_recv(vi, GFP_KERNEL);
 	return 0;
 }
 #endif
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
                   ` (3 preceding siblings ...)
  2012-05-17  9:20 ` [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info Liu Ping Fan
@ 2012-05-18 16:14 ` Shirley Ma
  2012-05-22  9:28   ` Liu ping fan
  4 siblings, 1 reply; 11+ messages in thread
From: Shirley Ma @ 2012-05-18 16:14 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, netdev, linux-kernel, qemu-devel, Avi Kivity,
	Michael S. Tsirkin, Srivatsa Vaddagiri, Rusty Russell,
	Anthony Liguori, Ryan Harper, Shirley Ma, Krishna Kumar,
	Tom Lendacky

On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
> Currently, the guest can not know the NUMA info of the vcpu, which
> will
> result in performance drawback.
> 
> This is the discovered and experiment by
>         Shirley Ma <xma@us.ibm.com>
>         Krishna Kumar <krkumar2@in.ibm.com>
>         Tom Lendacky <toml@us.ibm.com>
> Refer to -
> http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
> we can see the big perfermance gap between NUMA aware and unaware.
> 
> Enlightened by their discovery, I think, we can do more work -- that
> is to
> export NUMA info of host to guest.

There are three problems we've found:

1. KVM doesn't support a NUMA load balancer. Even when there are no
other workloads in the system and the number of vcpus in the guest is
smaller than the number of cpus per node, the vcpus can be scheduled on
different nodes.

Someone is working on an in-kernel solution. Andrew Theurer has a
working user-space NUMA aware VM balancer; it requires libvirt and
cgroups (which are the default on RHEL6 systems).

2. The host scheduler is not aware of the relationship between guest
vCPUs and vhost. So it's possible for the host scheduler to schedule
the per-device vhost thread on the same cpu on which the vCPU kicked a
TX packet, or to schedule the vhost thread on a different node than the
vCPU; for RX it's possible for vhost to deliver the packet to a vCPU
running on a different node too.

3. The per-device vhost thread does not scale.

So the problems are in host scheduling and vhost thread scalability. I
am not sure how much help exposing NUMA info from host to guest would be.
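As a rough illustration of what such user-space placement can look
like, here is a minimal sketch only; the vhost thread tid and target
node are placeholders a balancer would have to discover itself (e.g.
from /proc/<qemu-pid>/task and the VM's placement):

#include <sys/types.h>
#include <numa.h>

/* sketch: confine one vhost worker thread to the cpus of the host node
 * that runs the guest's vcpu threads (vhost_tid and node are assumed to
 * be known already) */
static int pin_vhost_to_node(pid_t vhost_tid, int node)
{
	struct bitmask *cpus;
	int ret;

	if (numa_available() < 0)
		return -1;

	cpus = numa_allocate_cpumask();
	if (numa_node_to_cpus(node, cpus) < 0) {
		numa_free_cpumask(cpus);
		return -1;
	}
	/* pin the vhost thread to that node's cpus */
	ret = numa_sched_setaffinity(vhost_tid, cpus);
	numa_free_cpumask(cpus);
	return ret;
}

A full balancer would repeat this for every vcpu/vhost thread of the VM
and redo it whenever the VM is moved to another node.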

Have you tested these patches? How much performance gain do you see?

Thanks
Shirley 

> So here comes the idea:
> 1. export host numa info through guest's sched domain to its scheduler
>   Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
>   has been handled by host).  So the guest's lb will consider the
> cost.
>   I am still working on this, and my original idea is to export these
> info
>   through "static struct sched_domain_topology_level
> *sched_domain_topology"
>   to guest.
> 
> 2. Do a better emulation of virt mach exported to guest.
>   In real world, the devices are limited by kinds of reasons to own
> the NUMA
>   property. But as to Qemu, the device is emulated by thread, which
> inherit
>   the NUMA attr in nature.  We can implement the device as components
> of many
>   logic units, each of the unit is backed by a thread in different
> host node.
>   Currently, I want to start the work on vhost. But I think, maybe in
>   future, the iothread in Qemu can also has such attr.
> 
> 
> Forgive me, for the limited time, I can not have more better
> understand of
> vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_
> from work.
> I will do more detail work for them in future.
> 
> To easy the review, the following is the sum up of the 2nd point of
> the idea.
> As for the 1st point of the idea, it is not reflected in the patches.
> 
> --spread/shrink the vhost_workers over the host nodes as demanded from
> Qemu.
>   And we can consider each vhost_worker as an independent net logic
> device
>   embeded in physical device "vhost_net".  At the meanwhile, we spread
> vcpu
>   threads over the host node. 
>   The vrings on guest are allocated PAGE_SIZE align separately, so
> they can 
>   will only be mapped into different host node, so vhost_worker in the
> same
>   node can access it with the least cost. So does the vq on guest.
> 
> --virtio_net driver will changes and talk with the logic device. And
> which
>   logic device it will talk to is determined by on which vcpu it is
> scheduled.
> 
> --the binding of vcpus and vhost_worker is implemented by: 
>   for call direction, vq-a in the node-A will have a dedicated irq-a.
> And 
>   we set the irq-a's affinity to vcpus in node-A.
>   for kick direction, kick register-b trigger different eventfd-b
> which wake up
>   vhost_worker-b.
> 
> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-18 16:14 ` [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Shirley Ma
@ 2012-05-22  9:28   ` Liu ping fan
  2012-05-23 14:52     ` Andrew Theurer
  0 siblings, 1 reply; 11+ messages in thread
From: Liu ping fan @ 2012-05-22  9:28 UTC (permalink / raw)
  To: Shirley Ma
  Cc: kvm, netdev, linux-kernel, qemu-devel, Avi Kivity,
	Michael S. Tsirkin, Srivatsa Vaddagiri, Rusty Russell,
	Anthony Liguori, Ryan Harper, Shirley Ma, Krishna Kumar,
	Tom Lendacky

On Sat, May 19, 2012 at 12:14 AM, Shirley Ma <mashirle@us.ibm.com> wrote:
> On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
>> Currently, the guest can not know the NUMA info of the vcpu, which
>> will
>> result in performance drawback.
>>
>> This is the discovered and experiment by
>>         Shirley Ma <xma@us.ibm.com>
>>         Krishna Kumar <krkumar2@in.ibm.com>
>>         Tom Lendacky <toml@us.ibm.com>
>> Refer to -
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
>> we can see the big perfermance gap between NUMA aware and unaware.
>>
>> Enlightened by their discovery, I think, we can do more work -- that
>> is to
>> export NUMA info of host to guest.
>
> There three problems we've found:
>
> 1. KVM doesn't support NUMA load balancer. Even there are no other
> workloads in the system, and the number of vcpus on the guest is smaller
> than the number of cpus per node, the vcpus could be scheduled on
> different nodes.
>
> Someone is working on in-kernel solution. Andrew Theurer has a working
> user-space NUMA aware VM balancer, it requires libvirt and cgroups
> (which is default for RHEL6 systems).
>
Interesting, and I found that "sched/numa: Introduce
sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
But I think that, from the guest's view, it cannot tell whether two
vcpus are on the same host node. For example, if vcpu-a is in node-A
and vcpu-b is in node-B, the guest lb will be more expensive if it
pull_task from vcpu-a and chooses vcpu-b as the push target. My idea is
to export such info to the guest; I am still working on it.


> 2. The host scheduler is not aware the relationship between guest vCPUs
> and vhost. So it's possible for host scheduler to schedule per-device
> vhost thread on the same cpu on which the vCPU kick a TX packet, or
> schecule vhost thread on different node than the vCPU for; For RX packet
> it's possible for vhost delivers RX packet on the vCPU running on
> different node too.
>
Yes. I noticed this point in your original patch.

> 3. per-device vhost thread is not scaled.
>
What about the scalability of per-VM * host_NUMA_NODE? When we take
advantage of multi-core, we create multiple vcpu threads for one VM.
So what about the emulated device? Is it acceptable for it to scale so
that it can take advantage of the host NUMA attributes? After all, how
many nodes the VM can run on is under the user's control. It is a
balance between scalability and performance.

> So the problems are in host scheduling and vhost thread scalability. I
> am not sure how much help from exposing NUMA info from host to guest.
>
> Have you tested these patched? How much performance gain here?
>
Sorry, not yet.  As you have mentioned, the vhost thread scalability
is a big problem. So I want to see others' opinions before going on.

Thanks and regards,
pingfan


> Thanks
> Shirley
>
>> So here comes the idea:
>> 1. export host numa info through guest's sched domain to its scheduler
>>   Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
>>   has been handled by host).  So the guest's lb will consider the
>> cost.
>>   I am still working on this, and my original idea is to export these
>> info
>>   through "static struct sched_domain_topology_level
>> *sched_domain_topology"
>>   to guest.
>>
>> 2. Do a better emulation of virt mach exported to guest.
>>   In real world, the devices are limited by kinds of reasons to own
>> the NUMA
>>   property. But as to Qemu, the device is emulated by thread, which
>> inherit
>>   the NUMA attr in nature.  We can implement the device as components
>> of many
>>   logic units, each of the unit is backed by a thread in different
>> host node.
>>   Currently, I want to start the work on vhost. But I think, maybe in
>>   future, the iothread in Qemu can also has such attr.
>>
>>
>> Forgive me, for the limited time, I can not have more better
>> understand of
>> vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_
>> from work.
>> I will do more detail work for them in future.
>>
>> To easy the review, the following is the sum up of the 2nd point of
>> the idea.
>> As for the 1st point of the idea, it is not reflected in the patches.
>>
>> --spread/shrink the vhost_workers over the host nodes as demanded from
>> Qemu.
>>   And we can consider each vhost_worker as an independent net logic
>> device
>>   embeded in physical device "vhost_net".  At the meanwhile, we spread
>> vcpu
>>   threads over the host node.
>>   The vrings on guest are allocated PAGE_SIZE align separately, so
>> they can
>>   will only be mapped into different host node, so vhost_worker in the
>> same
>>   node can access it with the least cost. So does the vq on guest.
>>
>> --virtio_net driver will changes and talk with the logic device. And
>> which
>>   logic device it will talk to is determined by on which vcpu it is
>> scheduled.
>>
>> --the binding of vcpus and vhost_worker is implemented by:
>>   for call direction, vq-a in the node-A will have a dedicated irq-a.
>> And
>>   we set the irq-a's affinity to vcpus in node-A.
>>   for kick direction, kick register-b trigger different eventfd-b
>> which wake up
>>   vhost_worker-b.
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-22  9:28   ` Liu ping fan
@ 2012-05-23 14:52     ` Andrew Theurer
  2012-05-23 15:16       ` Michael S. Tsirkin
  2012-05-25  4:05       ` Liu ping fan
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Theurer @ 2012-05-23 14:52 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Shirley Ma, kvm, netdev, linux-kernel, qemu-devel, Avi Kivity,
	Michael S. Tsirkin, Srivatsa Vaddagiri, Rusty Russell,
	Anthony Liguori, Ryan Harper, Shirley Ma, Krishna Kumar,
	Tom Lendacky

On 05/22/2012 04:28 AM, Liu ping fan wrote:
> On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
>> On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
>>> Currently, the guest can not know the NUMA info of the vcpu, which
>>> will
>>> result in performance drawback.
>>>
>>> This is the discovered and experiment by
>>>          Shirley Ma<xma@us.ibm.com>
>>>          Krishna Kumar<krkumar2@in.ibm.com>
>>>          Tom Lendacky<toml@us.ibm.com>
>>> Refer to -
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
>>> we can see the big perfermance gap between NUMA aware and unaware.
>>>
>>> Enlightened by their discovery, I think, we can do more work -- that
>>> is to
>>> export NUMA info of host to guest.
>>
>> There three problems we've found:
>>
>> 1. KVM doesn't support NUMA load balancer. Even there are no other
>> workloads in the system, and the number of vcpus on the guest is smaller
>> than the number of cpus per node, the vcpus could be scheduled on
>> different nodes.
>>
>> Someone is working on in-kernel solution. Andrew Theurer has a working
>> user-space NUMA aware VM balancer, it requires libvirt and cgroups
>> (which is default for RHEL6 systems).
>>
> Interesting, and I found that "sched/numa: Introduce
> sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
> But I think from the guest view, it can not tell whether the two vcpus
> are on the same host node. For example,
> vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
> expensive if it pull_task from vcpu-a and
> choose vcpu-b to push.  And my idea is to export such info to guest,
> still working on it.

The long-term solution is two-fold:
1) Guests that are quite large (in that they cannot fit in a host NUMA
node) must have a static multi-node NUMA topology implemented by Qemu.
That is here today, but we do not do it automatically, which is probably
going to be a VM management responsibility.
2) Host scheduler and NUMA code must be enhanced to get better placement
of Qemu memory and threads.  For single-node vNUMA guests, this is easy:
put it all in one node.  For multi-node vNUMA guests, the host must
understand that some Qemu memory belongs with certain vCPU threads
(which make up one of the guest's vNUMA nodes), and then place that
memory/threads in a specific host node (and continue for other
memory/threads for each Qemu vNUMA node).

Note that even if a guest's memory/threads for a vNUMA node are
relocated to another host node (which will be necessary), the NUMA
characteristics of the guest are still maintained (as all those vCPUs and
memory are still "close" to each other).

The problem with exposing the host's NUMA info directly to the guest is
that (1) vCPUs will get relocated, so their topology info in the guest
will have to change over time. IMO that is a bad idea.  We have a hard
enough time getting applications to work with static NUMA info.  To
get applications to react to a changing NUMA topology is not going to
turn out well. (2) Every single guest would have to have the same number
of NUMA nodes defined as the host.  That is overkill, especially for
small guests.
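To make that concrete, a minimal libnuma sketch of the per-vNUMA-node
placement described above, under the assumption that the management
tool already knows which slice of guest RAM belongs to the vNUMA node
and which host node it should go to (the names here are illustrative,
not Qemu code):

#include <stddef.h>
#include <numa.h>

/* sketch: bind one vNUMA node's slice of guest RAM to a host node and
 * keep the calling (vcpu) thread on that node's cpus */
static int place_vnuma_node(void *guest_ram, size_t size, int host_node)
{
	if (numa_available() < 0)
		return -1;

	/* migrate/bind this memory range to the host node */
	numa_tonode_memory(guest_ram, size, host_node);

	/* run the current thread only on cpus of that node */
	return numa_run_on_node(host_node);
}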
>
>
>> 2. The host scheduler is not aware the relationship between guest vCPUs
>> and vhost. So it's possible for host scheduler to schedule per-device
>> vhost thread on the same cpu on which the vCPU kick a TX packet, or
>> schecule vhost thread on different node than the vCPU for; For RX packet
>> it's possible for vhost delivers RX packet on the vCPU running on
>> different node too.
>>
> Yes. I notice this point in your original patch.
>
>> 3. per-device vhost thread is not scaled.
>>
> What about the scale-ability of per-vm * host_NUMA_NODE? When we make
> advantage of multi-core,  we produce mulit vcpu threads for one VM.
> So what about the emulated device? Is it acceptable to scale to take
> advantage of host NUMA attr.  After all, how many nodes on which the
> VM
> can be run on are the user's control.  It is a balance of
> scale-ability and performance.
>
>> So the problems are in host scheduling and vhost thread scalability. I
>> am not sure how much help from exposing NUMA info from host to guest.
>>
>> Have you tested these patched? How much performance gain here?
>>
> Sorry, not yet.  As you have mentioned, the vhost thread scalability
> is a big problem. So I want to see others' opinion before going on.
>
> Thanks and regards,
> pingfan
>
>
>> Thanks
>> Shirley
>>
>>> So here comes the idea:
>>> 1. export host numa info through guest's sched domain to its scheduler
>>>    Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
>>>    has been handled by host).  So the guest's lb will consider the
>>> cost.
>>>    I am still working on this, and my original idea is to export these
>>> info
>>>    through "static struct sched_domain_topology_level
>>> *sched_domain_topology"
>>>    to guest.
>>>
>>> 2. Do a better emulation of virt mach exported to guest.
>>>    In real world, the devices are limited by kinds of reasons to own
>>> the NUMA
>>>    property. But as to Qemu, the device is emulated by thread, which
>>> inherit
>>>    the NUMA attr in nature.  We can implement the device as components
>>> of many
>>>    logic units, each of the unit is backed by a thread in different
>>> host node.
>>>    Currently, I want to start the work on vhost. But I think, maybe in
>>>    future, the iothread in Qemu can also has such attr.
>>>
>>>
>>> Forgive me, for the limited time, I can not have more better
>>> understand of
>>> vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_
>>> from work.
>>> I will do more detail work for them in future.
>>>
>>> To easy the review, the following is the sum up of the 2nd point of
>>> the idea.
>>> As for the 1st point of the idea, it is not reflected in the patches.
>>>
>>> --spread/shrink the vhost_workers over the host nodes as demanded from
>>> Qemu.
>>>    And we can consider each vhost_worker as an independent net logic
>>> device
>>>    embeded in physical device "vhost_net".  At the meanwhile, we spread
>>> vcpu
>>>    threads over the host node.
>>>    The vrings on guest are allocated PAGE_SIZE align separately, so
>>> they can
>>>    will only be mapped into different host node, so vhost_worker in the
>>> same
>>>    node can access it with the least cost. So does the vq on guest.
>>>
>>> --virtio_net driver will changes and talk with the logic device. And
>>> which
>>>    logic device it will talk to is determined by on which vcpu it is
>>> scheduled.
>>>
>>> --the binding of vcpus and vhost_worker is implemented by:
>>>    for call direction, vq-a in the node-A will have a dedicated irq-a.
>>> And
>>>    we set the irq-a's affinity to vcpus in node-A.
>>>    for kick direction, kick register-b trigger different eventfd-b
>>> which wake up
>>>    vhost_worker-b.
>>>
-Andrew Theurer

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-23 14:52     ` Andrew Theurer
@ 2012-05-23 15:16       ` Michael S. Tsirkin
  2012-05-25  3:29         ` Liu ping fan
  2012-05-25  4:05       ` Liu ping fan
  1 sibling, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2012-05-23 15:16 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Krishna Kumar, Rusty Russell, Shirley Ma, kvm, netdev,
	Shirley Ma, qemu-devel, Liu ping fan, linux-kernel, Tom Lendacky,
	Ryan Harper, Avi Kivity, Anthony Liguori, Srivatsa Vaddagiri

On Wed, May 23, 2012 at 09:52:15AM -0500, Andrew Theurer wrote:
> On 05/22/2012 04:28 AM, Liu ping fan wrote:
> >On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
> >>On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
> >>>Currently, the guest can not know the NUMA info of the vcpu, which
> >>>will
> >>>result in performance drawback.
> >>>
> >>>This is the discovered and experiment by
> >>>         Shirley Ma<xma@us.ibm.com>
> >>>         Krishna Kumar<krkumar2@in.ibm.com>
> >>>         Tom Lendacky<toml@us.ibm.com>
> >>>Refer to -
> >>>http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
> >>>we can see the big perfermance gap between NUMA aware and unaware.
> >>>
> >>>Enlightened by their discovery, I think, we can do more work -- that
> >>>is to
> >>>export NUMA info of host to guest.
> >>
> >>There three problems we've found:
> >>
> >>1. KVM doesn't support NUMA load balancer. Even there are no other
> >>workloads in the system, and the number of vcpus on the guest is smaller
> >>than the number of cpus per node, the vcpus could be scheduled on
> >>different nodes.
> >>
> >>Someone is working on in-kernel solution. Andrew Theurer has a working
> >>user-space NUMA aware VM balancer, it requires libvirt and cgroups
> >>(which is default for RHEL6 systems).
> >>
> >Interesting, and I found that "sched/numa: Introduce
> >sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
> >But I think from the guest view, it can not tell whether the two vcpus
> >are on the same host node. For example,
> >vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
> >expensive if it pull_task from vcpu-a and
> >choose vcpu-b to push.  And my idea is to export such info to guest,
> >still working on it.
> 
> The long term solution is to two-fold:
> 1) Guests that are quite large (in that they cannot fit in a host
> NUMA node) must have static mulit-node NUMA topology implemented by
> Qemu. That is here today, but we do not do it automatically, which
> is probably going to be a VM management responsibility.
> 2) Host scheduler and NUMA code must be enhanced to get better
> placement of Qemu memory and threads.  For single-node vNUMA guests,
> this is easy, put it all in one node.  For mulit-node vNUMA guests,
> the host must understand that some Qemu memory belongs with certain
> vCPU threads (which make up one of the guests vNUMA nodes), and then
> place that memory/threads in a specific host node (and continue for
> other memory/threads for each Qemu vNUMA node).

And for IO, we need multiqueue devices such that each
node can have its own queue in its local memory.
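Something like the following is the shape of it (just a sketch with
made-up names, not an existing driver API): the per-node queue state is
simply allocated from the memory of the node it serves.

#include <linux/slab.h>

/* sketch: keep each queue's state in the memory of the node it serves */
struct per_node_queue {
	void		*ring;
	unsigned int	len;
};

static struct per_node_queue *alloc_node_queue(int node, unsigned int len)
{
	struct per_node_queue *q;

	q = kzalloc_node(sizeof(*q), GFP_KERNEL, node);
	if (!q)
		return NULL;
	q->ring = kzalloc_node(len, GFP_KERNEL, node);
	if (!q->ring) {
		kfree(q);
		return NULL;
	}
	q->len = len;
	return q;
}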

-- 
MST

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-23 15:16       ` Michael S. Tsirkin
@ 2012-05-25  3:29         ` Liu ping fan
  0 siblings, 0 replies; 11+ messages in thread
From: Liu ping fan @ 2012-05-25  3:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Krishna Kumar, Andrew Theurer, Rusty Russell, Shirley Ma, kvm,
	netdev, Shirley Ma, qemu-devel, linux-kernel, Tom Lendacky,
	Ryan Harper, Avi Kivity, Anthony Liguori, Srivatsa Vaddagiri

On Wed, May 23, 2012 at 11:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, May 23, 2012 at 09:52:15AM -0500, Andrew Theurer wrote:
>> On 05/22/2012 04:28 AM, Liu ping fan wrote:
>> >On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
>> >>On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
>> >>>Currently, the guest can not know the NUMA info of the vcpu, which
>> >>>will
>> >>>result in performance drawback.
>> >>>
>> >>>This is the discovered and experiment by
>> >>>         Shirley Ma<xma@us.ibm.com>
>> >>>         Krishna Kumar<krkumar2@in.ibm.com>
>> >>>         Tom Lendacky<toml@us.ibm.com>
>> >>>Refer to -
>> >>>http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
>> >>>we can see the big perfermance gap between NUMA aware and unaware.
>> >>>
>> >>>Enlightened by their discovery, I think, we can do more work -- that
>> >>>is to
>> >>>export NUMA info of host to guest.
>> >>
>> >>There three problems we've found:
>> >>
>> >>1. KVM doesn't support NUMA load balancer. Even there are no other
>> >>workloads in the system, and the number of vcpus on the guest is smaller
>> >>than the number of cpus per node, the vcpus could be scheduled on
>> >>different nodes.
>> >>
>> >>Someone is working on in-kernel solution. Andrew Theurer has a working
>> >>user-space NUMA aware VM balancer, it requires libvirt and cgroups
>> >>(which is default for RHEL6 systems).
>> >>
>> >Interesting, and I found that "sched/numa: Introduce
>> >sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
>> >But I think from the guest view, it can not tell whether the two vcpus
>> >are on the same host node. For example,
>> >vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
>> >expensive if it pull_task from vcpu-a and
>> >choose vcpu-b to push.  And my idea is to export such info to guest,
>> >still working on it.
>>
>> The long term solution is to two-fold:
>> 1) Guests that are quite large (in that they cannot fit in a host
>> NUMA node) must have static mulit-node NUMA topology implemented by
>> Qemu. That is here today, but we do not do it automatically, which
>> is probably going to be a VM management responsibility.
>> 2) Host scheduler and NUMA code must be enhanced to get better
>> placement of Qemu memory and threads.  For single-node vNUMA guests,
>> this is easy, put it all in one node.  For mulit-node vNUMA guests,
>> the host must understand that some Qemu memory belongs with certain
>> vCPU threads (which make up one of the guests vNUMA nodes), and then
>> place that memory/threads in a specific host node (and continue for
>> other memory/threads for each Qemu vNUMA node).
>
> And for IO, we need multiqueue devices such that each
> node can have its own queue in its local memory.
>
Yes, my patches include such a solution. Independent device sub logic
units sit in different NUMA nodes; "subdev" in the patches stands for
such a logic unit, and each of them is backed by a vhost thread.  On
the other hand, for the virtio guest, the vqs (including vrings) are
allocated PAGE_SIZE-aligned, so their NUMA placement will be resolved
automatically by KVM (maybe a little more effort is needed here).
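For reference, the guest-side allocation only has to be page-aligned
and page-granular for this to work; a minimal sketch (not the exact
code in the patches):

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/virtio_ring.h>

/* sketch: give each vring its own page-aligned pages, so the host can
 * back different vrings with memory from different host nodes */
static void *alloc_vring_pages(unsigned int num)
{
	size_t size = PAGE_ALIGN(vring_size(num, PAGE_SIZE));

	return alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO);
}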

I had thought to export the real host NUMA info to the virtio layer
(not the scheduler, that is another topic), so we can create exactly
as many logic units as needed, and we can even increase/decrease the
number of logic units.

But what makes me hesitate to move on is: is it acceptable to create
an independent vhost thread for each node on the user's demand?
The scalability is then per-VM * demand_node_num.  Objections?


Thanks,
pingfan



> --
> MST

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
  2012-05-23 14:52     ` Andrew Theurer
  2012-05-23 15:16       ` Michael S. Tsirkin
@ 2012-05-25  4:05       ` Liu ping fan
  1 sibling, 0 replies; 11+ messages in thread
From: Liu ping fan @ 2012-05-25  4:05 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Shirley Ma, kvm, netdev, linux-kernel, qemu-devel, Avi Kivity,
	Michael S. Tsirkin, Srivatsa Vaddagiri, Rusty Russell,
	Anthony Liguori, Ryan Harper, Shirley Ma, Krishna Kumar,
	Tom Lendacky

On Wed, May 23, 2012 at 10:52 PM, Andrew Theurer
<habanero@linux.vnet.ibm.com> wrote:
> On 05/22/2012 04:28 AM, Liu ping fan wrote:
>>
>> On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@us.ibm.com>  wrote:
>>>
>>> On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
>>>>
>>>> Currently, the guest can not know the NUMA info of the vcpu, which
>>>> will
>>>> result in performance drawback.
>>>>
>>>> This is the discovered and experiment by
>>>>         Shirley Ma<xma@us.ibm.com>
>>>>         Krishna Kumar<krkumar2@in.ibm.com>
>>>>         Tom Lendacky<toml@us.ibm.com>
>>>> Refer to -
>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
>>>> we can see the big perfermance gap between NUMA aware and unaware.
>>>>
>>>> Enlightened by their discovery, I think, we can do more work -- that
>>>> is to
>>>> export NUMA info of host to guest.
>>>
>>>
>>> There three problems we've found:
>>>
>>> 1. KVM doesn't support NUMA load balancer. Even there are no other
>>> workloads in the system, and the number of vcpus on the guest is smaller
>>> than the number of cpus per node, the vcpus could be scheduled on
>>> different nodes.
>>>
>>> Someone is working on in-kernel solution. Andrew Theurer has a working
>>> user-space NUMA aware VM balancer, it requires libvirt and cgroups
>>> (which is default for RHEL6 systems).
>>>
>> Interesting, and I found that "sched/numa: Introduce
>> sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
>> But I think from the guest view, it can not tell whether the two vcpus
>> are on the same host node. For example,
>> vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
>> expensive if it pull_task from vcpu-a and
>> choose vcpu-b to push.  And my idea is to export such info to guest,
>> still working on it.
>
>
> The long term solution is to two-fold:
> 1) Guests that are quite large (in that they cannot fit in a host NUMA node)
> must have static mulit-node NUMA topology implemented by Qemu. That is here
> today, but we do not do it automatically, which is probably going to be a VM
> management responsibility.
> 2) Host scheduler and NUMA code must be enhanced to get better placement of
> Qemu memory and threads.  For single-node vNUMA guests, this is easy, put it
> all in one node.  For mulit-node vNUMA guests, the host must understand that
> some Qemu memory belongs with certain vCPU threads (which make up one of the
> guests vNUMA nodes), and then place that memory/threads in a specific host
> node (and continue for other memory/threads for each Qemu vNUMA node).
>
> Note that even if a guest's memory/threads for a vNUMA node are relocated to
> another host node (which will be necessary) the NUMA characteristics of
> guest are still maintained (as all those vCPUs and memory are still "close"
> to each other).
>
Yeah, I see Peter's work on tip/sched/numa
> The problem with exposing the host's NUMA info directly to the guest is that
> (1) vCPUs will get relocated, so their topology info in the guest will have
> to change over time. IMO that is a bad idea.  We have a hard enough time
> getting applications to work with a static NUMA info.  To get applications

I originally thought that vCPUs get relocated only on user demand, and
that this can happen on hotplug, so it would not happen frequently;
otherwise the user deserves the drawback.
But forget it, Peter has said no to dynamic NUMA.
> to react to changing NUMA topology is not going to turn out well. (2) Every
> single guest would have to have the same number of NUMA nodes defined as the
> host.  That is overkill, especially for small guests.
>
Thanks for your comment
pingfan
>>
>>
>>> 2. The host scheduler is not aware the relationship between guest vCPUs
>>> and vhost. So it's possible for host scheduler to schedule per-device
>>> vhost thread on the same cpu on which the vCPU kick a TX packet, or
>>> schecule vhost thread on different node than the vCPU for; For RX packet
>>> it's possible for vhost delivers RX packet on the vCPU running on
>>> different node too.
>>>
>> Yes. I notice this point in your original patch.
>>
>>> 3. per-device vhost thread is not scaled.
>>>
>> What about the scale-ability of per-vm * host_NUMA_NODE? When we make
>> advantage of multi-core,  we produce mulit vcpu threads for one VM.
>> So what about the emulated device? Is it acceptable to scale to take
>> advantage of host NUMA attr.  After all, how many nodes on which the
>> VM
>> can be run on are the user's control.  It is a balance of
>> scale-ability and performance.
>>
>>> So the problems are in host scheduling and vhost thread scalability. I
>>> am not sure how much help from exposing NUMA info from host to guest.
>>>
>>> Have you tested these patched? How much performance gain here?
>>>
>> Sorry, not yet.  As you have mentioned, the vhost thread scalability
>> is a big problem. So I want to see others' opinion before going on.
>>
>> Thanks and regards,
>> pingfan
>>
>>
>>> Thanks
>>> Shirley
>>>
>>>> So here comes the idea:
>>>> 1. export host numa info through guest's sched domain to its scheduler
>>>>   Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
>>>>   has been handled by host).  So the guest's lb will consider the
>>>> cost.
>>>>   I am still working on this, and my original idea is to export these
>>>> info
>>>>   through "static struct sched_domain_topology_level
>>>> *sched_domain_topology"
>>>>   to guest.
>>>>
>>>> 2. Do a better emulation of virt mach exported to guest.
>>>>   In real world, the devices are limited by kinds of reasons to own
>>>> the NUMA
>>>>   property. But as to Qemu, the device is emulated by thread, which
>>>> inherit
>>>>   the NUMA attr in nature.  We can implement the device as components
>>>> of many
>>>>   logic units, each of the unit is backed by a thread in different
>>>> host node.
>>>>   Currently, I want to start the work on vhost. But I think, maybe in
>>>>   future, the iothread in Qemu can also has such attr.
>>>>
>>>>
>>>> Forgive me, for the limited time, I can not have more better
>>>> understand of
>>>> vhost/virtio_net drivers. These patches are just draft, _FAR_, _FAR_
>>>> from work.
>>>> I will do more detail work for them in future.
>>>>
>>>> To easy the review, the following is the sum up of the 2nd point of
>>>> the idea.
>>>> As for the 1st point of the idea, it is not reflected in the patches.
>>>>
>>>> --spread/shrink the vhost_workers over the host nodes as demanded from
>>>> Qemu.
>>>>   And we can consider each vhost_worker as an independent net logic
>>>> device
>>>>   embeded in physical device "vhost_net".  At the meanwhile, we spread
>>>> vcpu
>>>>   threads over the host node.
>>>>   The vrings on guest are allocated PAGE_SIZE align separately, so
>>>> they can
>>>>   will only be mapped into different host node, so vhost_worker in the
>>>> same
>>>>   node can access it with the least cost. So does the vq on guest.
>>>>
>>>> --virtio_net driver will changes and talk with the logic device. And
>>>> which
>>>>   logic device it will talk to is determined by on which vcpu it is
>>>> scheduled.
>>>>
>>>> --the binding of vcpus and vhost_worker is implemented by:
>>>>   for call direction, vq-a in the node-A will have a dedicated irq-a.
>>>> And
>>>>   we set the irq-a's affinity to vcpus in node-A.
>>>>   for kick direction, kick register-b trigger different eventfd-b
>>>> which wake up
>>>>   vhost_worker-b.
>>>>
> -Andrew Theurer
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-05-25  4:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-17  9:20 [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Liu Ping Fan
2012-05-17  9:20 ` [PATCH 1/2] [kvm/vhost]: make vhost support NUMA model Liu Ping Fan
2012-05-17  9:20 ` [PATCH 2/2] [kvm/vhost-net]: make vhost net own NUMA attribute Liu Ping Fan
2012-05-17  9:20 ` [PATCH 1/2] [kvm/virtio]: make virtio support NUMA attr Liu Ping Fan
2012-05-17  9:20 ` [PATCH 2/2] [net/virtio_net]: make virtio_net support NUMA info Liu Ping Fan
2012-05-18 16:14 ` [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr Shirley Ma
2012-05-22  9:28   ` Liu ping fan
2012-05-23 14:52     ` Andrew Theurer
2012-05-23 15:16       ` Michael S. Tsirkin
2012-05-25  3:29         ` Liu ping fan
2012-05-25  4:05       ` Liu ping fan
