* [PATCH] vhost: Add polling mode
From: Razya Ladelsky @ 2014-08-10  8:30 UTC
  To: mst, kvm
  Cc: GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon, linux-kernel,
	netdev, virtualization

From: Razya Ladelsky <razya@il.ibm.com>
Date: Thu, 31 Jul 2014 09:47:20 +0300
Subject: [PATCH] vhost: Add polling mode

When vhost is waiting for buffers from the guest driver (e.g., more packets to
send in vhost-net's transmit queue), it normally goes to sleep and waits for the
guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
(and possibly userspace involvement in translating this PIO exit into a file
descriptor event), all of which hurts performance.
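To illustrate what "a file descriptor event" means here, the following is a
minimal userspace sketch using a Linux eventfd. It is not vhost code:
sim_kick() and sim_drain_kicks() are hypothetical names, and the sketch only
shows the counter semantics of the notification, not the guest exit that
precedes it.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Guest-side stand-in (hypothetical): signal the host by adding 1 to the
 * eventfd counter. Returns 0 on success, -1 on failure. */
static int sim_kick(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Host-side stand-in (hypothetical): fetch and reset the number of pending
 * kicks. Returns 0 if nothing is pending; this assumes the descriptor was
 * created with EFD_NONBLOCK, so an empty counter makes read() fail
 * instead of blocking. */
static uint64_t sim_drain_kicks(int efd)
{
	uint64_t n = 0;

	assert(efd >= 0);
	if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
		return 0;
	return n;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK), call
sim_kick() on the producer side, and call sim_drain_kicks() on the consumer
side. Several kicks that arrive before a drain are coalesced into one counter
value.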

If the system is under-utilized (has cpu time to spare), vhost can continuously
poll the virtqueues for new buffers, and avoid asking the guest to kick us.
This patch adds an optional polling mode to vhost, which can be enabled via a
kernel module parameter, "poll_start_rate" (the default of 0 disables polling).

When polling is active for a virtqueue, the guest is asked to disable
notification (kicks), and the worker thread continuously checks for new buffers.
When it does discover new buffers, it simulates a "kick" by invoking the
underlying backend driver (such as vhost-net), which thinks it got a real kick
from the guest, and acts accordingly. If the underlying driver asks not to be
kicked, we disable polling on this virtqueue.
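The polling step described above can be sketched in userspace as follows. This
is a simplified model under stated assumptions, not the patch's code: all
sim_* names are hypothetical, an array with a cursor stands in for the
kernel's linked list, and the "backend" simply consumes everything that is
available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a polled virtqueue. */
struct sim_vq {
	uint16_t avail_idx;      /* advanced by the "guest" when it adds buffers */
	uint16_t last_avail_idx; /* next entry the "host" will consume */
	unsigned int kicks;      /* number of simulated kicks delivered */
};

/* Hypothetical poller state: the queues being polled and a cursor. */
struct sim_poller {
	struct sim_vq *vqs;
	size_t nvqs;
	size_t next; /* round-robin cursor */
};

/* Check exactly one queue per call and advance the cursor.
 * Returns the queue if it has new work, otherwise NULL. */
static struct sim_vq *sim_poll_one(struct sim_poller *p)
{
	struct sim_vq *vq;

	assert(p != NULL);
	if (p->nvqs == 0)
		return NULL;
	vq = &p->vqs[p->next];
	p->next = (p->next + 1) % p->nvqs;
	return vq->avail_idx != vq->last_avail_idx ? vq : NULL;
}

/* Simulated backend kick handler: consume all available buffers. */
static void sim_handle_kick(struct sim_vq *vq)
{
	vq->last_avail_idx = vq->avail_idx;
	vq->kicks++;
}

/* One iteration of the worker loop: poll one queue and, if it has work,
 * invoke the handler as if the guest had kicked. */
static void sim_worker_step(struct sim_poller *p)
{
	struct sim_vq *vq = sim_poll_one(p);

	if (vq)
		sim_handle_kick(vq);
}
```

Checking a single queue per iteration keeps each loop pass short, so queued
work items and other polled queues are not starved by one busy queue.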

We start polling on a virtqueue when we notice it has work to do, i.e., when the
number of work items handled for it within one jiffy reaches "poll_start_rate".
Polling on this virtqueue is later disabled after 3 seconds of polling turning
up no new work, as in this case we are better off returning to the exit-based
notification mechanism. The default timeout of 3 seconds can be changed with the
"poll_stop_idle" kernel module parameter, which is given in jiffies.

This polling approach makes a lot of sense for new HW with posted interrupts,
for which we have exitless host-to-guest notifications. But even with support
for posted interrupts, guest-to-host communication still causes exits. Polling
adds the missing part.

When systems are overloaded, there won't be enough cpu time for the various
vhost threads to poll their guests' devices. For these scenarios, we plan to add
support for vhost threads that can be shared by multiple devices, even of
multiple vms.
Our ultimate goal is to implement the I/O acceleration features described in:
KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
https://www.youtube.com/watch?v=9EyweibHfEs
and
https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

I ran some experiments with TCP stream netperf and filebench (having 2 threads
performing random reads) benchmarks on an IBM System x3650 M4.
I have two machines, A and B. A hosts the vms, B runs the netserver.
The vms (on A) run netperf, and its destination server runs on B.
All runs loaded the guests so that they were cpu-saturated. For example,
I ran netperf with 64B messages, which loads the vm heavily (which is why
its throughput is low).
The idea was to get the vm 100% loaded, so that we can see that polling lets it
produce higher throughput.

The system had two cores per guest, so as to allow both the vcpu and the vhost
thread to run concurrently for maximum throughput (but I didn't pin the threads
to specific cores).
My experiments were fair in the sense that in both cases, with or without
polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that
way). The only difference was whether polling was enabled or disabled.

Results:

Netperf, 1 vm:
The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
Number of exits/sec decreased 6x.
The same improvement was shown when I tested with 3 vms running netperf
(4086 MB/sec -> 5545 MB/sec).

filebench, 1 vm:
ops/sec improved by 13% with the polling patch. Number of exits was reduced by
31%.
The same experiment with 3 vms running filebench showed similar numbers.
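
For reference, the relative improvement implied by the throughput figures
quoted above can be recomputed with a small helper. pct_change() is a
hypothetical name used only for this illustration, and the rounded
percentages in the text may come from more precise measurements than the
MB/sec values shown.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For the 1-vm netperf run the call would be pct_change(1516.0, 2046.0), and for
the 3-vm run pct_change(4086.0, 5545.0).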

Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
---
 drivers/vhost/net.c   |    6 +-
 drivers/vhost/scsi.c  |    6 +-
 drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   38 +++++++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..665eeeb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+	vhost_work_init(&vs->vs_completion_work, NULL,
+						vhost_scsi_complete_cmd_work);
+	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..fbe8174 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 
 #include "vhost.h"
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->vq = vq;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->dev = vq->dev;
 	poll->wqh = NULL;
-
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(&poll->work, vq, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
+ *
+ * Enabling this mode tells the guest not to notify ("kick") us when it
+ * has made more work available on this virtqueue; rather, we will continuously
+ * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
+ * the worker thread polls them all, e.g., in a round-robin fashion.
+ * Note that vqpoll.enabled doesn't always mean that this virtqueue is
+ * actually being polled: The backend (e.g., net.c) may temporarily disable it
+ * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
+ *
+ * It is assumed that these functions are called relatively rarely, when vhost
+ * notices that this virtqueue's usage pattern significantly changed in a way
+ * that makes polling more efficient than notification, or vice versa.
+ * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
+ * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
+ * reclaimed.
+ */
+static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (vq->vqpoll.enabled)
+		return; /* already enabled, nothing to do */
+	if (!vq->handle_kick)
+		return; /* polling will be a waste of time if no callback! */
+	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
+		/* vq has guest notifications enabled. Disable them,
+		   and instead add vq to the polling list */
+		vhost_disable_notify(vq->dev, vq);
+		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
+	}
+	vq->vqpoll.jiffies_last_kick = jiffies;
+	__get_user(vq->avail_idx, &vq->avail->idx);
+	vq->vqpoll.enabled = true;
+
+	/* Map userspace's vq->avail to the kernel's memory space. */
+	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
+		&vq->vqpoll.avail_page) != 1) {
+		/* TODO: can this happen, as we check access
+		to vq->avail in advance? */
+		BUG();
+	}
+	vq->vqpoll.avail_mapped = (struct vring_avail *) (
+		(unsigned long)kmap(vq->vqpoll.avail_page) |
+		((unsigned long)vq->avail & ~PAGE_MASK));
+}
+
+/*
+ * This function doesn't always succeed in changing the mode. Sometimes
+ * a temporary race condition prevents turning on guest notifications, so
+ * vq should be polled next time again.
+ */
+static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (!vq->vqpoll.enabled)
+		return; /* already disabled, nothing to do */
+
+	vq->vqpoll.enabled = false;
+
+	if (!list_empty(&vq->vqpoll.link)) {
+		/* vq is on the polling list, remove it from this list and
+		 * instead enable guest notifications. */
+		list_del_init(&vq->vqpoll.link);
+		if (unlikely(vhost_enable_notify(vq->dev, vq))
+			&& !vq->vqpoll.shutdown) {
+			/* Race condition: guest wrote before we enabled
+			 * notification, so we'll never get a notification for
+			 * this work - so continue polling mode for a while. */
+			vhost_disable_notify(vq->dev, vq);
+			vq->vqpoll.enabled = true;
+			vhost_enable_notify(vq->dev, vq);
+			return;
+		}
+	}
+
+	if (vq->vqpoll.avail_mapped) {
+		kunmap(vq->vqpoll.avail_page);
+		put_page(vq->vqpoll.avail_page);
+		vq->vqpoll.avail_mapped = NULL;
+	}
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
@@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	INIT_LIST_HEAD(&vq->vqpoll.link);
+	vq->vqpoll.enabled = false;
+	vq->vqpoll.shutdown = false;
+	vq->vqpoll.avail_mapped = NULL;
+}
+
+/* roundrobin_poll() takes dev->vqpoll_list, and returns one of the
+ * virtqueues which the caller should kick, or NULL in case none should be
+ * kicked. roundrobin_poll() also disables polling on a virtqueue which has
+ * been polled for too long without success.
+ *
+ * This current implementation (the "round-robin" implementation) only
+ * polls the first vq in the list, returning it or NULL as appropriate, and
+ * moves this vq to the end of the list, so next time a different one is
+ * polled.
+ */
+static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
+{
+	struct vhost_virtqueue *vq;
+	u16 avail_idx;
+
+	if (list_empty(list))
+		return NULL;
+
+	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
+	WARN_ON(!vq->vqpoll.enabled);
+	list_move_tail(&vq->vqpoll.link, list);
+
+	/* See if there is any new work available from the guest. */
+	/* TODO: can check the optional idx feature, and if we haven't
+	* reached that idx yet, don't kick... */
+	avail_idx = vq->vqpoll.avail_mapped->idx;
+	if (avail_idx != vq->last_avail_idx)
+		return vq;
+
+	if (time_after(jiffies, vq->vqpoll.jiffies_last_kick + poll_stop_idle)) {
+		/* We've been polling this virtqueue for a long time with no
+		* results, so switch back to guest notification
+		*/
+		vhost_vq_disable_vqpoll(vq);
+	}
+	return NULL;
 }
 
 static int vhost_worker(void *data)
@@ -237,12 +368,62 @@ static int vhost_worker(void *data)
 		spin_unlock_irq(&dev->work_lock);
 
 		if (work) {
+			struct vhost_virtqueue *vq = work->vq;
 			__set_current_state(TASK_RUNNING);
 			work->fn(work);
+			/* Keep track of the work rate, for deciding when to
+			 * enable polling */
+			if (vq) {
+				if (vq->vqpoll.jiffies_last_work != jiffies) {
+					vq->vqpoll.jiffies_last_work = jiffies;
+					vq->vqpoll.work_this_jiffy = 0;
+				}
+				vq->vqpoll.work_this_jiffy++;
+			}
+			/* If vq is in the round-robin list of virtqueues being
+			 * constantly checked by this thread, move vq to the end
+			 * of the queue, because it had its fair chance now.
+			 */
+			if (vq && !list_empty(&vq->vqpoll.link)) {
+				list_move_tail(&vq->vqpoll.link,
+					&dev->vqpoll_list);
+			}
+			/* Otherwise, if this vq is looking for notifications
+			 * but vq polling is not enabled for it, do it now.
+			 */
+			else if (poll_start_rate && vq && vq->handle_kick &&
+				!vq->vqpoll.enabled &&
+				!vq->vqpoll.shutdown &&
+				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
+				vq->vqpoll.work_this_jiffy >=
+					poll_start_rate) {
+				vhost_vq_enable_vqpoll(vq);
+			}
+		}
+		/* Check one virtqueue from the round-robin list */
+		if (!list_empty(&dev->vqpoll_list)) {
+			struct vhost_virtqueue *vq;
+
+			vq = roundrobin_poll(&dev->vqpoll_list);
+
+			if (vq) {
+				vq->handle_kick(&vq->poll.work);
+				vq->vqpoll.jiffies_last_kick = jiffies;
+			}
+
+			/* If our polling list isn't empty, ask to continue
+			 * running this thread, don't yield.
+			 */
+			__set_current_state(TASK_RUNNING);
 			if (need_resched())
 				schedule();
-		} else
-			schedule();
+		} else {
+			if (work) {
+				if (need_resched())
+					schedule();
+			} else
+				schedule();
+		}
 
 	}
 	unuse_mm(dev->mm);
@@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	INIT_LIST_HEAD(&dev->vqpoll_list);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					POLLIN, dev);
+					POLLIN, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
 	vhost_work_flush(dev, &attach.work);
 	return attach.ret;
@@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
+ * mode for a given virtqueue which is itself being shut down. We ask the
+ * worker thread to do this rather than doing it directly, so that we don't
+ * race with the worker thread's use of the queue.
+ */
+static void shutdown_vqpoll_work(struct vhost_work *work)
+{
+	work->vq->vqpoll.shutdown = true;
+	vhost_vq_disable_vqpoll(work->vq);
+	WARN_ON(work->vq->vqpoll.avail_mapped);
+}
+
+static void shutdown_vqpoll(struct vhost_virtqueue *vq)
+{
+	struct vhost_work work;
+
+	vhost_work_init(&work, vq, shutdown_vqpoll_work);
+	vhost_work_queue(vq->dev, &work);
+	vhost_work_flush(vq->dev, &work);
+}
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx);
 		if (dev->vqs[i]->call)
 			fput(dev->vqs[i]->call);
+		shutdown_vqpoll(dev->vqs[i]);
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
@@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	u16 avail_idx;
 	int r;
 
+	/* In polling mode, when the backend (e.g., net.c) asks to enable
+	 * notifications, we don't enable guest notifications. Instead, start
+	 * polling on this vq by adding it to the round-robin list.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (list_empty(&vq->vqpoll.link)) {
+			list_add_tail(&vq->vqpoll.link,
+				&vq->dev->vqpoll_list);
+			vq->vqpoll.jiffies_last_kick = jiffies;
+		}
+		return false;
+	}
+
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
@@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
+	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
+	 * will generate notifications even if the guest is asked not to send
+	 * them. So we must remove it from the round-robin polling list.
+	 * Note that vqpoll.enabled remains set.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (!list_empty(&vq->vqpoll.link))
+			list_del_init(&vq->vqpoll.link);
+		return;
+	}
+
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..11aaaf4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
 	int			  flushing;
 	unsigned		  queue_seq;
 	unsigned		  done_seq;
+	struct vhost_virtqueue    *vq;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -37,11 +38,12 @@ struct vhost_poll {
 	struct vhost_dev	 *dev;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue  *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -54,8 +56,6 @@ struct vhost_log {
 	u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -110,6 +110,35 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct {
+      /* When a virtqueue is in vqpoll.enabled mode, it declares
+       * that instead of using guest notifications (kicks) to
+       * discover new work, we prefer to continuously poll this
+       * virtqueue in the worker thread.
+       * If !enabled, the rest of the fields below are undefined.
+       */
+		bool enabled;
+      /* vqpoll.enabled doesn't always mean that this virtqueue is
+       * actually being polled: The backend (e.g., net.c) may
+       * temporarily disable it using vhost_disable/enable_notify().
+       * vqpoll.link is used to maintain the thread's round-robin
+       * list of virtqueues that actually need to be polled.
+       * Note list_empty(link) means this virtqueue isn't polled.
+       */
+		struct list_head link;
+      /* If this flag is true, the virtqueue is being shut down,
+       * so vqpoll should not be re-enabled.
+       */
+		bool shutdown;
+      /* Various counters used to decide when to enter polling mode
+       * or leave it and return to notification mode.
+       */
+		unsigned long jiffies_last_kick;
+		unsigned long jiffies_last_work;
+		int work_this_jiffy;
+		struct page *avail_page;
+		volatile struct vring_avail *avail_mapped;
+	} vqpoll;
 };
 
 struct vhost_dev {
@@ -123,6 +152,7 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct list_head vqpoll_list;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
-- 
1.7.9.5



* [PATCH] vhost: Add polling mode
       [not found] <1407659404-razya@il.ibm.com>
                   ` (2 preceding siblings ...)
  2014-08-10  8:30 ` Razya Ladelsky
@ 2014-08-10  8:30 ` Razya Ladelsky
  3 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-10  8:30 UTC (permalink / raw)
  To: mst, kvm
  Cc: GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon, linux-kernel,
	netdev, virtualization

From: Razya Ladelsky <razya@il.ibm.com>
Date: Thu, 31 Jul 2014 09:47:20 +0300
Subject: [PATCH] vhost: Add polling mode

When vhost is waiting for buffers from the guest driver (e.g., more packets to
send in vhost-net's transmit queue), it normally goes to sleep and waits for the
guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
(and possibly userspace involvement in translating this PIO exit into a file
descriptor event), all of which hurts performance.

If the system is under-utilized (has cpu time to spare), vhost can continuously
poll the virtqueues for new buffers, and avoid asking the guest to kick us.
This patch adds an optional polling mode to vhost, that can be enabled via a
kernel module parameter, "poll_start_rate".

When polling is active for a virtqueue, the guest is asked to disable
notification (kicks), and the worker thread continuously checks for new buffers.
When it does discover new buffers, it simulates a "kick" by invoking the
underlying backend driver (such as vhost-net), which thinks it got a real kick
from the guest, and acts accordingly. If the underlying driver asks not to be
kicked, we disable polling on this virtqueue.

We start polling on a virtqueue when we notice it has work to do. Polling on
this virtqueue is later disabled after 3 seconds of polling turning up no new
work, as in this case we are better off returning to the exit-based notification
mechanism. The default timeout of 3 seconds can be changed with the
"poll_stop_idle" kernel module parameter.

This polling approach makes lot of sense for new HW with posted-interrupts for
which we have exitless host-to-guest notifications. But even with support for
posted interrupts, guest-to-host communication still causes exits. Polling adds
the missing part.

When systems are overloaded, there won't be enough cpu time for the various
vhost threads to poll their guests' devices. For these scenarios, we plan to add
support for vhost threads that can be shared by multiple devices, even of
multiple vms.
Our ultimate goal is to implement the I/O acceleration features described in:
KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
https://www.youtube.com/watch?v=9EyweibHfEs
and
https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

I ran some experiments with TCP stream netperf and filebench (having 2 threads
performing random reads) benchmarks on an IBM System x3650 M4.
I have two machines, A and B. A hosts the vms, B runs the netserver.
The vms (on A) run netperf, its destination server is running on B.
All runs loaded the guests in a way that they were (cpu) saturated. For example,
I ran netperf with 64B messages, which is heavily loading the vm (which is why
its throughput is low).
The idea was to get it 100% loaded, so we can see that the polling is getting it
to produce higher throughput.

The system had two cores per guest, as to allow for both the vcpu and the vhost
thread to run concurrently for maximum throughput (but I didn't pin the threads
to specific cores).
My experiments were fair in a sense that for both cases, with or without
polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
way). The only difference was whether polling was enabled/disabled.

Results:

Netperf, 1 vm:
The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
Number of exits/sec decreased 6x.
The same improvement was shown when I tested with 3 vms running netperf
(4086 MB/sec -> 5545 MB/sec).

filebench, 1 vm:
ops/sec improved by 13% with the polling patch. Number of exits was reduced by
31%.
The same experiment with 3 vms running filebench showed similar numbers.

Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
---
 drivers/vhost/net.c   |    6 +-
 drivers/vhost/scsi.c  |    6 +-
 drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   38 +++++++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..665eeeb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+	vhost_work_init(&vs->vs_completion_work, NULL,
+						vhost_scsi_complete_cmd_work);
+	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..fbe8174 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 
 #include "vhost.h"
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->vq = vq;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->dev = vq->dev;
 	poll->wqh = NULL;
-
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(&poll->work, vq, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
+ *
+ * Enabling this mode it tells the guest not to notify ("kick") us when its
+ * has made more work available on this virtqueue; Rather, we will continuously
+ * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
+ * the worker thread polls them all, e.g., in a round-robin fashion.
+ * Note that vqpoll.enabled doesn't always mean that this virtqueue is
+ * actually being polled: The backend (e.g., net.c) may temporarily disable it
+ * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
+ *
+ * It is assumed that these functions are called relatively rarely, when vhost
+ * notices that this virtqueue's usage pattern significantly changed in a way
+ * that makes polling more efficient than notification, or vice versa.
+ * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
+ * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
+ * reclaimed.
+ */
+static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (vq->vqpoll.enabled)
+		return; /* already enabled, nothing to do */
+	if (!vq->handle_kick)
+		return; /* polling will be a waste of time if no callback! */
+	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
+		/* vq has guest notifications enabled. Disable them,
+		   and instead add vq to the polling list */
+		vhost_disable_notify(vq->dev, vq);
+		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
+	}
+	vq->vqpoll.jiffies_last_kick = jiffies;
+	__get_user(vq->avail_idx, &vq->avail->idx);
+	vq->vqpoll.enabled = true;
+
+	/* Map userspace's vq->avail to the kernel's memory space. */
+	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
+		&vq->vqpoll.avail_page) != 1) {
+		/* TODO: can this happen, as we check access
+		to vq->avail in advance? */
+		BUG();
+	}
+	vq->vqpoll.avail_mapped = (struct vring_avail *) (
+		(unsigned long)kmap(vq->vqpoll.avail_page) |
+		((unsigned long)vq->avail & ~PAGE_MASK));
+}
+
+/*
+ * This function doesn't always succeed in changing the mode. Sometimes
+ * a temporary race condition prevents turning on guest notifications, so
+ * vq should be polled next time again.
+ */
+static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (!vq->vqpoll.enabled)
+		return; /* already disabled, nothing to do */
+
+	vq->vqpoll.enabled = false;
+
+	if (!list_empty(&vq->vqpoll.link)) {
+		/* vq is on the polling list, remove it from this list and
+		 * instead enable guest notifications. */
+		list_del_init(&vq->vqpoll.link);
+		if (unlikely(vhost_enable_notify(vq->dev, vq))
+			&& !vq->vqpoll.shutdown) {
+			/* Race condition: guest wrote before we enabled
+			 * notification, so we'll never get a notification for
+			 * this work - so continue polling mode for a while. */
+			vhost_disable_notify(vq->dev, vq);
+			vq->vqpoll.enabled = true;
+			vhost_enable_notify(vq->dev, vq);
+			return;
+		}
+	}
+
+	if (vq->vqpoll.avail_mapped) {
+		kunmap(vq->vqpoll.avail_page);
+		put_page(vq->vqpoll.avail_page);
+		vq->vqpoll.avail_mapped = 0;
+	}
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
@@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	INIT_LIST_HEAD(&vq->vqpoll.link);
+	vq->vqpoll.enabled = false;
+	vq->vqpoll.shutdown = false;
+	vq->vqpoll.avail_mapped = NULL;
+}
+
+/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
+ * virtqueues which the caller should kick, or NULL in case none should be
+ * kicked. roundrobin_poll() also disables polling on a virtqueue which has
+ * been polled for too long without success.
+ *
+ * This current implementation (the "round-robin" implementation) only
+ * polls the first vq in the list, returning it or NULL as appropriate, and
+ * moves this vq to the end of the list, so next time a different one is
+ * polled.
+ */
+static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
+{
+	struct vhost_virtqueue *vq;
+	u16 avail_idx;
+
+	if (list_empty(list))
+		return NULL;
+
+	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
+	WARN_ON(!vq->vqpoll.enabled);
+	list_move_tail(&vq->vqpoll.link, list);
+
+	/* See if there is any new work available from the guest. */
+	/* TODO: can check the optional idx feature, and if we haven't
+	* reached that idx yet, don't kick... */
+	avail_idx = vq->vqpoll.avail_mapped->idx;
+	if (avail_idx != vq->last_avail_idx)
+		return vq;
+
+	if (time_after(jiffies, vq->vqpoll.jiffies_last_kick + poll_stop_idle)) {
+		/* We've been polling this virtqueue for a long time with no
+		 * results, so switch back to guest notification
+		 */
+		vhost_vq_disable_vqpoll(vq);
+	}
+	return NULL;
 }
 
 static int vhost_worker(void *data)
@@ -237,12 +368,62 @@ static int vhost_worker(void *data)
 		spin_unlock_irq(&dev->work_lock);
 
 		if (work) {
+			struct vhost_virtqueue *vq = work->vq;
 			__set_current_state(TASK_RUNNING);
 			work->fn(work);
+			/* Keep track of the work rate, for deciding when to
+			 * enable polling */
+			if (vq) {
+				if (vq->vqpoll.jiffies_last_work != jiffies) {
+					vq->vqpoll.jiffies_last_work = jiffies;
+					vq->vqpoll.work_this_jiffy = 0;
+				}
+				vq->vqpoll.work_this_jiffy++;
+			}
+			/* If vq is on the round-robin list of virtqueues being
+			 * constantly checked by this thread, move it to the end
+			 * of the list, because it has had its fair chance now.
+			 */
+			if (vq && !list_empty(&vq->vqpoll.link)) {
+				list_move_tail(&vq->vqpoll.link,
+					&dev->vqpoll_list);
+			}
+			/* Otherwise, if this vq is looking for notifications
+			 * but vq polling is not enabled for it, do it now.
+			 */
+			else if (poll_start_rate && vq && vq->handle_kick &&
+				!vq->vqpoll.enabled &&
+				!vq->vqpoll.shutdown &&
+				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
+				vq->vqpoll.work_this_jiffy >=
+					poll_start_rate) {
+				vhost_vq_enable_vqpoll(vq);
+			}
+		}
+		/* Check one virtqueue from the round-robin list */
+		if (!list_empty(&dev->vqpoll_list)) {
+			struct vhost_virtqueue *vq;
+
+			vq = roundrobin_poll(&dev->vqpoll_list);
+
+			if (vq) {
+				vq->handle_kick(&vq->poll.work);
+				vq->vqpoll.jiffies_last_kick = jiffies;
+			}
+
+			/* If our polling list isn't empty, ask to continue
+			 * running this thread, don't yield.
+			 */
+			__set_current_state(TASK_RUNNING);
 			if (need_resched())
 				schedule();
-		} else
-			schedule();
+		} else {
+			if (work) {
+				if (need_resched())
+					schedule();
+			} else
+				schedule();
+		}
 
 	}
 	unuse_mm(dev->mm);
@@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	INIT_LIST_HEAD(&dev->vqpoll_list);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					POLLIN, dev);
+					POLLIN, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
 	vhost_work_flush(dev, &attach.work);
 	return attach.ret;
@@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
+ * mode for a given virtqueue which is itself being shut down. We ask the
+ * worker thread to do this rather than doing it directly, so that we don't
+ * race with the worker thread's use of the queue.
+ */
+static void shutdown_vqpoll_work(struct vhost_work *work)
+{
+	work->vq->vqpoll.shutdown = true;
+	vhost_vq_disable_vqpoll(work->vq);
+	WARN_ON(work->vq->vqpoll.avail_mapped);
+}
+
+static void shutdown_vqpoll(struct vhost_virtqueue *vq)
+{
+	struct vhost_work work;
+
+	vhost_work_init(&work, vq, shutdown_vqpoll_work);
+	vhost_work_queue(vq->dev, &work);
+	vhost_work_flush(vq->dev, &work);
+}
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx);
 		if (dev->vqs[i]->call)
 			fput(dev->vqs[i]->call);
+		shutdown_vqpoll(dev->vqs[i]);
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
@@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	u16 avail_idx;
 	int r;
 
+	/* In polling mode, when the backend (e.g., net.c) asks to enable
+	 * notifications, we don't enable guest notifications. Instead, start
+	 * polling on this vq by adding it to the round-robin list.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (list_empty(&vq->vqpoll.link)) {
+			list_add_tail(&vq->vqpoll.link,
+				&vq->dev->vqpoll_list);
+			vq->vqpoll.jiffies_last_kick = jiffies;
+		}
+		return false;
+	}
+
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
@@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
+	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
+	 * will generate notifications even if the guest is asked not to send
+	 * them. So we must remove it from the round-robin polling list.
+	 * Note that vqpoll.enabled remains set.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (!list_empty(&vq->vqpoll.link))
+			list_del_init(&vq->vqpoll.link);
+		return;
+	}
+
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..11aaaf4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
 	int			  flushing;
 	unsigned		  queue_seq;
 	unsigned		  done_seq;
+	struct vhost_virtqueue    *vq;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -37,11 +38,12 @@ struct vhost_poll {
 	struct vhost_dev	 *dev;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue  *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -54,8 +56,6 @@ struct vhost_log {
 	u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -110,6 +110,35 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct {
+		/* When a virtqueue is in vqpoll.enabled mode, it declares
+		 * that instead of using guest notifications (kicks) to
+		 * discover new work, we prefer to continuously poll this
+		 * virtqueue in the worker thread.
+		 * If !enabled, the rest of the fields below are undefined.
+		 */
+		bool enabled;
+		/* vqpoll.enabled doesn't always mean that this virtqueue is
+		 * actually being polled: The backend (e.g., net.c) may
+		 * temporarily disable it using vhost_disable/enable_notify().
+		 * vqpoll.link is used to maintain the thread's round-robin
+		 * list of virtqueues that actually need to be polled.
+		 * Note list_empty(link) means this virtqueue isn't polled.
+		 */
+		struct list_head link;
+		/* If this flag is true, the virtqueue is being shut down,
+		 * so vqpoll should not be re-enabled.
+		 */
+		bool shutdown;
+		/* Various counters used to decide when to enter polling mode
+		 * or leave it and return to notification mode.
+		 */
+		unsigned long jiffies_last_kick;
+		unsigned long jiffies_last_work;
+		int work_this_jiffy;
+		struct page *avail_page;
+		volatile struct vring_avail *avail_mapped;
+	} vqpoll;
 };
 
 struct vhost_dev {
@@ -123,6 +152,7 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct list_head vqpoll_list;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
-- 
1.7.9.5

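A note on the round-robin policy used by roundrobin_poll() in the patch above:
only the virtqueue at the head of the polling list is examined, and it is then
moved to the tail so that the next call looks at a different one. The following
is a minimal userspace sketch of just that rotation, not the kernel code. The
patch itself uses list_move_tail() on a linked list; the fixed-size array and
the names poll_list, MAX_POLLED and poll_list_rotate() are hypothetical and
only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_POLLED 8	/* hypothetical capacity for this sketch */

/* Hypothetical fixed-size stand-in for the worker's polling list. */
struct poll_list {
	int ids[MAX_POLLED];	/* queue ids, head of the list at index 0 */
	size_t len;		/* number of valid entries in ids[] */
};

/*
 * Take the queue at the head of the list, move it to the tail, and
 * return its id, so that the next call examines a different queue.
 * Returns -1 when the list is empty.
 */
static int poll_list_rotate(struct poll_list *l)
{
	int head;

	assert(l && l->len <= MAX_POLLED);
	if (l->len == 0)
		return -1;

	head = l->ids[0];
	/* Shift the remaining entries one slot towards the head. */
	memmove(&l->ids[0], &l->ids[1], (l->len - 1) * sizeof(l->ids[0]));
	l->ids[l->len - 1] = head;
	return head;
}
```

With three queues on the list, successive calls return them in turn and then
start over, which is the fairness property the patch relies on.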

* [PATCH] vhost: Add polling mode
       [not found] <1407659404-razya@il.ibm.com>
@ 2014-08-10  8:30   ` Razya Ladelsky
  2014-08-10  8:30 ` Razya Ladelsky
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-10  8:30 UTC (permalink / raw)
  To: mst, kvm
  Cc: ERANRA, netdev, linux-kernel, GLIKSON, YOSSIKU, abel.gordon,
	JOELN, virtualization

From: Razya Ladelsky <razya@il.ibm.com>
Date: Thu, 31 Jul 2014 09:47:20 +0300
Subject: [PATCH] vhost: Add polling mode

When vhost is waiting for buffers from the guest driver (e.g., more packets to
send in vhost-net's transmit queue), it normally goes to sleep and waits for the
guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
(and possibly userspace involvement in translating this PIO exit into a file
descriptor event), all of which hurts performance.

If the system is under-utilized (has cpu time to spare), vhost can continuously
poll the virtqueues for new buffers, and avoid asking the guest to kick us.
This patch adds an optional polling mode to vhost, which is enabled by setting
the "poll_start_rate" kernel module parameter to a non-zero value (the default,
0, never starts polling).
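
The start condition can be sketched in userspace as follows. This is only an
illustration of the idea in the patch (count work items per jiffy and compare
against poll_start_rate), not the kernel code; the names rate_tracker and
rate_tracker_note_work() are hypothetical, and a plain tick counter stands in
for jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-virtqueue counters. */
struct rate_tracker {
	unsigned long last_work_tick;	/* tick in which work was last seen */
	int work_this_tick;		/* work items seen during that tick */
};

/*
 * Record one completed work item at time `now` (in ticks) and report
 * whether the per-tick rate has reached `start_rate`.
 * A start_rate of 0 means "never start polling".
 */
static bool rate_tracker_note_work(struct rate_tracker *t,
				   unsigned long now, int start_rate)
{
	assert(t);
	if (t->last_work_tick != now) {
		/* A new tick has begun: restart the count. */
		t->last_work_tick = now;
		t->work_this_tick = 0;
	}
	t->work_this_tick++;

	return start_rate > 0 && t->work_this_tick >= start_rate;
}
```

With a start rate of 3, the third work item within the same tick is the one
that would switch the queue to polling; a quieter queue never gets there.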

When polling is active for a virtqueue, the guest is asked to disable
notification (kicks), and the worker thread continuously checks for new buffers.
When it does discover new buffers, it simulates a "kick" by invoking the
underlying backend driver (such as vhost-net), which thinks it got a real kick
from the guest, and acts accordingly. If the underlying driver asks not to be
kicked, we disable polling on this virtqueue.
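
A single polling step, as described in the paragraph above, might look roughly
like this in userspace. It is a simplified sketch under stated assumptions: the
struct polled_vq, its fields, backend_handle_kick() and poll_once() are
hypothetical names for illustration, and the backend handler simply consumes
everything that is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a polled virtqueue. */
struct polled_vq {
	volatile uint16_t avail_idx;	/* written by the guest */
	uint16_t last_avail_idx;	/* last index the host consumed */
	bool backend_wants_kicks;	/* false: backend asked not to be kicked */
	int kicks_delivered;		/* number of simulated kicks so far */
};

/* Stand-in for the backend handler: consume everything available. */
static void backend_handle_kick(struct polled_vq *vq)
{
	vq->last_avail_idx = vq->avail_idx;
	vq->kicks_delivered++;
}

/*
 * One polling step: if the guest published new buffers and the backend
 * currently wants kicks, call the handler as if a real kick had arrived.
 * Returns true when a simulated kick was delivered.
 */
static bool poll_once(struct polled_vq *vq)
{
	assert(vq);
	if (!vq->backend_wants_kicks)
		return false;	/* backend disabled notifications: skip */
	if (vq->avail_idx == vq->last_avail_idx)
		return false;	/* nothing new from the guest */
	backend_handle_kick(vq);
	return true;
}
```

The comparison is on 16-bit indices and only tests for inequality, so it keeps
working when the guest's index wraps around.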

We start polling on a virtqueue when we notice it has work to do, i.e. when it
sees at least poll_start_rate events within one jiffy. Polling on this virtqueue
is disabled again after 3 seconds of polling turn up no new work, since in that
case we are better off returning to the exit-based notification mechanism. The
default timeout of 3 seconds can be changed with the "poll_stop_idle" kernel
module parameter (given in jiffies).
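
The idle timeout amounts to comparing two values of a free-running tick
counter, which needs some care once the counter wraps around. Below is a
minimal userspace sketch of a wraparound-safe check, written in the spirit of
the kernel's time_after() macro. The helper names tick_after() and
poll_idle_expired() are hypothetical, and the signed conversion assumes a
two's-complement machine, as the kernel macro does.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Wraparound-safe "a is after b" for free-running tick counters.
 * Relies on two's-complement conversion of the unsigned difference.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Should polling stop? `last_kick` is the tick of the last poll that
 * found work, `stop_idle` the allowed idle time in ticks.
 */
static bool poll_idle_expired(unsigned long now, unsigned long last_kick,
			      unsigned long stop_idle)
{
	/* The comparison is only meaningful for intervals below half range. */
	assert(stop_idle <= ULONG_MAX / 2);
	return tick_after(now, last_kick + stop_idle);
}
```

A plain "now > last_kick + stop_idle" gives the wrong answer when the sum
wraps past the maximum counter value; the signed-difference form does not.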

This polling approach makes a lot of sense for new HW with posted interrupts,
for which we have exitless host-to-guest notifications. But even with support
for posted interrupts, guest-to-host communication still causes exits. Polling
adds the missing part.

When systems are overloaded, there won't be enough cpu time for the various
vhost threads to poll their guests' devices. For these scenarios, we plan to add
support for vhost threads that can be shared by multiple devices, even of
multiple vms.
Our ultimate goal is to implement the I/O acceleration features described in:
KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
https://www.youtube.com/watch?v=9EyweibHfEs
and
https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

I ran some experiments with TCP stream netperf and filebench (having 2 threads
performing random reads) benchmarks on an IBM System x3650 M4.
I have two machines, A and B. A hosts the vms, B runs the netserver.
The vms (on A) run netperf, and its destination server runs on B.
All runs loaded the guests so that they were CPU-saturated. For example,
I ran netperf with 64B messages, which loads the vm heavily (which is why
its throughput is low).
The idea was to get the guest 100% loaded, so that we can see whether polling
lets it produce higher throughput.

The system had two cores per guest, so as to allow both the vcpu and the vhost
thread to run concurrently for maximum throughput (but I didn't pin the threads
to specific cores).
My experiments were fair in the sense that in both cases, with or without
polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that
way). The only difference was whether polling was enabled or disabled.

Results:

Netperf, 1 vm:
The polling patch improved throughput by ~35% (1516 MB/sec -> 2046 MB/sec).
Number of exits/sec decreased 6x.
The same improvement was shown when I tested with 3 vms running netperf
(4086 MB/sec -> 5545 MB/sec).

filebench, 1 vm:
ops/sec improved by 13% with the polling patch. Number of exits was reduced by
31%.
The same experiment with 3 vms running filebench showed similar numbers.

Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
---
 drivers/vhost/net.c   |    6 +-
 drivers/vhost/scsi.c  |    6 +-
 drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   38 +++++++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..665eeeb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+	vhost_work_init(&vs->vs_completion_work, NULL,
+						vhost_scsi_complete_cmd_work);
+	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..fbe8174 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 
 #include "vhost.h"
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->vq = vq;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->dev = vq->dev;
 	poll->wqh = NULL;
-
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(&poll->work, vq, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
+ *
+ * Enabling this mode tells the guest not to notify ("kick") us when it
+ * has made more work available on this virtqueue; rather, we will continuously
+ * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
+ * the worker thread polls them all, e.g., in a round-robin fashion.
+ * Note that vqpoll.enabled doesn't always mean that this virtqueue is
+ * actually being polled: The backend (e.g., net.c) may temporarily disable it
+ * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
+ *
+ * It is assumed that these functions are called relatively rarely, when vhost
+ * notices that this virtqueue's usage pattern significantly changed in a way
+ * that makes polling more efficient than notification, or vice versa.
+ * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
+ * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
+ * reclaimed.
+ */
+static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (vq->vqpoll.enabled)
+		return; /* already enabled, nothing to do */
+	if (!vq->handle_kick)
+		return; /* polling will be a waste of time if no callback! */
+	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
+		/* vq has guest notifications enabled. Disable them,
+		   and instead add vq to the polling list */
+		vhost_disable_notify(vq->dev, vq);
+		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
+	}
+	vq->vqpoll.jiffies_last_kick = jiffies;
+	__get_user(vq->avail_idx, &vq->avail->idx);
+	vq->vqpoll.enabled = true;
+
+	/* Map userspace's vq->avail to the kernel's memory space. */
+	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
+		&vq->vqpoll.avail_page) != 1) {
+		/* TODO: can this happen, as we check access
+		to vq->avail in advance? */
+		BUG();
+	}
+	vq->vqpoll.avail_mapped = (struct vring_avail *) (
+		(unsigned long)kmap(vq->vqpoll.avail_page) |
+		((unsigned long)vq->avail & ~PAGE_MASK));
+}
+
+/*
+ * This function doesn't always succeed in changing the mode. Sometimes
+ * a temporary race condition prevents turning on guest notifications, so
+ * vq should be polled next time again.
+ */
+static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (!vq->vqpoll.enabled)
+		return; /* already disabled, nothing to do */
+
+	vq->vqpoll.enabled = false;
+
+	if (!list_empty(&vq->vqpoll.link)) {
+		/* vq is on the polling list, remove it from this list and
+		 * instead enable guest notifications. */
+		list_del_init(&vq->vqpoll.link);
+		if (unlikely(vhost_enable_notify(vq->dev, vq))
+			&& !vq->vqpoll.shutdown) {
+			/* Race condition: guest wrote before we enabled
+			 * notification, so we'll never get a notification for
+			 * this work - so continue polling mode for a while. */
+			vhost_disable_notify(vq->dev, vq);
+			vq->vqpoll.enabled = true;
+			vhost_enable_notify(vq->dev, vq);
+			return;
+		}
+	}
+
+	if (vq->vqpoll.avail_mapped) {
+		kunmap(vq->vqpoll.avail_page);
+		put_page(vq->vqpoll.avail_page);
+		vq->vqpoll.avail_mapped = 0;
+	}
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
@@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	INIT_LIST_HEAD(&vq->vqpoll.link);
+	vq->vqpoll.enabled = false;
+	vq->vqpoll.shutdown = false;
+	vq->vqpoll.avail_mapped = NULL;
+}
+
+/* roundrobin_poll() takes dev->vqpoll_list, and returns one of the
+ * virtqueues which the caller should kick, or NULL in case none should be
+ * kicked. roundrobin_poll() also disables polling on a virtqueue which has
+ * been polled for too long without success.
+ *
+ * This current implementation (the "round-robin" implementation) only
+ * polls the first vq in the list, returning it or NULL as appropriate, and
+ * moves this vq to the end of the list, so next time a different one is
+ * polled.
+ */
+static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
+{
+	struct vhost_virtqueue *vq;
+	u16 avail_idx;
+
+	if (list_empty(list))
+		return NULL;
+
+	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
+	WARN_ON(!vq->vqpoll.enabled);
+	list_move_tail(&vq->vqpoll.link, list);
+
+	/* See if there is any new work available from the guest. */
+	/* TODO: can check the optional idx feature, and if we haven't
+	* reached that idx yet, don't kick... */
+	avail_idx = vq->vqpoll.avail_mapped->idx;
+	if (avail_idx != vq->last_avail_idx)
+		return vq;
+
+	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
+		/* We've been polling this virtqueue for a long time with no
+		* results, so switch back to guest notification
+		*/
+		vhost_vq_disable_vqpoll(vq);
+	}
+	return NULL;
 }
 
 static int vhost_worker(void *data)
@@ -237,12 +368,62 @@ static int vhost_worker(void *data)
 		spin_unlock_irq(&dev->work_lock);
 
 		if (work) {
+			struct vhost_virtqueue *vq = work->vq;
 			__set_current_state(TASK_RUNNING);
 			work->fn(work);
+			/* Keep track of the work rate, for deciding when to
+			 * enable polling */
+			if (vq) {
+				if (vq->vqpoll.jiffies_last_work != jiffies) {
+					vq->vqpoll.jiffies_last_work = jiffies;
+					vq->vqpoll.work_this_jiffy = 0;
+				}
+				vq->vqpoll.work_this_jiffy++;
+			}
+			/* If vq is in the round-robin list of virtqueues being
+			 * constantly checked by this thread, move vq to the end
+			 * of the queue, because it had its fair chance now.
+			 */
+			if (vq && !list_empty(&vq->vqpoll.link)) {
+				list_move_tail(&vq->vqpoll.link,
+					&dev->vqpoll_list);
+			}
+			/* Otherwise, if this vq is looking for notifications
+			 * but vq polling is not enabled for it, do it now.
+			 */
+			else if (poll_start_rate && vq && vq->handle_kick &&
+				!vq->vqpoll.enabled &&
+				!vq->vqpoll.shutdown &&
+				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
+				vq->vqpoll.work_this_jiffy >=
+					poll_start_rate) {
+				vhost_vq_enable_vqpoll(vq);
+			}
+		}
+		/* Check one virtqueue from the round-robin list */
+		if (!list_empty(&dev->vqpoll_list)) {
+			struct vhost_virtqueue *vq;
+
+			vq = roundrobin_poll(&dev->vqpoll_list);
+
+			if (vq) {
+				vq->handle_kick(&vq->poll.work);
+				vq->vqpoll.jiffies_last_kick = jiffies;
+			}
+
+			/* If our polling list isn't empty, ask to continue
+			 * running this thread, don't yield.
+			 */
+			__set_current_state(TASK_RUNNING);
 			if (need_resched())
 				schedule();
-		} else
-			schedule();
+		} else {
+			if (work) {
+				if (need_resched())
+					schedule();
+			} else
+				schedule();
+		}
 
 	}
 	unuse_mm(dev->mm);
@@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	INIT_LIST_HEAD(&dev->vqpoll_list);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					POLLIN, dev);
+					POLLIN, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
 	vhost_work_flush(dev, &attach.work);
 	return attach.ret;
@@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
+ * mode for a given virtqueue which is itself being shut down. We ask the
+ * worker thread to do this rather than doing it directly, so that we don't
+ * race with the worker thread's use of the queue.
+ */
+static void shutdown_vqpoll_work(struct vhost_work *work)
+{
+	work->vq->vqpoll.shutdown = true;
+	vhost_vq_disable_vqpoll(work->vq);
+	WARN_ON(work->vq->vqpoll.avail_mapped);
+}
+
+static void shutdown_vqpoll(struct vhost_virtqueue *vq)
+{
+	struct vhost_work work;
+
+	vhost_work_init(&work, vq, shutdown_vqpoll_work);
+	vhost_work_queue(vq->dev, &work);
+	vhost_work_flush(vq->dev, &work);
+}
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx);
 		if (dev->vqs[i]->call)
 			fput(dev->vqs[i]->call);
+		shutdown_vqpoll(dev->vqs[i]);
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
@@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	u16 avail_idx;
 	int r;
 
+	/* In polling mode, when the backend (e.g., net.c) asks to enable
+	 * notifications, we don't enable guest notifications. Instead, start
+	 * polling on this vq by adding it to the round-robin list.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (list_empty(&vq->vqpoll.link)) {
+			list_add_tail(&vq->vqpoll.link,
+				&vq->dev->vqpoll_list);
+			vq->vqpoll.jiffies_last_kick = jiffies;
+		}
+		return false;
+	}
+
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
@@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
+	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
+	 * will generate notifications even if the guest is asked not to send
+	 * them. So we must remove it from the round-robin polling list.
+	 * Note that vqpoll.enabled remains set.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (!list_empty(&vq->vqpoll.link))
+			list_del_init(&vq->vqpoll.link);
+		return;
+	}
+
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..11aaaf4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
 	int			  flushing;
 	unsigned		  queue_seq;
 	unsigned		  done_seq;
+	struct vhost_virtqueue    *vq;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -37,11 +38,12 @@ struct vhost_poll {
 	struct vhost_dev	 *dev;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue  *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -54,8 +56,6 @@ struct vhost_log {
 	u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -110,6 +110,35 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct {
+      /* When a virtqueue is in vqpoll.enabled mode, it declares
+       * that instead of using guest notifications (kicks) to
+       * discover new work, we prefer to continuously poll this
+       * virtqueue in the worker thread.
+       * If !enabled, the rest of the fields below are undefined.
+       */
+		bool enabled;
+      /* vqpoll.enabled doesn't always mean that this virtqueue is
+       * actually being polled: The backend (e.g., net.c) may
+       * temporarily disable it using vhost_disable/enable_notify().
+       * vqpoll.link is used to maintain the thread's round-robin
+       * list of virtqueues that actually need to be polled.
+       * Note list_empty(link) means this virtqueue isn't polled.
+       */
+		struct list_head link;
+      /* If this flag is true, the virtqueue is being shut down,
+       * so vqpoll should not be re-enabled.
+       */
+		bool shutdown;
+      /* Various counters used to decide when to enter polling mode
+       * or leave it and return to notification mode.
+       */
+		unsigned long jiffies_last_kick;
+		unsigned long jiffies_last_work;
+		int work_this_jiffy;
+		struct page *avail_page;
+		volatile struct vring_avail *avail_mapped;
+	} vqpoll;
 };
 
 struct vhost_dev {
@@ -123,6 +152,7 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct list_head vqpoll_list;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
-- 
1.7.9.5


* [PATCH] vhost: Add polling mode
@ 2014-08-10  8:30   ` Razya Ladelsky
  0 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-10  8:30 UTC (permalink / raw)
  To: mst, kvm
  Cc: ERANRA, netdev, linux-kernel, GLIKSON, YOSSIKU, abel.gordon,
	JOELN, virtualization

From: Razya Ladelsky <razya@il.ibm.com>
Date: Thu, 31 Jul 2014 09:47:20 +0300
Subject: [PATCH] vhost: Add polling mode

When vhost is waiting for buffers from the guest driver (e.g., more packets to
send in vhost-net's transmit queue), it normally goes to sleep and waits for the
guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
(and possibly userspace involvement in translating this PIO exit into a file
descriptor event), all of which hurts performance.

If the system is under-utilized (has CPU time to spare), vhost can continuously
poll the virtqueues for new buffers, and avoid asking the guest to kick us.
This patch adds an optional polling mode to vhost, which is enabled by setting
the kernel module parameter "poll_start_rate" to a non-zero value (the default,
0, means polling is never started).

When polling is active for a virtqueue, the guest is asked to disable
notification (kicks), and the worker thread continuously checks for new buffers.
When it does discover new buffers, it simulates a "kick" by invoking the
underlying backend driver (such as vhost-net), which thinks it got a real kick
from the guest, and acts accordingly. If the underlying driver asks not to be
kicked, we disable polling on this virtqueue.

We start polling on a virtqueue when we notice it has work to do, that is, when
the number of work items handled for it within one jiffy reaches
"poll_start_rate". Polling on this virtqueue is later disabled after 3 seconds
of polling turning up no new work, as in this case we are better off returning
to the exit-based notification mechanism. The default timeout of 3 seconds can
be changed with the "poll_stop_idle" kernel module parameter, which is given in
jiffies.

This polling approach makes a lot of sense for new hardware with posted
interrupts, for which we have exitless host-to-guest notifications. But even
with support for posted interrupts, guest-to-host communication still causes
exits. Polling adds the missing part.

When systems are overloaded, there won't be enough cpu time for the various
vhost threads to poll their guests' devices. For these scenarios, we plan to add
support for vhost threads that can be shared by multiple devices, even of
multiple vms.
Our ultimate goal is to implement the I/O acceleration features described in:
KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
https://www.youtube.com/watch?v=9EyweibHfEs
and
https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

I ran some experiments with TCP stream netperf and filebench (having 2 threads
performing random reads) benchmarks on an IBM System x3650 M4.
I have two machines, A and B. A hosts the vms, B runs the netserver.
The vms (on A) run netperf, and its destination server runs on B.
All runs loaded the guests so that they were CPU-saturated. For example,
I ran netperf with 64B messages, which loads the vm heavily (which is why
its throughput is low).
The idea was to get the guest 100% loaded, so that we can see whether polling
lets it produce higher throughput.

The system had two cores per guest, so as to allow both the vcpu and the vhost
thread to run concurrently for maximum throughput (but I didn't pin the threads
to specific cores).
My experiments were fair in the sense that in both cases, with or without
polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that
way). The only difference was whether polling was enabled or disabled.

Results:

Netperf, 1 vm:
The polling patch improved throughput by ~35% (1516 MB/sec -> 2046 MB/sec).
Number of exits/sec decreased 6x.
The same improvement was shown when I tested with 3 vms running netperf
(4086 MB/sec -> 5545 MB/sec).

filebench, 1 vm:
ops/sec improved by 13% with the polling patch. Number of exits was reduced by
31%.
The same experiment with 3 vms running filebench showed similar numbers.

Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
---
 drivers/vhost/net.c   |    6 +-
 drivers/vhost/scsi.c  |    6 +-
 drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   38 +++++++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+			vqs[VHOST_NET_VQ_TX]);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+			vqs[VHOST_NET_VQ_RX]);
 
 	f->private_data = n;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..665eeeb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	if (!vqs)
 		goto err_vqs;
 
-	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+	vhost_work_init(&vs->vs_completion_work, NULL,
+						vhost_scsi_complete_cmd_work);
+	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..fbe8174 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 
 #include "vhost.h"
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 
 enum {
 	VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 	return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn)
 {
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
 	init_waitqueue_head(&work->done);
 	work->flushing = 0;
 	work->queue_seq = work->done_seq = 0;
+	work->vq = vq;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->dev = vq->dev;
 	poll->wqh = NULL;
-
-	vhost_work_init(&poll->work, fn);
+	vhost_work_init(&poll->work, vq, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
+ *
+ * Enabling this mode tells the guest not to notify ("kick") us when it
+ * has made more work available on this virtqueue; rather, we will continuously
+ * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
+ * the worker thread polls them all, e.g., in a round-robin fashion.
+ * Note that vqpoll.enabled doesn't always mean that this virtqueue is
+ * actually being polled: The backend (e.g., net.c) may temporarily disable it
+ * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
+ *
+ * It is assumed that these functions are called relatively rarely, when vhost
+ * notices that this virtqueue's usage pattern significantly changed in a way
+ * that makes polling more efficient than notification, or vice versa.
+ * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
+ * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
+ * reclaimed.
+ */
+static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (vq->vqpoll.enabled)
+		return; /* already enabled, nothing to do */
+	if (!vq->handle_kick)
+		return; /* polling will be a waste of time if no callback! */
+	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
+		/* vq has guest notifications enabled. Disable them,
+		   and instead add vq to the polling list */
+		vhost_disable_notify(vq->dev, vq);
+		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
+	}
+	vq->vqpoll.jiffies_last_kick = jiffies;
+	__get_user(vq->avail_idx, &vq->avail->idx);
+	vq->vqpoll.enabled = true;
+
+	/* Map userspace's vq->avail to the kernel's memory space. */
+	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
+		&vq->vqpoll.avail_page) != 1) {
+		/* TODO: can this happen, as we check access
+		to vq->avail in advance? */
+		BUG();
+	}
+	vq->vqpoll.avail_mapped = (struct vring_avail *) (
+		(unsigned long)kmap(vq->vqpoll.avail_page) |
+		((unsigned long)vq->avail & ~PAGE_MASK));
+}
+
+/*
+ * This function doesn't always succeed in changing the mode. Sometimes
+ * a temporary race condition prevents turning on guest notifications, so
+ * vq should be polled next time again.
+ */
+static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
+{
+	if (!vq->vqpoll.enabled)
+		return; /* already disabled, nothing to do */
+
+	vq->vqpoll.enabled = false;
+
+	if (!list_empty(&vq->vqpoll.link)) {
+		/* vq is on the polling list, remove it from this list and
+		 * instead enable guest notifications. */
+		list_del_init(&vq->vqpoll.link);
+		if (unlikely(vhost_enable_notify(vq->dev, vq))
+			&& !vq->vqpoll.shutdown) {
+			/* Race condition: guest wrote before we enabled
+			 * notification, so we'll never get a notification for
+			 * this work - so continue polling mode for a while. */
+			vhost_disable_notify(vq->dev, vq);
+			vq->vqpoll.enabled = true;
+			vhost_enable_notify(vq->dev, vq);
+			return;
+		}
+	}
+
+	if (vq->vqpoll.avail_mapped) {
+		kunmap(vq->vqpoll.avail_page);
+		put_page(vq->vqpoll.avail_page);
+		vq->vqpoll.avail_mapped = 0;
+	}
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
 			   struct vhost_virtqueue *vq)
 {
@@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call = NULL;
 	vq->log_ctx = NULL;
 	vq->memory = NULL;
+	INIT_LIST_HEAD(&vq->vqpoll.link);
+	vq->vqpoll.enabled = false;
+	vq->vqpoll.shutdown = false;
+	vq->vqpoll.avail_mapped = NULL;
+}
+
+/* roundrobin_poll() takes dev->vqpoll_list, and returns one of the
+ * virtqueues which the caller should kick, or NULL in case none should be
+ * kicked. roundrobin_poll() also disables polling on a virtqueue which has
+ * been polled for too long without success.
+ *
+ * This current implementation (the "round-robin" implementation) only
+ * polls the first vq in the list, returning it or NULL as appropriate, and
+ * moves this vq to the end of the list, so next time a different one is
+ * polled.
+ */
+static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
+{
+	struct vhost_virtqueue *vq;
+	u16 avail_idx;
+
+	if (list_empty(list))
+		return NULL;
+
+	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
+	WARN_ON(!vq->vqpoll.enabled);
+	list_move_tail(&vq->vqpoll.link, list);
+
+	/* See if there is any new work available from the guest. */
+	/* TODO: can check the optional idx feature, and if we haven't
+	* reached that idx yet, don't kick... */
+	avail_idx = vq->vqpoll.avail_mapped->idx;
+	if (avail_idx != vq->last_avail_idx)
+		return vq;
+
+	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
+		/* We've been polling this virtqueue for a long time with no
+		* results, so switch back to guest notification
+		*/
+		vhost_vq_disable_vqpoll(vq);
+	}
+	return NULL;
 }
 
 static int vhost_worker(void *data)
@@ -237,12 +368,62 @@ static int vhost_worker(void *data)
 		spin_unlock_irq(&dev->work_lock);
 
 		if (work) {
+			struct vhost_virtqueue *vq = work->vq;
 			__set_current_state(TASK_RUNNING);
 			work->fn(work);
+			/* Keep track of the work rate, for deciding when to
+			 * enable polling */
+			if (vq) {
+				if (vq->vqpoll.jiffies_last_work != jiffies) {
+					vq->vqpoll.jiffies_last_work = jiffies;
+					vq->vqpoll.work_this_jiffy = 0;
+				}
+				vq->vqpoll.work_this_jiffy++;
+			}
+			/* If vq is in the round-robin list of virtqueues being
+			 * constantly checked by this thread, move vq to the end
+			 * of the queue, because it had its fair chance now.
+			 */
+			if (vq && !list_empty(&vq->vqpoll.link)) {
+				list_move_tail(&vq->vqpoll.link,
+					&dev->vqpoll_list);
+			}
+			/* Otherwise, if this vq is looking for notifications
+			 * but vq polling is not enabled for it, do it now.
+			 */
+			else if (poll_start_rate && vq && vq->handle_kick &&
+				!vq->vqpoll.enabled &&
+				!vq->vqpoll.shutdown &&
+				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
+				vq->vqpoll.work_this_jiffy >=
+					poll_start_rate) {
+				vhost_vq_enable_vqpoll(vq);
+			}
+		}
+		/* Check one virtqueue from the round-robin list */
+		if (!list_empty(&dev->vqpoll_list)) {
+			struct vhost_virtqueue *vq;
+
+			vq = roundrobin_poll(&dev->vqpoll_list);
+
+			if (vq) {
+				vq->handle_kick(&vq->poll.work);
+				vq->vqpoll.jiffies_last_kick = jiffies;
+			}
+
+			/* If our polling list isn't empty, ask to continue
+			 * running this thread, don't yield.
+			 */
+			__set_current_state(TASK_RUNNING);
 			if (need_resched())
 				schedule();
-		} else
-			schedule();
+		} else {
+			if (work) {
+				if (need_resched())
+					schedule();
+			} else
+				schedule();
+		}
 
 	}
 	unuse_mm(dev->mm);
@@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
+	INIT_LIST_HEAD(&dev->vqpoll_list);
 	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
@@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vhost_vq_reset(dev, vq);
 		if (vq->handle_kick)
 			vhost_poll_init(&vq->poll, vq->handle_kick,
-					POLLIN, dev);
+					POLLIN, vq);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 	struct vhost_attach_cgroups_struct attach;
 
 	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
 	vhost_work_queue(dev, &attach.work);
 	vhost_work_flush(dev, &attach.work);
 	return attach.ret;
@@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
+ * mode for a given virtqueue which is itself being shut down. We ask the
+ * worker thread to do this rather than doing it directly, so that we don't
+ * race with the worker thread's use of the queue.
+ */
+static void shutdown_vqpoll_work(struct vhost_work *work)
+{
+	work->vq->vqpoll.shutdown = true;
+	vhost_vq_disable_vqpoll(work->vq);
+	WARN_ON(work->vq->vqpoll.avail_mapped);
+}
+
+static void shutdown_vqpoll(struct vhost_virtqueue *vq)
+{
+	struct vhost_work work;
+
+	vhost_work_init(&work, vq, shutdown_vqpoll_work);
+	vhost_work_queue(vq->dev, &work);
+	vhost_work_flush(vq->dev, &work);
+}
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 			eventfd_ctx_put(dev->vqs[i]->call_ctx);
 		if (dev->vqs[i]->call)
 			fput(dev->vqs[i]->call);
+		shutdown_vqpoll(dev->vqs[i]);
 		vhost_vq_reset(dev, dev->vqs[i]);
 	}
 	vhost_dev_free_iovecs(dev);
@@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	u16 avail_idx;
 	int r;
 
+	/* In polling mode, when the backend (e.g., net.c) asks to enable
+	 * notifications, we don't enable guest notifications. Instead, start
+	 * polling on this vq by adding it to the round-robin list.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (list_empty(&vq->vqpoll.link)) {
+			list_add_tail(&vq->vqpoll.link,
+				&vq->dev->vqpoll_list);
+			vq->vqpoll.jiffies_last_kick = jiffies;
+		}
+		return false;
+	}
+
 	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
 		return false;
 	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
@@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
 	int r;
 
+	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
+	 * will generate notifications even if the guest is asked not to send
+	 * them. So we must remove it from the round-robin polling list.
+	 * Note that vqpoll.enabled remains set.
+	 */
+	if (vq->vqpoll.enabled) {
+		if (!list_empty(&vq->vqpoll.link))
+			list_del_init(&vq->vqpoll.link);
+		return;
+	}
+
 	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
 		return;
 	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..11aaaf4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
 	int			  flushing;
 	unsigned		  queue_seq;
 	unsigned		  done_seq;
+	struct vhost_virtqueue    *vq;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -37,11 +38,12 @@ struct vhost_poll {
 	struct vhost_dev	 *dev;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
+							vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue  *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -54,8 +56,6 @@ struct vhost_log {
 	u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -110,6 +110,35 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct {
+      /* When a virtqueue is in vqpoll.enabled mode, it declares
+       * that instead of using guest notifications (kicks) to
+       * discover new work, we prefer to continuously poll this
+       * virtqueue in the worker thread.
+       * If !enabled, the rest of the fields below are undefined.
+       */
+		bool enabled;
+      /* vqpoll.enabled doesn't always mean that this virtqueue is
+       * actually being polled: The backend (e.g., net.c) may
+       * temporarily disable it using vhost_disable/enable_notify().
+       * vqpoll.link is used to maintain the thread's round-robin
+       * list of virtqueues that actually need to be polled.
+       * Note list_empty(link) means this virtqueue isn't polled.
+       */
+		struct list_head link;
+      /* If this flag is true, the virtqueue is being shut down,
+       * so vqpoll should not be re-enabled.
+       */
+		bool shutdown;
+      /* Various counters used to decide when to enter polling mode
+       * or leave it and return to notification mode.
+       */
+		unsigned long jiffies_last_kick;
+		unsigned long jiffies_last_work;
+		int work_this_jiffy;
+		struct page *avail_page;
+		volatile struct vring_avail *avail_mapped;
+	} vqpoll;
 };
 
 struct vhost_dev {
@@ -123,6 +152,7 @@ struct vhost_dev {
 	spinlock_t work_lock;
 	struct list_head work_list;
 	struct task_struct *worker;
+	struct list_head vqpoll_list;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-10  8:30 ` Razya Ladelsky
@ 2014-08-10 19:45     ` Michael S. Tsirkin
  2014-08-20  8:41     ` Christian Borntraeger
  2014-08-20 10:57     ` Michael S. Tsirkin
  2 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-10 19:45 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: kvm, GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon, linux-kernel,
	netdev, virtualization

On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> From: Razya Ladelsky <razya@il.ibm.com>
> Date: Thu, 31 Jul 2014 09:47:20 +0300
> Subject: [PATCH] vhost: Add polling mode
> 
> When vhost is waiting for buffers from the guest driver (e.g., more packets to
> send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> (and possibly userspace involvement in translating this PIO exit into a file
> descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can continuously
> poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled via a
> kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to disable
> notification (kicks), and the worker thread continuously checks for new buffers.
> When it does discover new buffers, it simulates a "kick" by invoking the
> underlying backend driver (such as vhost-net), which thinks it got a real kick
> from the guest, and acts accordingly. If the underlying driver asks not to be
> kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has work to do. Polling on
> this virtqueue is later disabled after 3 seconds of polling turning up no new
> work, as in this case we are better off returning to the exit-based notification
> mechanism. The default timeout of 3 seconds can be changed with the
> "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes a lot of sense for new HW with posted-interrupts for
> which we have exitless host-to-guest notifications. But even with support for
> posted interrupts, guest-to-host communication still causes exits. Polling adds
> the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the various
> vhost threads to poll their guests' devices. For these scenarios, we plan to add
> support for vhost threads that can be shared by multiple devices, even of
> multiple vms.
> Our ultimate goal is to implement the I/O acceleration features described in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
> I ran some experiments with TCP stream netperf and filebench (having 2 threads
> performing random reads) benchmarks on an IBM System x3650 M4.
> I have two machines, A and B. A hosts the vms, B runs the netserver.
> The vms (on A) run netperf, its destination server is running on B.
> All runs loaded the guests in a way that they were (cpu) saturated. For example,
> I ran netperf with 64B messages, which is heavily loading the vm (which is why
> its throughput is low).
> The idea was to get it 100% loaded, so we can see that the polling is getting it
> to produce higher throughput.

And, did your tests actually produce 100% load on both host CPUs?

> The system had two cores per guest, so as to allow both the vcpu and the vhost
> thread to run concurrently for maximum throughput (but I didn't pin the threads
> to specific cores).
> My experiments were fair in the sense that in both cases, with or without
> polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that
> way). The only difference was whether polling was enabled or disabled.
> 
> Results:
> 
> Netperf, 1 vm:
> The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> Number of exits/sec decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf
> (4086 MB/sec -> 5545 MB/sec).
> 
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> 31%.
> The same experiment with 3 vms running filebench showed similar numbers.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    6 +-
>  drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   38 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	}
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +			vqs[VHOST_NET_VQ_TX]);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +			vqs[VHOST_NET_VQ_RX]);
>  
>  	f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..665eeeb 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  	if (!vqs)
>  		goto err_vqs;
>  
> -	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> -	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +	vhost_work_init(&vs->vs_completion_work, NULL,
> +						vhost_scsi_complete_cmd_work);
> +	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>  	vs->vs_events_nr = 0;
>  	vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..fbe8174 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>  
>  enum {
>  	VHOST_MEMORY_MAX_NREGIONS = 64,
> @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>  	return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn)
>  {
>  	INIT_LIST_HEAD(&work->node);
>  	work->fn = fn;
>  	init_waitqueue_head(&work->done);
>  	work->flushing = 0;
>  	work->queue_seq = work->done_seq = 0;
> +	work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev)
> +		     unsigned long mask, struct vhost_virtqueue *vq)
>  {
>  	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>  	init_poll_funcptr(&poll->table, vhost_poll_func);
>  	poll->mask = mask;
> -	poll->dev = dev;
> +	poll->dev = vq->dev;
>  	poll->wqh = NULL;
> -
> -	vhost_work_init(&poll->work, fn);
> +	vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode tells the guest not to notify ("kick") us when it
> + * has made more work available on this virtqueue; rather, we will continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when vhost
> + * notices that this virtqueue's usage pattern significantly changed in a way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (vq->vqpoll.enabled)
> +		return; /* already enabled, nothing to do */
> +	if (!vq->handle_kick)
> +		return; /* polling will be a waste of time if no callback! */
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +		/* vq has guest notifications enabled. Disable them,
> +		   and instead add vq to the polling list */
> +		vhost_disable_notify(vq->dev, vq);
> +		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> +	}
> +	vq->vqpoll.jiffies_last_kick = jiffies;
> +	__get_user(vq->avail_idx, &vq->avail->idx);
> +	vq->vqpoll.enabled = true;
> +
> +	/* Map userspace's vq->avail to the kernel's memory space. */
> +	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +		&vq->vqpoll.avail_page) != 1) {
> +		/* TODO: can this happen, as we check access
> +		to vq->avail in advance? */
> +		BUG();
> +	}
> +	vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +		(unsigned long)kmap(vq->vqpoll.avail_page) |
> +		((unsigned long)vq->avail & ~PAGE_MASK));
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (!vq->vqpoll.enabled)
> +		return; /* already disabled, nothing to do */
> +
> +	vq->vqpoll.enabled = false;
> +
> +	if (!list_empty(&vq->vqpoll.link)) {
> +		/* vq is on the polling list, remove it from this list and
> +		 * instead enable guest notifications. */
> +		list_del_init(&vq->vqpoll.link);
> +		if (unlikely(vhost_enable_notify(vq->dev, vq))
> +			&& !vq->vqpoll.shutdown) {
> +			/* Race condition: guest wrote before we enabled
> +			 * notification, so we'll never get a notification for
> +			 * this work - so continue polling mode for a while. */
> +			vhost_disable_notify(vq->dev, vq);
> +			vq->vqpoll.enabled = true;
> +			vhost_enable_notify(vq->dev, vq);
> +			return;
> +		}
> +	}
> +
> +	if (vq->vqpoll.avail_mapped) {
> +		kunmap(vq->vqpoll.avail_page);
> +		put_page(vq->vqpoll.avail_page);
> +		vq->vqpoll.avail_mapped = 0;
> +	}
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
>  	vq->memory = NULL;
> +	INIT_LIST_HEAD(&vq->vqpoll.link);
> +	vq->vqpoll.enabled = false;
> +	vq->vqpoll.shutdown = false;
> +	vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> +{
> +	struct vhost_virtqueue *vq;
> +	u16 avail_idx;
> +
> +	if (list_empty(list))
> +		return NULL;
> +
> +	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +	WARN_ON(!vq->vqpoll.enabled);
> +	list_move_tail(&vq->vqpoll.link, list);
> +
> +	/* See if there is any new work available from the guest. */
> +	/* TODO: can check the optional idx feature, and if we haven't
> +	* reached that idx yet, don't kick... */
> +	avail_idx = vq->vqpoll.avail_mapped->idx;
> +	if (avail_idx != vq->last_avail_idx)
> +		return vq;
> +
> +	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +		/* We've been polling this virtqueue for a long time with no
> +		* results, so switch back to guest notification
> +		*/
> +		vhost_vq_disable_vqpoll(vq);
> +	}
> +	return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
>  		spin_unlock_irq(&dev->work_lock);
>  
>  		if (work) {
> +			struct vhost_virtqueue *vq = work->vq;
>  			__set_current_state(TASK_RUNNING);
>  			work->fn(work);
> +			/* Keep track of the work rate, for deciding when to
> +			 * enable polling */
> +			if (vq) {
> +				if (vq->vqpoll.jiffies_last_work != jiffies) {
> +					vq->vqpoll.jiffies_last_work = jiffies;
> +					vq->vqpoll.work_this_jiffy = 0;
> +				}
> +				vq->vqpoll.work_this_jiffy++;
> +			}
> +			/* If vq is in the round-robin list of virtqueues being
> +			 * constantly checked by this thread, move vq to the end
> +			 * of the queue, because it had its fair chance now.
> +			 */
> +			if (vq && !list_empty(&vq->vqpoll.link)) {
> +				list_move_tail(&vq->vqpoll.link,
> +					&dev->vqpoll_list);
> +			}
> +			/* Otherwise, if this vq is looking for notifications
> +			 * but vq polling is not enabled for it, do it now.
> +			 */
> +			else if (poll_start_rate && vq && vq->handle_kick &&
> +				!vq->vqpoll.enabled &&
> +				!vq->vqpoll.shutdown &&
> +				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> +				vq->vqpoll.work_this_jiffy >=
> +					poll_start_rate) {
> +				vhost_vq_enable_vqpoll(vq);
> +			}
> +		}
> +		/* Check one virtqueue from the round-robin list */
> +		if (!list_empty(&dev->vqpoll_list)) {
> +			struct vhost_virtqueue *vq;
> +
> +			vq = roundrobin_poll(&dev->vqpoll_list);
> +
> +			if (vq) {
> +				vq->handle_kick(&vq->poll.work);
> +				vq->vqpoll.jiffies_last_kick = jiffies;
> +			}
> +
> +			/* If our polling list isn't empty, ask to continue
> +			 * running this thread, don't yield.
> +			 */
> +			__set_current_state(TASK_RUNNING);
>  			if (need_resched())
>  				schedule();
> -		} else
> -			schedule();
> +		} else {
> +			if (work) {
> +				if (need_resched())
> +					schedule();
> +			} else
> +				schedule();
> +		}
>  
>  	}
>  	unuse_mm(dev->mm);
> @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> +	INIT_LIST_HEAD(&dev->vqpoll_list);
>  	dev->worker = NULL;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
> @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> -					POLLIN, dev);
> +					POLLIN, vq);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
> @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
>  	struct vhost_attach_cgroups_struct attach;
>  
>  	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> +	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
>  	vhost_work_queue(dev, &attach.work);
>  	vhost_work_flush(dev, &attach.work);
>  	return attach.ret;
> @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> + * mode for a given virtqueue which is itself being shut down. We ask the
> + * worker thread to do this rather than doing it directly, so that we don't
> + * race with the worker thread's use of the queue.
> + */
> +static void shutdown_vqpoll_work(struct vhost_work *work)
> +{
> +	work->vq->vqpoll.shutdown = true;
> +	vhost_vq_disable_vqpoll(work->vq);
> +	WARN_ON(work->vq->vqpoll.avail_mapped);
> +}
> +
> +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_work work;
> +
> +	vhost_work_init(&work, vq, shutdown_vqpoll_work);
> +	vhost_work_queue(vq->dev, &work);
> +	vhost_work_flush(vq->dev, &work);
> +}
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  			eventfd_ctx_put(dev->vqs[i]->call_ctx);
>  		if (dev->vqs[i]->call)
>  			fput(dev->vqs[i]->call);
> +		shutdown_vqpoll(dev->vqs[i]);
>  		vhost_vq_reset(dev, dev->vqs[i]);
>  	}
>  	vhost_dev_free_iovecs(dev);
> @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	u16 avail_idx;
>  	int r;
>  
> +	/* In polling mode, when the backend (e.g., net.c) asks to enable
> +	 * notifications, we don't enable guest notifications. Instead, start
> +	 * polling on this vq by adding it to the round-robin list.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (list_empty(&vq->vqpoll.link)) {
> +			list_add_tail(&vq->vqpoll.link,
> +				&vq->dev->vqpoll_list);
> +			vq->vqpoll.jiffies_last_kick = jiffies;
> +		}
> +		return false;
> +	}
> +
>  	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>  		return false;
>  	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
>  	int r;
>  
> +	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
> +	 * will generate notifications even if the guest is asked not to send
> +	 * them. So we must remove it from the round-robin polling list.
> +	 * Note that vqpoll.enabled remains set.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (!list_empty(&vq->vqpoll.link))
> +			list_del_init(&vq->vqpoll.link);
> +		return;
> +	}
> +
>  	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>  		return;
>  	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654..11aaaf4 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -24,6 +24,7 @@ struct vhost_work {
>  	int			  flushing;
>  	unsigned		  queue_seq;
>  	unsigned		  done_seq;
> +	struct vhost_virtqueue    *vq;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -37,11 +38,12 @@ struct vhost_poll {
>  	struct vhost_dev	 *dev;
>  };
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev);
> +		     unsigned long mask, struct vhost_virtqueue  *vq);
>  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
>  void vhost_poll_stop(struct vhost_poll *poll);
>  void vhost_poll_flush(struct vhost_poll *poll);
> @@ -54,8 +56,6 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> -struct vhost_virtqueue;
> -
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -110,6 +110,35 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	struct {
> +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> +       * that instead of using guest notifications (kicks) to
> +       * discover new work, we prefer to continuously poll this
> +       * virtqueue in the worker thread.
> +       * If !enabled, the rest of the fields below are undefined.
> +       */
> +		bool enabled;
> +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> +       * actually being polled: The backend (e.g., net.c) may
> +       * temporarily disable it using vhost_disable/enable_notify().
> +       * vqpoll.link is used to maintain the thread's round-robin
> +       * list of virtqueues that actually need to be polled.
> +       * Note list_empty(link) means this virtqueue isn't polled.
> +       */
> +		struct list_head link;
> +      /* If this flag is true, the virtqueue is being shut down,
> +       * so vqpoll should not be re-enabled.
> +       */
> +		bool shutdown;
> +      /* Various counters used to decide when to enter polling mode
> +       * or leave it and return to notification mode.
> +       */
> +		unsigned long jiffies_last_kick;
> +		unsigned long jiffies_last_work;
> +		int work_this_jiffy;
> +		struct page *avail_page;
> +		volatile struct vring_avail *avail_mapped;
> +	} vqpoll;
>  };
>  
>  struct vhost_dev {
> @@ -123,6 +152,7 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> -- 
> 1.7.9.5

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
@ 2014-08-10 19:45     ` Michael S. Tsirkin
  0 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-10 19:45 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: ERANRA, kvm, linux-kernel, GLIKSON, abel.gordon, YOSSIKU, JOELN,
	netdev, virtualization

On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> From: Razya Ladelsky <razya@il.ibm.com>
> Date: Thu, 31 Jul 2014 09:47:20 +0300
> Subject: [PATCH] vhost: Add polling mode
> 
> When vhost is waiting for buffers from the guest driver (e.g., more packets to
> send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> (and possibly userspace involvement in translating this PIO exit into a file
> descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can continuously
> poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled via a
> kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to disable
> notification (kicks), and the worker thread continuously checks for new buffers.
> When it does discover new buffers, it simulates a "kick" by invoking the
> underlying backend driver (such as vhost-net), which thinks it got a real kick
> from the guest, and acts accordingly. If the underlying driver asks not to be
> kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has work to do. Polling on
> this virtqueue is later disabled after 3 seconds of polling turning up no new
> work, as in this case we are better off returning to the exit-based notification
> mechanism. The default timeout of 3 seconds can be changed with the
> "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes a lot of sense for new HW with posted-interrupts for
> which we have exitless host-to-guest notifications. But even with support for
> posted interrupts, guest-to-host communication still causes exits. Polling adds
> the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the various
> vhost threads to poll their guests' devices. For these scenarios, we plan to add
> support for vhost threads that can be shared by multiple devices, even of
> multiple vms.
> Our ultimate goal is to implement the I/O acceleration features described in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
> I ran some experiments with TCP stream netperf and filebench (having 2 threads
> performing random reads) benchmarks on an IBM System x3650 M4.
> I have two machines, A and B. A hosts the vms, B runs the netserver.
> The vms (on A) run netperf, its destination server is running on B.
> All runs loaded the guests in a way that they were (cpu) saturated. For example,
> I ran netperf with 64B messages, which is heavily loading the vm (which is why
> its throughput is low).
> The idea was to get it 100% loaded, so we can see that the polling is getting it
> to produce higher throughput.

And, did your tests actually produce 100% load on both host CPUs?

> The system had two cores per guest, so as to allow both the vcpu and the vhost
> thread to run concurrently for maximum throughput (but I didn't pin the threads
> to specific cores).
> My experiments were fair in the sense that in both cases, with or without
> polling, I ran both threads, vcpu and vhost, on 2 cores (set their affinity that
> way). The only difference was whether polling was enabled or disabled.
> 
> Results:
> 
> Netperf, 1 vm:
> The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> Number of exits/sec decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf
> (4086 MB/sec -> 5545 MB/sec).
> 
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> 31%.
> The same experiment with 3 vms running filebench showed similar numbers.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    6 +-
>  drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   38 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	}
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +			vqs[VHOST_NET_VQ_TX]);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +			vqs[VHOST_NET_VQ_RX]);
>  
>  	f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..665eeeb 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  	if (!vqs)
>  		goto err_vqs;
>  
> -	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> -	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +	vhost_work_init(&vs->vs_completion_work, NULL,
> +						vhost_scsi_complete_cmd_work);
> +	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>  	vs->vs_events_nr = 0;
>  	vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..fbe8174 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>  
>  enum {
>  	VHOST_MEMORY_MAX_NREGIONS = 64,
> @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>  	return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn)
>  {
>  	INIT_LIST_HEAD(&work->node);
>  	work->fn = fn;
>  	init_waitqueue_head(&work->done);
>  	work->flushing = 0;
>  	work->queue_seq = work->done_seq = 0;
> +	work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev)
> +		     unsigned long mask, struct vhost_virtqueue *vq)
>  {
>  	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>  	init_poll_funcptr(&poll->table, vhost_poll_func);
>  	poll->mask = mask;
> -	poll->dev = dev;
> +	poll->dev = vq->dev;
>  	poll->wqh = NULL;
> -
> -	vhost_work_init(&poll->work, fn);
> +	vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode tells the guest not to notify ("kick") us when it
> + * has made more work available on this virtqueue; rather, we will continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when vhost
> + * notices that this virtqueue's usage pattern significantly changed in a way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (vq->vqpoll.enabled)
> +		return; /* already enabled, nothing to do */
> +	if (!vq->handle_kick)
> +		return; /* polling will be a waste of time if no callback! */
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +		/* vq has guest notifications enabled. Disable them,
> +		   and instead add vq to the polling list */
> +		vhost_disable_notify(vq->dev, vq);
> +		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> +	}
> +	vq->vqpoll.jiffies_last_kick = jiffies;
> +	__get_user(vq->avail_idx, &vq->avail->idx);
> +	vq->vqpoll.enabled = true;
> +
> +	/* Map userspace's vq->avail to the kernel's memory space. */
> +	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +		&vq->vqpoll.avail_page) != 1) {
> +		/* TODO: can this happen, as we check access
> +		to vq->avail in advance? */
> +		BUG();
> +	}
> +	vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +		(unsigned long)kmap(vq->vqpoll.avail_page) |
> +		((unsigned long)vq->avail & ~PAGE_MASK));
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (!vq->vqpoll.enabled)
> +		return; /* already disabled, nothing to do */
> +
> +	vq->vqpoll.enabled = false;
> +
> +	if (!list_empty(&vq->vqpoll.link)) {
> +		/* vq is on the polling list, remove it from this list and
> +		 * instead enable guest notifications. */
> +		list_del_init(&vq->vqpoll.link);
> +		if (unlikely(vhost_enable_notify(vq->dev, vq))
> +			&& !vq->vqpoll.shutdown) {
> +			/* Race condition: guest wrote before we enabled
> +			 * notification, so we'll never get a notification for
> +			 * this work - so continue polling mode for a while. */
> +			vhost_disable_notify(vq->dev, vq);
> +			vq->vqpoll.enabled = true;
> +			vhost_enable_notify(vq->dev, vq);
> +			return;
> +		}
> +	}
> +
> +	if (vq->vqpoll.avail_mapped) {
> +		kunmap(vq->vqpoll.avail_page);
> +		put_page(vq->vqpoll.avail_page);
> +		vq->vqpoll.avail_mapped = 0;
> +	}
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
>  	vq->memory = NULL;
> +	INIT_LIST_HEAD(&vq->vqpoll.link);
> +	vq->vqpoll.enabled = false;
> +	vq->vqpoll.shutdown = false;
> +	vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes dev->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> +{
> +	struct vhost_virtqueue *vq;
> +	u16 avail_idx;
> +
> +	if (list_empty(list))
> +		return NULL;
> +
> +	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +	WARN_ON(!vq->vqpoll.enabled);
> +	list_move_tail(&vq->vqpoll.link, list);
> +
> +	/* See if there is any new work available from the guest. */
> +	/* TODO: can check the optional idx feature, and if we haven't
> +	* reached that idx yet, don't kick... */
> +	avail_idx = vq->vqpoll.avail_mapped->idx;
> +	if (avail_idx != vq->last_avail_idx)
> +		return vq;
> +
> +	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +		/* We've been polling this virtqueue for a long time with no
> +		* results, so switch back to guest notification
> +		*/
> +		vhost_vq_disable_vqpoll(vq);
> +	}
> +	return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
>  		spin_unlock_irq(&dev->work_lock);
>  
>  		if (work) {
> +			struct vhost_virtqueue *vq = work->vq;
>  			__set_current_state(TASK_RUNNING);
>  			work->fn(work);
> +			/* Keep track of the work rate, for deciding when to
> +			 * enable polling */
> +			if (vq) {
> +				if (vq->vqpoll.jiffies_last_work != jiffies) {
> +					vq->vqpoll.jiffies_last_work = jiffies;
> +					vq->vqpoll.work_this_jiffy = 0;
> +				}
> +				vq->vqpoll.work_this_jiffy++;
> +			}
> +			/* If vq is in the round-robin list of virtqueues being
> +			 * constantly checked by this thread, move vq to the end
> +			 * of the queue, because it had its fair chance now.
> +			 */
> +			if (vq && !list_empty(&vq->vqpoll.link)) {
> +				list_move_tail(&vq->vqpoll.link,
> +					&dev->vqpoll_list);
> +			}
> +			/* Otherwise, if this vq is looking for notifications
> +			 * but vq polling is not enabled for it, do it now.
> +			 */
> +			else if (poll_start_rate && vq && vq->handle_kick &&
> +				!vq->vqpoll.enabled &&
> +				!vq->vqpoll.shutdown &&
> +				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> +				vq->vqpoll.work_this_jiffy >=
> +					poll_start_rate) {
> +				vhost_vq_enable_vqpoll(vq);
> +			}
> +		}
> +		/* Check one virtqueue from the round-robin list */
> +		if (!list_empty(&dev->vqpoll_list)) {
> +			struct vhost_virtqueue *vq;
> +
> +			vq = roundrobin_poll(&dev->vqpoll_list);
> +
> +			if (vq) {
> +				vq->handle_kick(&vq->poll.work);
> +				vq->vqpoll.jiffies_last_kick = jiffies;
> +			}
> +
> +			/* If our polling list isn't empty, ask to continue
> +			 * running this thread, don't yield.
> +			 */
> +			__set_current_state(TASK_RUNNING);
>  			if (need_resched())
>  				schedule();
> -		} else
> -			schedule();
> +		} else {
> +			if (work) {
> +				if (need_resched())
> +					schedule();
> +			} else
> +				schedule();
> +		}
>  
>  	}
>  	unuse_mm(dev->mm);
> @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> +	INIT_LIST_HEAD(&dev->vqpoll_list);
>  	dev->worker = NULL;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
> @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> -					POLLIN, dev);
> +					POLLIN, vq);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
> @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
>  	struct vhost_attach_cgroups_struct attach;
>  
>  	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> +	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
>  	vhost_work_queue(dev, &attach.work);
>  	vhost_work_flush(dev, &attach.work);
>  	return attach.ret;
> @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> + * mode for a given virtqueue which is itself being shut down. We ask the
> + * worker thread to do this rather than doing it directly, so that we don't
> + * race with the worker thread's use of the queue.
> + */
> +static void shutdown_vqpoll_work(struct vhost_work *work)
> +{
> +	work->vq->vqpoll.shutdown = true;
> +	vhost_vq_disable_vqpoll(work->vq);
> +	WARN_ON(work->vq->vqpoll.avail_mapped);
> +}
> +
> +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_work work;
> +
> +	vhost_work_init(&work, vq, shutdown_vqpoll_work);
> +	vhost_work_queue(vq->dev, &work);
> +	vhost_work_flush(vq->dev, &work);
> +}
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  			eventfd_ctx_put(dev->vqs[i]->call_ctx);
>  		if (dev->vqs[i]->call)
>  			fput(dev->vqs[i]->call);
> +		shutdown_vqpoll(dev->vqs[i]);
>  		vhost_vq_reset(dev, dev->vqs[i]);
>  	}
>  	vhost_dev_free_iovecs(dev);
> @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	u16 avail_idx;
>  	int r;
>  
> +	/* In polling mode, when the backend (e.g., net.c) asks to enable
> +	 * notifications, we don't enable guest notifications. Instead, start
> +	 * polling on this vq by adding it to the round-robin list.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (list_empty(&vq->vqpoll.link)) {
> +			list_add_tail(&vq->vqpoll.link,
> +				&vq->dev->vqpoll_list);
> +			vq->vqpoll.jiffies_last_kick = jiffies;
> +		}
> +		return false;
> +	}
> +
>  	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>  		return false;
>  	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
>  	int r;
>  
> +	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
> +	 * will generate notifications even if the guest is asked not to send
> +	 * them. So we must remove it from the round-robin polling list.
> +	 * Note that vqpoll.enabled remains set.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (!list_empty(&vq->vqpoll.link))
> +			list_del_init(&vq->vqpoll.link);
> +		return;
> +	}
> +
>  	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>  		return;
>  	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654..11aaaf4 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -24,6 +24,7 @@ struct vhost_work {
>  	int			  flushing;
>  	unsigned		  queue_seq;
>  	unsigned		  done_seq;
> +	struct vhost_virtqueue    *vq;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -37,11 +38,12 @@ struct vhost_poll {
>  	struct vhost_dev	 *dev;
>  };
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev);
> +		     unsigned long mask, struct vhost_virtqueue  *vq);
>  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
>  void vhost_poll_stop(struct vhost_poll *poll);
>  void vhost_poll_flush(struct vhost_poll *poll);
> @@ -54,8 +56,6 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> -struct vhost_virtqueue;
> -
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -110,6 +110,35 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	struct {
> +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> +       * that instead of using guest notifications (kicks) to
> +       * discover new work, we prefer to continuously poll this
> +       * virtqueue in the worker thread.
> +       * If !enabled, the rest of the fields below are undefined.
> +       */
> +		bool enabled;
> +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> +       * actually being polled: The backend (e.g., net.c) may
> +       * temporarily disable it using vhost_disable/enable_notify().
> +       * vqpoll.link is used to maintain the thread's round-robin
> +       * list of virtqueues that actually need to be polled.
> +       * Note list_empty(link) means this virtqueue isn't polled.
> +       */
> +		struct list_head link;
> +      /* If this flag is true, the virtqueue is being shut down,
> +       * so vqpoll should not be re-enabled.
> +       */
> +		bool shutdown;
> +      /* Various counters used to decide when to enter polling mode
> +       * or leave it and return to notification mode.
> +       */
> +		unsigned long jiffies_last_kick;
> +		unsigned long jiffies_last_work;
> +		int work_this_jiffy;
> +		struct page *avail_page;
> +		volatile struct vring_avail *avail_mapped;
> +	} vqpoll;
>  };
>  
>  struct vhost_dev {
> @@ -123,6 +152,7 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> -- 
> 1.7.9.5
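
For readers following the heuristic in the quoted patch, the sketch below
models in user space (not kernel code) when polling starts and stops for one
virtqueue. The struct and function names (vqpoll_model, tick_after,
model_note_work, model_idle_poll) are hypothetical; only the two thresholds
correspond to the patch's poll_start_rate and poll_stop_idle parameters. One
deviation is deliberate and is an assumption of this sketch: the idle check
uses a wrap-safe comparison in the style of the kernel's time_after(), whereas
the patch as posted compares jiffies with a plain '>'.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the per-virtqueue counters in the patch. */
struct vqpoll_model {
	bool enabled;                  /* polling mode currently on? */
	unsigned long last_work_tick;  /* tick in which work was last handled */
	unsigned long last_kick_tick;  /* tick of the last successful poll */
	int work_this_tick;            /* work items handled in last_work_tick */
};

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Account for one handled work item at tick `now`.
 * Returns true if this item pushed the rate over start_rate and polling
 * was switched on. A start_rate of 0 means "never start polling". */
static bool model_note_work(struct vqpoll_model *m, unsigned long now,
			    int start_rate)
{
	if (m->last_work_tick != now) {
		/* First work item in a new tick: restart the counter. */
		m->last_work_tick = now;
		m->work_this_tick = 0;
	}
	m->work_this_tick++;

	if (start_rate && !m->enabled && m->work_this_tick >= start_rate) {
		m->enabled = true;
		m->last_kick_tick = now;
		return true;
	}
	return false;
}

/* Called when a poll at tick `now` found no new buffers.
 * Returns true if polling was switched off because more than stop_idle
 * ticks have passed since the last successful poll. */
static bool model_idle_poll(struct vqpoll_model *m, unsigned long now,
			    unsigned long stop_idle)
{
	if (m->enabled && tick_after(now, m->last_kick_tick + stop_idle)) {
		m->enabled = false;
		return true;
	}
	return false;
}
```

With start_rate = 2, the second work item handled within the same tick turns
polling on; with stop_idle = 300, an empty poll turns it off again only once
more than 300 ticks have passed since the last successful poll. In kernel code
the equivalent idle check would presumably be written with time_after() from
<linux/jiffies.h>.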

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-10 19:45     ` Michael S. Tsirkin
@ 2014-08-11 19:46       ` David Miller
  -1 siblings, 0 replies; 55+ messages in thread
From: David Miller @ 2014-08-11 19:46 UTC (permalink / raw)
  To: mst
  Cc: razya, kvm, GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon,
	linux-kernel, netdev, virtualization

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 10 Aug 2014 21:45:59 +0200

> On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
 ...
> And, did your tests actually produce 100% load on both host CPUs?
 ...

Michael, please do not quote an entire patch just to ask a one line
question.

I truly, truly, wish it was simpler in modern email clients to delete
the unrelated quoted material because I bet when people do this they
are simply being lazy.

Thank you.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-11 19:46       ` David Miller
@ 2014-08-12  9:18         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-12  9:18 UTC (permalink / raw)
  To: David Miller
  Cc: razya, kvm, GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon,
	linux-kernel, netdev, virtualization

On Mon, Aug 11, 2014 at 12:46:21PM -0700, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Sun, 10 Aug 2014 21:45:59 +0200
> 
> > On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
>  ...
> > And, did your tests actually produce 100% load on both host CPUs?
>  ...
> 
> Michael, please do not quote an entire patch just to ask a one line
> question.
> 
> I truly, truly, wish it was simpler in modern email clients to delete
> the unrelated quoted material because I bet when people do this they
> are simply being lazy.
> 
> Thank you.

Lazy - mea culpa, though I'm using mutt so it isn't even hard.

The question still stands: the test results are only valid
if CPU was at 100% in all configurations.
This is the reason I generally prefer it when people report
throughput divided by CPU (power would be good too but it still
isn't easy for people to get that number).

-- 
MST


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-12  9:18         ` Michael S. Tsirkin
@ 2014-08-12 10:57           ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-12 10:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

"Michael S. Tsirkin" <mst@redhat.com> wrote on 12/08/2014 12:18:50 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: David Miller <davem@davemloft.net>
> Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, Alex 
> Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi 
> Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> abel.gordon@gmail.com, linux-kernel@vger.kernel.org, 
> netdev@vger.kernel.org, virtualization@lists.linux-foundation.org
> Date: 12/08/2014 12:18 PM
> Subject: Re: [PATCH] vhost: Add polling mode
> 
> On Mon, Aug 11, 2014 at 12:46:21PM -0700, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > Date: Sun, 10 Aug 2014 21:45:59 +0200
> > 
> > > On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> >  ...
> > > And, did your tests actually produce 100% load on both host CPUs?
> >  ...
> > 
> > Michael, please do not quote an entire patch just to ask a one line
> > question.
> > 
> > I truly, truly, wish it was simpler in modern email clients to delete
> > the unrelated quoted material because I bet when people do this they
> > are simply being lazy.
> > 
> > Thank you.
> 
> Lazy - mea culpa, though I'm using mutt so it isn't even hard.
> 
> The question still stands: the test results are only valid
> if CPU was at 100% in all configurations.
> This is the reason I generally prefer it when people report
> throughput divided by CPU (power would be good too but it still
> isn't easy for people to get that number).
> 

Hi Michael,

Sorry for the delay, had some problems with my mailbox, and I realized just
now that my reply wasn't sent.
The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or not.
The vhost thread utilized less than 100% (of the other cpu) when polling was
disabled.
Enabling polling increased its utilization to 100% (in which case both cpus
were 100% utilized).


> -- 
> MST
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-12 10:57           ` Razya Ladelsky
@ 2014-08-13 12:15             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-13 12:15 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

On Tue, Aug 12, 2014 at 01:57:05PM +0300, Razya Ladelsky wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 12/08/2014 12:18:50 PM:
> 
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > To: David Miller <davem@davemloft.net>
> > Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, Alex 
> > Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Yossi 
> > Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> > abel.gordon@gmail.com, linux-kernel@vger.kernel.org, 
> > netdev@vger.kernel.org, virtualization@lists.linux-foundation.org
> > Date: 12/08/2014 12:18 PM
> > Subject: Re: [PATCH] vhost: Add polling mode
> > 
> > On Mon, Aug 11, 2014 at 12:46:21PM -0700, David Miller wrote:
> > > From: "Michael S. Tsirkin" <mst@redhat.com>
> > > Date: Sun, 10 Aug 2014 21:45:59 +0200
> > > 
> > > > On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> > >  ...
> > > > And, did your tests actually produce 100% load on both host CPUs?
> > >  ...
> > > 
> > > Michael, please do not quote an entire patch just to ask a one line
> > > question.
> > > 
> > > I truly, truly, wish it was simpler in modern email clients to delete
> > > the unrelated quoted material because I bet when people do this they
> > > are simply being lazy.
> > > 
> > > Thank you.
> > 
> > Lazy - mea culpa, though I'm using mutt so it isn't even hard.
> > 
> > The question still stands: the test results are only valid
> > if CPU was at 100% in all configurations.
> > This is the reason I generally prefer it when people report
> > throughput divided by CPU (power would be good too but it still
> > isn't easy for people to get that number).
> > 
> 
> Hi Michael,
> 
> Sorry for the delay, had some problems with my mailbox, and I realized just
> now that my reply wasn't sent.
> The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or not.
> The vhost thread utilized less than 100% (of the other cpu) when polling was
> disabled.
> Enabling polling increased its utilization to 100% (in which case both cpus
> were 100% utilized).

Hmm this means the testing wasn't successful then, as you said:

	The idea was to get it 100% loaded, so we can see that the polling is
	getting it to produce higher throughput.

in fact here you are producing more throughput but spending more power
to produce it, which can have any number of explanations besides polling
improving the efficiency. For example, increasing system load might
disable host power management.
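
The throughput-per-CPU argument can be made concrete with the figures posted
in this thread. The following is a minimal sketch, assuming the vcpu thread
is 100% busy in both runs and the vhost thread is 100% busy with polling, as
reported above. The vhost thread's utilization without polling was only given
as "less than 100%", so it is left as an unknown. Both function names are
hypothetical.

```c
/* Throughput per busy CPU. cpus_busy is the sum of per-thread utilizations,
 * e.g. 1.0 for the vcpu plus 0.6 for a vhost thread that is 60% busy. */
static double throughput_per_cpu(double throughput, double cpus_busy)
{
	return throughput / cpus_busy;
}

/* Break-even utilization u (0..1) of the vhost thread in the non-polling
 * run. Solves
 *     tput_nopoll / (1 + u) == tput_poll / 2
 * for u. If the real utilization was above this value, the polling run is
 * also the more efficient one per CPU; below it, the extra throughput was
 * bought with proportionally more CPU time. */
static double breakeven_vhost_util(double tput_nopoll, double tput_poll)
{
	return 2.0 * tput_nopoll / tput_poll - 1.0;
}
```

For the single-VM netperf figures in the patch description (1516 MB/sec
without polling, 2046 MB/sec with polling), breakeven_vhost_util(1516.0,
2046.0) comes out at roughly 0.48. In other words, under the assumptions
above, polling only improves throughput per CPU if the vhost thread was
already more than about 48% busy without it.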


> > -- 
> > MST
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-13 12:15             ` Michael S. Tsirkin
@ 2014-08-17 12:35               ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-17 12:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, kvm-owner, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

> > 
> > Hi Michael,
> > 
> > Sorry for the delay, had some problems with my mailbox, and I realized
> > just now that my reply wasn't sent.
> > The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or
> > not.
> > The vhost thread utilized less than 100% (of the other cpu) when polling
> > was disabled.
> > Enabling polling increased its utilization to 100% (in which case both
> > cpus were 100% utilized).
> 
> Hmm this means the testing wasn't successful then, as you said:
> 
>    The idea was to get it 100% loaded, so we can see that the polling is
>    getting it to produce higher throughput.
> 
> in fact here you are producing more throughput but spending more power
> to produce it, which can have any number of explanations besides polling
> improving the efficiency. For example, increasing system load might
> disable host power management.
>

Hi Michael,
I re-ran the tests, this time with the "turbo mode" and "C-states"
features off.

No polling:
1 VM running netperf (msg size 64B): 1107 Mbits/sec
Polling:
1 VM running netperf (msg size 64B): 1572 Mbits/sec

As you can see from the new results, the numbers are lower,
but relatively (polling on/off) there's no change.
Thank you,
Razya


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-17 12:35               ` Razya Ladelsky
@ 2014-08-17 12:58                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-17 12:58 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, kvm-owner, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

On Sun, Aug 17, 2014 at 03:35:39PM +0300, Razya Ladelsky wrote:
> > > 
> > > Hi Michael,
> > > 
> > > Sorry for the delay, had some problems with my mailbox, and I realized
> > > just now that my reply wasn't sent.
> > > The vm indeed ALWAYS utilized 100% cpu, whether polling was enabled or
> > > not.
> > > The vhost thread utilized less than 100% (of the other cpu) when polling
> > > was disabled.
> > > Enabling polling increased its utilization to 100% (in which case both
> > > cpus were 100% utilized).
> > 
> > Hmm this means the testing wasn't successful then, as you said:
> > 
> >    The idea was to get it 100% loaded, so we can see that the polling is
> >    getting it to produce higher throughput.
> > 
> > in fact here you are producing more throughput but spending more power
> > to produce it, which can have any number of explanations besides polling
> > improving the efficiency. For example, increasing system load might
> > disable host power management.
> >
> 
> Hi Michael,
> I re-ran the tests, this time with the "turbo mode" and "C-states"
> features off.
>
> No polling:
> 1 VM running netperf (msg size 64B): 1107 Mbits/sec
> Polling:
> 1 VM running netperf (msg size 64B): 1572 Mbits/sec
>
> As you can see from the new results, the numbers are lower,
> but relatively (polling on/off) there's no change.
> Thank you,
> Razya

That was just one example. There are many other possibilities. Either
actually make the systems load all host CPUs equally, or divide
throughput by host CPU utilization.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-17 12:58                 ` Michael S. Tsirkin
@ 2014-08-19  8:36                   ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-19  8:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, kvm-owner, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

> That was just one example. There are many other possibilities. Either
> actually make the systems load all host CPUs equally, or divide
> throughput by host CPU utilization.
> 

The polling patch adds this capability to vhost, reducing costly exit 
overhead when the vm is loaded.

In order to load the vm I ran netperf with a msg size of 256:

Without polling: 2480 Mbits/sec, utilization: vm - 100%, vhost - 64%
With polling:    4160 Mbits/sec, utilization: vm - 100%, vhost - 100%

Therefore, throughput/cpu (Mbits/sec per percent of combined vm + vhost
utilization) is 2480/164 = 15.1 without polling, and 4160/200 = 20.8 with
polling.

My intention was to load vhost as close as possible to 100% utilization 
without polling, in order to compare it to the polling utilization case 
(where vhost is always 100%). 
The best use case, of course, would be when the shared vhost thread work 
(TBD) is integrated and then vhost will actually be using its polling 
cycles to handle requests of multiple devices (even from multiple vms).

Thanks,
Razya



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-10  8:30 ` Razya Ladelsky
@ 2014-08-20  8:41     ` Christian Borntraeger
  2014-08-20  8:41     ` Christian Borntraeger
  2014-08-20 10:57     ` Michael S. Tsirkin
  2 siblings, 0 replies; 55+ messages in thread
From: Christian Borntraeger @ 2014-08-20  8:41 UTC (permalink / raw)
  To: Razya Ladelsky, mst, kvm
  Cc: GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon, linux-kernel,
	netdev, virtualization

On 10/08/14 10:30, Razya Ladelsky wrote:
> From: Razya Ladelsky <razya@il.ibm.com>
> Date: Thu, 31 Jul 2014 09:47:20 +0300
> Subject: [PATCH] vhost: Add polling mode
> 
> When vhost is waiting for buffers from the guest driver (e.g., more packets to
> send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> (and possibly userspace involvement in translating this PIO exit into a file
> descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can continuously
> poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled via a
> kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to disable
> notification (kicks), and the worker thread continuously checks for new buffers.
> When it does discover new buffers, it simulates a "kick" by invoking the
> underlying backend driver (such as vhost-net), which thinks it got a real kick
> from the guest, and acts accordingly. If the underlying driver asks not to be
> kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has work to do. Polling on
> this virtqueue is later disabled after 3 seconds of polling turning up no new
> work, as in this case we are better off returning to the exit-based notification
> mechanism. The default timeout of 3 seconds can be changed with the
> "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes lot of sense for new HW with posted-interrupts for
> which we have exitless host-to-guest notifications. But even with support for
> posted interrupts, guest-to-host communication still causes exits. Polling adds
> the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the various
> vhost threads to poll their guests' devices. For these scenarios, we plan to add
> support for vhost threads that can be shared by multiple devices, even of
> multiple vms.
> Our ultimate goal is to implement the I/O acceleration features described in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
> I ran some experiments with TCP stream netperf and filebench (having 2 threads
> performing random reads) benchmarks on an IBM System x3650 M4.
> I have two machines, A and B. A hosts the vms, B runs the netserver.
> The vms (on A) run netperf, its destination server is running on B.
> All runs loaded the guests in a way that they were (cpu) saturated. For example,
> I ran netperf with 64B messages, which is heavily loading the vm (which is why
> its throughput is low).
> The idea was to get it 100% loaded, so we can see that the polling is getting it
> to produce higher throughput.
> 
> The system had two cores per guest, as to allow for both the vcpu and the vhost
> thread to run concurrently for maximum throughput (but I didn't pin the threads
> to specific cores).
> My experiments were fair in a sense that for both cases, with or without
> polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> way). The only difference was whether polling was enabled/disabled.
> 
> Results:
> 
> Netperf, 1 vm:
> The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> Number of exits/sec decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf
> (4086 MB/sec -> 5545 MB/sec).
> 
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> 31%.
> The same experiment with 3 vms running filebench showed similar numbers.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>

Gave it a quick try on s390/kvm. As expected, it makes no difference for a big streaming workload like iperf.
uperf with a 1-1 round robin indeed got faster, by about 30%.
The high CPU consumption is something that bothers me though, as virtualized systems tend to be full.


> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");

This seems ridiculously high. Even one jiffy is an eternity, so setting it to 1 as a default would reduce the CPU overhead for most cases.
If we don't have a packet in one millisecond, we can surely go back to the kick approach, I think.

Christian


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-20  8:41     ` Christian Borntraeger
@ 2014-08-20 10:32       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-20 10:32 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Razya Ladelsky, kvm, GLIKSON, ERANRA, YOSSIKU, JOELN,
	abel.gordon, linux-kernel, netdev, virtualization

On Wed, Aug 20, 2014 at 10:41:32AM +0200, Christian Borntraeger wrote:
> On 10/08/14 10:30, Razya Ladelsky wrote:
> > From: Razya Ladelsky <razya@il.ibm.com>
> > Date: Thu, 31 Jul 2014 09:47:20 +0300
> > Subject: [PATCH] vhost: Add polling mode
> > 
> > When vhost is waiting for buffers from the guest driver (e.g., more packets to
> > send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> > guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> > (and possibly userspace involvement in translating this PIO exit into a file
> > descriptor event), all of which hurts performance.
> > 
> > If the system is under-utilized (has cpu time to spare), vhost can continuously
> > poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled via a
> > kernel module parameter, "poll_start_rate".
> > 
> > When polling is active for a virtqueue, the guest is asked to disable
> > notification (kicks), and the worker thread continuously checks for new buffers.
> > When it does discover new buffers, it simulates a "kick" by invoking the
> > underlying backend driver (such as vhost-net), which thinks it got a real kick
> > from the guest, and acts accordingly. If the underlying driver asks not to be
> > kicked, we disable polling on this virtqueue.
> > 
> > We start polling on a virtqueue when we notice it has work to do. Polling on
> > this virtqueue is later disabled after 3 seconds of polling turning up no new
> > work, as in this case we are better off returning to the exit-based notification
> > mechanism. The default timeout of 3 seconds can be changed with the
> > "poll_stop_idle" kernel module parameter.
> > 
> > This polling approach makes lot of sense for new HW with posted-interrupts for
> > which we have exitless host-to-guest notifications. But even with support for
> > posted interrupts, guest-to-host communication still causes exits. Polling adds
> > the missing part.
> > 
> > When systems are overloaded, there won't be enough cpu time for the various
> > vhost threads to poll their guests' devices. For these scenarios, we plan to add
> > support for vhost threads that can be shared by multiple devices, even of
> > multiple vms.
> > Our ultimate goal is to implement the I/O acceleration features described in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> > 
> > I ran some experiments with TCP stream netperf and filebench (having 2 threads
> > performing random reads) benchmarks on an IBM System x3650 M4.
> > I have two machines, A and B. A hosts the vms, B runs the netserver.
> > The vms (on A) run netperf, its destination server is running on B.
> > All runs loaded the guests in a way that they were (cpu) saturated. For example,
> > I ran netperf with 64B messages, which is heavily loading the vm (which is why
> > its throughput is low).
> > The idea was to get it 100% loaded, so we can see that the polling is getting it
> > to produce higher throughput.
> > 
> > The system had two cores per guest, as to allow for both the vcpu and the vhost
> > thread to run concurrently for maximum throughput (but I didn't pin the threads
> > to specific cores).
> > My experiments were fair in a sense that for both cases, with or without
> > polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> > way). The only difference was whether polling was enabled/disabled.
> > 
> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> > 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> 
> Gave it a quick try on s390/kvm. As expected, it makes no difference for a big streaming workload like iperf.
> uperf with a 1-1 round robin indeed got faster, by about 30%.
> The high CPU consumption is something that bothers me though, as virtualized systems tend to be full.
> 
> 
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> 
> This seems ridiculously high. Even one jiffy is an eternity, so setting it to 1 as a default would reduce the CPU overhead for most cases.
> If we don't have a packet in one millisecond, we can surely go back to the kick approach, I think.
> 
> Christian


Seconded.
Could you publish data with different poll_stop_idle values?
Additionally, time in jiffies is not a reasonable userspace
API. Please switch to some reasonable unit, like microseconds.

Thinking more about it, isn't this almost exactly what net.core.busy_poll does?
That one suggests 50usec timeout.

The only difference I see is in poll_start_rate heuristic,
net.core does not have anything like this.
Do you have data to show that it's helpful - as opposed to just
starting polling whenever an event arrives?
If yes, might it be useful for net core as well?

Only setting timeout globally isn't friendly either.
Should be a tun ioctl similar to SO_BUSY_POLL.

-- 
MST

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
@ 2014-08-20 10:32       ` Michael S. Tsirkin
  0 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-20 10:32 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: ERANRA, kvm, linux-kernel, Razya Ladelsky, GLIKSON, YOSSIKU,
	abel.gordon, JOELN, netdev, virtualization

On Wed, Aug 20, 2014 at 10:41:32AM +0200, Christian Borntraeger wrote:
> On 10/08/14 10:30, Razya Ladelsky wrote:
> > From: Razya Ladelsky <razya@il.ibm.com>
> > Date: Thu, 31 Jul 2014 09:47:20 +0300
> > Subject: [PATCH] vhost: Add polling mode
> > 
> > When vhost is waiting for buffers from the guest driver (e.g., more packets to
> > send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> > guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> > (and possibly userspace involvement in translating this PIO exit into a file
> > descriptor event), all of which hurts performance.
> > 
> > If the system is under-utilized (has cpu time to spare), vhost can continuously
> > poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled via a
> > kernel module parameter, "poll_start_rate".
> > 
> > When polling is active for a virtqueue, the guest is asked to disable
> > notification (kicks), and the worker thread continuously checks for new buffers.
> > When it does discover new buffers, it simulates a "kick" by invoking the
> > underlying backend driver (such as vhost-net), which thinks it got a real kick
> > from the guest, and acts accordingly. If the underlying driver asks not to be
> > kicked, we disable polling on this virtqueue.
> > 
> > We start polling on a virtqueue when we notice it has work to do. Polling on
> > this virtqueue is later disabled after 3 seconds of polling turning up no new
> > work, as in this case we are better off returning to the exit-based notification
> > mechanism. The default timeout of 3 seconds can be changed with the
> > "poll_stop_idle" kernel module parameter.
> > 
> > This polling approach makes lot of sense for new HW with posted-interrupts for
> > which we have exitless host-to-guest notifications. But even with support for
> > posted interrupts, guest-to-host communication still causes exits. Polling adds
> > the missing part.
> > 
> > When systems are overloaded, there won't be enough cpu time for the various
> > vhost threads to poll their guests' devices. For these scenarios, we plan to add
> > support for vhost threads that can be shared by multiple devices, even of
> > multiple vms.
> > Our ultimate goal is to implement the I/O acceleration features described in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> > 
> > I ran some experiments with TCP stream netperf and filebench (having 2 threads
> > performing random reads) benchmarks on an IBM System x3650 M4.
> > I have two machines, A and B. A hosts the vms, B runs the netserver.
> > The vms (on A) run netperf, its destination server is running on B.
> > All runs loaded the guests in a way that they were (cpu) saturated. For example,
> > I ran netperf with 64B messages, which is heavily loading the vm (which is why
> > its throughput is low).
> > The idea was to get it 100% loaded, so we can see that the polling is getting it
> > to produce higher throughput.
> > 
> > The system had two cores per guest, as to allow for both the vcpu and the vhost
> > thread to run concurrently for maximum throughput (but I didn't pin the threads
> > to specific cores).
> > My experiments were fair in a sense that for both cases, with or without
> > polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> > way). The only difference was whether polling was enabled/disabled.
> > 
> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> > 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> 
> Gave it a quick try on s390/kvm. As expected it makes no difference for big streaming workload like iperf.
> uperf with a 1-1 round robin got indeed faster by about 30%.
> The high CPU consumption is something that bothers me though, as virtualized systems tend to be full.
> 
> 
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> 
> This seems ridiculously high. Even one jiffy is an eternity, so setting it to 1 as a default would reduce the CPU overhead for most cases.
> If we don't have a packet in one millisecond, we can surely go back to the kick approach, I think.
> 
> Christian


Seconded.
Could you publish data with different poll_stop_idle values?
Additionally, time in jiffies is not a reasonable userspace
API. Please switch to some reasonable unit, like microseconds.
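
To illustrate the unit problem, here is a minimal userspace sketch, not
kernel code. SKETCH_HZ and sketch_usecs_to_ticks() are made-up names, and
the tick rate of 250 is only an assumption for the example. It rounds a
microsecond timeout up to whole ticks:

```c
/* Assumed tick rate for this illustration only; the kernel's HZ is a
 * build-time constant and differs between configurations.
 */
#define SKETCH_HZ 250UL

/* Hypothetical helper: convert microseconds to ticks, rounding up so
 * that a nonzero timeout never collapses to zero ticks.
 */
static unsigned long sketch_usecs_to_ticks(unsigned long usecs)
{
	const unsigned long usecs_per_tick = 1000000UL / SKETCH_HZ;

	return (usecs + usecs_per_tick - 1) / usecs_per_tick;
}
```

Under these assumptions a 50 usec timeout becomes one full tick, i.e.
4 msec, so a jiffies-based parameter cannot express it, and the same
number means a different duration on a kernel built with another HZ.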

Thinking more about it, isn't this almost exactly what net.core.busy_poll does?
That one suggests a 50 usec timeout.

The only difference I see is in poll_start_rate heuristic,
net.core does not have anything like this.
Do you have data to show that it's helpful - as opposed to just
starting polling whenever an event arrives?
If yes, might it be useful for net core as well?
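
For reference, the start heuristic in the patch boils down to counting
work items within one jiffy. A minimal userspace sketch of that logic,
assuming hypothetical names (sketch_rate, sketch_should_start_polling)
and a caller-supplied tick value in place of jiffies:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-virtqueue rate tracking fields. */
struct sketch_rate {
	unsigned long last_tick;	/* tick in which work was last seen */
	int work_this_tick;		/* work items seen during that tick */
};

/* Record one work item at tick 'now' and report whether the rate has
 * reached start_rate items per tick. A start_rate of 0 never starts
 * polling, matching the description of poll_start_rate.
 */
static bool sketch_should_start_polling(struct sketch_rate *r,
					unsigned long now, int start_rate)
{
	if (r->last_tick != now) {
		r->last_tick = now;
		r->work_this_tick = 0;
	}
	r->work_this_tick++;

	return start_rate && r->work_this_tick >= start_rate;
}
```

With start_rate set to 1 this degenerates into starting polling on the
first event, which is the simpler policy the question above compares
against.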

Setting the timeout only globally isn't friendly either.
It should be a tun ioctl similar to SO_BUSY_POLL.

-- 
MST


* Re: [PATCH] vhost: Add polling mode
  2014-08-10  8:30 ` Razya Ladelsky
@ 2014-08-20 10:57     ` Michael S. Tsirkin
  2014-08-20  8:41     ` Christian Borntraeger
  2014-08-20 10:57     ` Michael S. Tsirkin
  2 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-20 10:57 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: kvm, GLIKSON, ERANRA, YOSSIKU, JOELN, abel.gordon, linux-kernel,
	netdev, virtualization

On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> From: Razya Ladelsky <razya@il.ibm.com>
> Date: Thu, 31 Jul 2014 09:47:20 +0300
> Subject: [PATCH] vhost: Add polling mode
> 
> When vhost is waiting for buffers from the guest driver (e.g., more packets to
> send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> (and possibly userspace involvement in translating this PIO exit into a file
> descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can continuously
> poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled via a
> kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to disable
> notification (kicks), and the worker thread continuously checks for new buffers.
> When it does discover new buffers, it simulates a "kick" by invoking the
> underlying backend driver (such as vhost-net), which thinks it got a real kick
> from the guest, and acts accordingly. If the underlying driver asks not to be
> kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has work to do. Polling on
> this virtqueue is later disabled after 3 seconds of polling turning up no new
> work, as in this case we are better off returning to the exit-based notification
> mechanism. The default timeout of 3 seconds can be changed with the
> "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes lot of sense for new HW with posted-interrupts for
> which we have exitless host-to-guest notifications. But even with support for
> posted interrupts, guest-to-host communication still causes exits. Polling adds
> the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the various
> vhost threads to poll their guests' devices. For these scenarios, we plan to add
> support for vhost threads that can be shared by multiple devices, even of
> multiple vms.
> Our ultimate goal is to implement the I/O acceleration features described in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
> I ran some experiments with TCP stream netperf and filebench (having 2 threads
> performing random reads) benchmarks on an IBM System x3650 M4.
> I have two machines, A and B. A hosts the vms, B runs the netserver.
> The vms (on A) run netperf, its destination server is running on B.
> All runs loaded the guests in a way that they were (cpu) saturated. For example,
> I ran netperf with 64B messages, which is heavily loading the vm (which is why
> its throughput is low).
> The idea was to get it 100% loaded, so we can see that the polling is getting it
> to produce higher throughput.
> 
> The system had two cores per guest, as to allow for both the vcpu and the vhost
> thread to run concurrently for maximum throughput (but I didn't pin the threads
> to specific cores).
> My experiments were fair in a sense that for both cases, with or without
> polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> way). The only difference was whether polling was enabled/disabled.
> 
> Results:
> 
> Netperf, 1 vm:
> The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> Number of exits/sec decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf
> (4086 MB/sec -> 5545 MB/sec).
> 
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> 31%.
> The same experiment with 3 vms running filebench showed similar numbers.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>

This really needs a more thorough benchmarking report, including
system data.  One good example for a related patch:
http://lwn.net/Articles/551179/
though for virtualization, we need data about the host as well, and if you
want to look at streaming benchmarks, you need to test different message
sizes and measure packet size.

For now, commenting on the patches assuming that will be forthcoming.

> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    6 +-
>  drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   38 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	}
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +			vqs[VHOST_NET_VQ_TX]);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +			vqs[VHOST_NET_VQ_RX]);
>  
>  	f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..665eeeb 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  	if (!vqs)
>  		goto err_vqs;
>  
> -	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> -	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +	vhost_work_init(&vs->vs_completion_work, NULL,
> +						vhost_scsi_complete_cmd_work);
> +	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>  	vs->vs_events_nr = 0;
>  	vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..fbe8174 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>  
>  enum {
>  	VHOST_MEMORY_MAX_NREGIONS = 64,

So how does one know whether the heuristic works?
We need some kind of counter here.
E.g.  sk_busy_loop uses
                        NET_ADD_STATS_BH(sock_net(sk),
                                         LINUX_MIB_BUSYPOLLRXPACKETS, rc);
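
A rough idea of what such counters could look like, as a userspace
sketch. The struct and function names are hypothetical and nothing like
this exists in the patch; how the numbers would be exported is left open:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-virtqueue polling statistics. */
struct vqpoll_stats {
	uint64_t polls;		/* times the avail index was checked */
	uint64_t hits;		/* checks that found new buffers */
	uint64_t idle_stops;	/* times polling was stopped for being idle */
};

/* Account for one check of the avail index. */
static void vqpoll_stats_record(struct vqpoll_stats *s, bool found_work)
{
	s->polls++;
	if (found_work)
		s->hits++;
}

/* Account for polling being switched off after the idle timeout. */
static void vqpoll_stats_record_idle_stop(struct vqpoll_stats *s)
{
	s->idle_stops++;
}

/* Percentage of polls that found work; 0 if nothing was polled yet. */
static unsigned int vqpoll_stats_hit_pct(const struct vqpoll_stats *s)
{
	if (!s->polls)
		return 0;
	return (unsigned int)(s->hits * 100 / s->polls);
}
```

A low hit percentage together with many idle stops would suggest the
heuristic is burning CPU without finding work.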


> @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>  	return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn)
>  {
>  	INIT_LIST_HEAD(&work->node);
>  	work->fn = fn;
>  	init_waitqueue_head(&work->done);
>  	work->flushing = 0;
>  	work->queue_seq = work->done_seq = 0;
> +	work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev)
> +		     unsigned long mask, struct vhost_virtqueue *vq)
>  {
>  	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>  	init_poll_funcptr(&poll->table, vhost_poll_func);
>  	poll->mask = mask;
> -	poll->dev = dev;
> +	poll->dev = vq->dev;
>  	poll->wqh = NULL;
> -
> -	vhost_work_init(&poll->work, fn);
> +	vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode it tells the guest not to notify ("kick") us when its
> + * has made more work available on this virtqueue; Rather, we will continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when vhost
> + * notices that this virtqueue's usage pattern significantly changed in a way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (vq->vqpoll.enabled)
> +		return; /* already enabled, nothing to do */
> +	if (!vq->handle_kick)
> +		return; /* polling will be a waste of time if no callback! */
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +		/* vq has guest notifications enabled. Disable them,
> +		   and instead add vq to the polling list */

Please fix up the multiline comment to match coding style.

> +		vhost_disable_notify(vq->dev, vq);
> +		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> +	}
> +	vq->vqpoll.jiffies_last_kick = jiffies;
> +	__get_user(vq->avail_idx, &vq->avail->idx);
> +	vq->vqpoll.enabled = true;
> +
> +	/* Map userspace's vq->avail to the kernel's memory space. */
> +	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +		&vq->vqpoll.avail_page) != 1) {
> +		/* TODO: can this happen, as we check access
> +		to vq->avail in advance? */

It can since you don't have the mm lock, so userspace can
unmap the page in this window. And especially since you didn't
check __get_user return code, so you don't even know that
it succeeded in the first place.
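
Something along these lines would be safer than BUG(). This is a
userspace sketch of the control flow only: sketch_vq, the function
pointer types and the stub_* helpers are hypothetical stand-ins for the
index read and the page pinning, not real vhost or mm interfaces.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the vqpoll state touched by the patch. */
struct sketch_vq {
	bool poll_enabled;
	uint16_t avail_idx;
	void *avail_mapped;
};

/* Hypothetical fallible steps; each returns 0 or a negative errno. */
typedef int (*read_idx_fn)(uint16_t *idx);
typedef int (*pin_page_fn)(void **mapped);

/* Enable polling only if every fallible step succeeds. On failure the
 * virtqueue state is left untouched and the error is returned to the
 * caller instead of crashing.
 */
static int sketch_enable_vqpoll(struct sketch_vq *vq, read_idx_fn read_idx,
				pin_page_fn pin_page)
{
	uint16_t idx;
	void *mapped;
	int r;

	if (vq->poll_enabled)
		return 0;

	r = read_idx(&idx);
	if (r)
		return r;

	r = pin_page(&mapped);
	if (r)
		return r;

	vq->avail_idx = idx;
	vq->avail_mapped = mapped;
	vq->poll_enabled = true;
	return 0;
}

/* Example stubs: a successful index read, a successful pin, and a pin
 * that fails as if the page had been unmapped in the meantime.
 */
static int stub_read_ok(uint16_t *idx)
{
	*idx = 7;
	return 0;
}

static int stub_pin_ok(void **mapped)
{
	static char page[16];

	*mapped = page;
	return 0;
}

static int stub_pin_fault(void **mapped)
{
	(void)mapped;
	return -EFAULT;
}
```

With the failing stub the function returns -EFAULT and the queue simply
stays in notification mode.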

> +		BUG();
> +	}
> +	vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +		(unsigned long)kmap(vq->vqpoll.avail_page) |
> +		((unsigned long)vq->avail & ~PAGE_MASK));
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (!vq->vqpoll.enabled)
> +		return; /* already disabled, nothing to do */
> +
> +	vq->vqpoll.enabled = false;
> +
> +	if (!list_empty(&vq->vqpoll.link)) {
> +		/* vq is on the polling list, remove it from this list and
> +		 * instead enable guest notifications. */
> +		list_del_init(&vq->vqpoll.link);
> +		if (unlikely(vhost_enable_notify(vq->dev, vq))
> +			&& !vq->vqpoll.shutdown) {
> +			/* Race condition: guest wrote before we enabled
> +			 * notification, so we'll never get a notification for
> +			 * this work - so continue polling mode for a while. */
> +			vhost_disable_notify(vq->dev, vq);
> +			vq->vqpoll.enabled = true;
> +			vhost_enable_notify(vq->dev, vq);
> +			return;
> +		}
> +	}
> +
> +	if (vq->vqpoll.avail_mapped) {
> +		kunmap(vq->vqpoll.avail_page);
> +		put_page(vq->vqpoll.avail_page);
> +		vq->vqpoll.avail_mapped = 0;
> +	}
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
>  	vq->memory = NULL;
> +	INIT_LIST_HEAD(&vq->vqpoll.link);
> +	vq->vqpoll.enabled = false;
> +	vq->vqpoll.shutdown = false;
> +	vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> +{
> +	struct vhost_virtqueue *vq;
> +	u16 avail_idx;
> +
> +	if (list_empty(list))
> +		return NULL;
> +
> +	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +	WARN_ON(!vq->vqpoll.enabled);
> +	list_move_tail(&vq->vqpoll.link, list);
> +
> +	/* See if there is any new work available from the guest. */
> +	/* TODO: can check the optional idx feature, and if we haven't
> +	* reached that idx yet, don't kick... */
> +	avail_idx = vq->vqpoll.avail_mapped->idx;
> +	if (avail_idx != vq->last_avail_idx)
> +		return vq;
> +
> +	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +		/* We've been polling this virtqueue for a long time with no
> +		* results, so switch back to guest notification
> +		*/
> +		vhost_vq_disable_vqpoll(vq);
> +	}
> +	return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
>  		spin_unlock_irq(&dev->work_lock);
>  
>  		if (work) {
> +			struct vhost_virtqueue *vq = work->vq;
>  			__set_current_state(TASK_RUNNING);
>  			work->fn(work);
> +			/* Keep track of the work rate, for deciding when to
> +			 * enable polling */
> +			if (vq) {
> +				if (vq->vqpoll.jiffies_last_work != jiffies) {
> +					vq->vqpoll.jiffies_last_work = jiffies;
> +					vq->vqpoll.work_this_jiffy = 0;
> +				}
> +				vq->vqpoll.work_this_jiffy++;
> +			}
> +			/* If vq is in the round-robin list of virtqueues being
> +			 * constantly checked by this thread, move vq the end
> +			 * of the queue, because it had its fair chance now.
> +			 */
> +			if (vq && !list_empty(&vq->vqpoll.link)) {
> +				list_move_tail(&vq->vqpoll.link,
> +					&dev->vqpoll_list);
> +			}
> +			/* Otherwise, if this vq is looking for notifications
> +			 * but vq polling is not enabled for it, do it now.
> +			 */
> +			else if (poll_start_rate && vq && vq->handle_kick &&
> +				!vq->vqpoll.enabled &&
> +				!vq->vqpoll.shutdown &&
> +				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> +				vq->vqpoll.work_this_jiffy >=
> +					poll_start_rate) {
> +				vhost_vq_enable_vqpoll(vq);
> +			}
> +		}
> +		/* Check one virtqueue from the round-robin list */
> +		if (!list_empty(&dev->vqpoll_list)) {
> +			struct vhost_virtqueue *vq;
> +
> +			vq = roundrobin_poll(&dev->vqpoll_list);
> +
> +			if (vq) {
> +				vq->handle_kick(&vq->poll.work);
> +				vq->vqpoll.jiffies_last_kick = jiffies;
> +			}
> +
> +			/* If our polling list isn't empty, ask to continue
> +			 * running this thread, don't yield.
> +			 */

This isn't friendly to other processes running on the same CPU,
or if we are trying to kill the process.
See sk_can_busy_loop.

I think you also want to insert cpu_relax somewhere.
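
Roughly what I have in mind, as a userspace sketch loosely modeled on
the idea behind sk_can_busy_loop. All names here are hypothetical, the
should_stop callback stands in for "reschedule needed or signal
pending", and sketch_cpu_relax() is only a compiler barrier (GCC/Clang
syntax) standing in for the kernel's cpu_relax():

```c
#include <stdbool.h>
#include <stdint.h>

/* Compiler barrier used as a placeholder for cpu_relax(). */
#define sketch_cpu_relax() __asm__ __volatile__("" ::: "memory")

struct sketch_poll_ctx {
	volatile uint16_t *avail_idx;	/* index written by the producer */
	uint16_t last_seen;		/* last index the consumer handled */
	bool (*should_stop)(void);	/* hypothetical stop condition */
};

/* Poll for at most max_spins iterations. Returns true if new work
 * appeared, false if the budget ran out or the caller was asked to
 * stop, in which case the caller should yield the CPU.
 */
static bool sketch_bounded_poll(struct sketch_poll_ctx *ctx,
				unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if (*ctx->avail_idx != ctx->last_seen)
			return true;
		if (ctx->should_stop && ctx->should_stop())
			return false;
		sketch_cpu_relax();
	}
	return false;
}

/* Example stop conditions for trying the sketch out. */
static bool stub_never_stop(void)
{
	return false;
}

static bool stub_always_stop(void)
{
	return true;
}
```

The point is that the loop has a bounded budget and an explicit way out,
so other tasks on the same CPU are not starved.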

> +			__set_current_state(TASK_RUNNING);
>  			if (need_resched())
>  				schedule();

> -		} else
> -			schedule();
> +		} else {
> +			if (work) {
> +				if (need_resched())
> +					schedule();
> +			} else
> +				schedule();
> +		}
>  
>  	}
>  	unuse_mm(dev->mm);
> @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> +	INIT_LIST_HEAD(&dev->vqpoll_list);
>  	dev->worker = NULL;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
> @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> -					POLLIN, dev);
> +					POLLIN, vq);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
> @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
>  	struct vhost_attach_cgroups_struct attach;
>  
>  	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> +	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
>  	vhost_work_queue(dev, &attach.work);
>  	vhost_work_flush(dev, &attach.work);
>  	return attach.ret;
> @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> + * mode for a given virtqueue which is itself being shut down. We ask the
> + * worker thread to do this rather than doing it directly, so that we don't
> + * race with the worker thread's use of the queue.
> + */
> +static void shutdown_vqpoll_work(struct vhost_work *work)
> +{
> +	work->vq->vqpoll.shutdown = true;
> +	vhost_vq_disable_vqpoll(work->vq);
> +	WARN_ON(work->vq->vqpoll.avail_mapped);
> +}
> +
> +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_work work;
> +
> +	vhost_work_init(&work, vq, shutdown_vqpoll_work);
> +	vhost_work_queue(vq->dev, &work);
> +	vhost_work_flush(vq->dev, &work);
> +}
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  			eventfd_ctx_put(dev->vqs[i]->call_ctx);
>  		if (dev->vqs[i]->call)
>  			fput(dev->vqs[i]->call);
> +		shutdown_vqpoll(dev->vqs[i]);
>  		vhost_vq_reset(dev, dev->vqs[i]);
>  	}
>  	vhost_dev_free_iovecs(dev);
> @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	u16 avail_idx;
>  	int r;
>  
> +	/* In polling mode, when the backend (e.g., net.c) asks to enable
> +	 * notifications, we don't enable guest notifications. Instead, start
> +	 * polling on this vq by adding it to the round-robin list.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (list_empty(&vq->vqpoll.link)) {
> +			list_add_tail(&vq->vqpoll.link,
> +				&vq->dev->vqpoll_list);
> +			vq->vqpoll.jiffies_last_kick = jiffies;
> +		}
> +		return false;
> +	}
> +
>  	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>  		return false;
>  	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
>  	int r;
>  
> +	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
> +	 * will generate notifications even if the guest is asked not to send
> +	 * them. So we must remove it from the round-robin polling list.
> +	 * Note that vqpoll.enabled remains set.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (!list_empty(&vq->vqpoll.link))
> +			list_del_init(&vq->vqpoll.link);
> +		return;
> +	}
> +
>  	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>  		return;
>  	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654..11aaaf4 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -24,6 +24,7 @@ struct vhost_work {
>  	int			  flushing;
>  	unsigned		  queue_seq;
>  	unsigned		  done_seq;
> +	struct vhost_virtqueue    *vq;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -37,11 +38,12 @@ struct vhost_poll {
>  	struct vhost_dev	 *dev;
>  };
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev);
> +		     unsigned long mask, struct vhost_virtqueue  *vq);
>  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
>  void vhost_poll_stop(struct vhost_poll *poll);
>  void vhost_poll_flush(struct vhost_poll *poll);
> @@ -54,8 +56,6 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> -struct vhost_virtqueue;
> -
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -110,6 +110,35 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	struct {
> +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> +       * that instead of using guest notifications (kicks) to
> +       * discover new work, we prefer to continuously poll this
> +       * virtqueue in the worker thread.
> +       * If !enabled, the rest of the fields below are undefined.
> +       */
> +		bool enabled;
> +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> +       * actually being polled: The backend (e.g., net.c) may
> +       * temporarily disable it using vhost_disable/enable_notify().
> +       * vqpoll.link is used to maintain the thread's round-robin
> +       * list of virtqueues that actually need to be polled.
> +       * Note list_empty(link) means this virtqueue isn't polled.
> +       */
> +		struct list_head link;
> +      /* If this flag is true, the virtqueue is being shut down,
> +       * so vqpoll should not be re-enabled.
> +       */
> +		bool shutdown;
> +      /* Various counters used to decide when to enter polling mode
> +       * or leave it and return to notification mode.
> +       */

Please align comments with fields.

> +		unsigned long jiffies_last_kick;
> +		unsigned long jiffies_last_work;
> +		int work_this_jiffy;
> +		struct page *avail_page;
> +		volatile struct vring_avail *avail_mapped;
> +	} vqpoll;
>  };
>  
>  struct vhost_dev {
> @@ -123,6 +152,7 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> -- 
> 1.7.9.5


* Re: [PATCH] vhost: Add polling mode
@ 2014-08-20 10:57     ` Michael S. Tsirkin
  0 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-20 10:57 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: ERANRA, kvm, linux-kernel, GLIKSON, abel.gordon, YOSSIKU, JOELN,
	netdev, virtualization

On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> From: Razya Ladelsky <razya@il.ibm.com>
> Date: Thu, 31 Jul 2014 09:47:20 +0300
> Subject: [PATCH] vhost: Add polling mode
> 
> When vhost is waiting for buffers from the guest driver (e.g., more packets to
> send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> (and possibly userspace involvement in translating this PIO exit into a file
> descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can continuously
> poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled via a
> kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to disable
> notification (kicks), and the worker thread continuously checks for new buffers.
> When it does discover new buffers, it simulates a "kick" by invoking the
> underlying backend driver (such as vhost-net), which thinks it got a real kick
> from the guest, and acts accordingly. If the underlying driver asks not to be
> kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has work to do. Polling on
> this virtqueue is later disabled after 3 seconds of polling turning up no new
> work, as in this case we are better off returning to the exit-based notification
> mechanism. The default timeout of 3 seconds can be changed with the
> "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes lot of sense for new HW with posted-interrupts for
> which we have exitless host-to-guest notifications. But even with support for
> posted interrupts, guest-to-host communication still causes exits. Polling adds
> the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the various
> vhost threads to poll their guests' devices. For these scenarios, we plan to add
> support for vhost threads that can be shared by multiple devices, even of
> multiple vms.
> Our ultimate goal is to implement the I/O acceleration features described in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
> I ran some experiments with TCP stream netperf and filebench (having 2 threads
> performing random reads) benchmarks on an IBM System x3650 M4.
> I have two machines, A and B. A hosts the vms, B runs the netserver.
> The vms (on A) run netperf, its destination server is running on B.
> All runs loaded the guests in a way that they were (cpu) saturated. For example,
> I ran netperf with 64B messages, which is heavily loading the vm (which is why
> its throughput is low).
> The idea was to get it 100% loaded, so we can see that the polling is getting it
> to produce higher throughput.
> 
> The system had two cores per guest, as to allow for both the vcpu and the vhost
> thread to run concurrently for maximum throughput (but I didn't pin the threads
> to specific cores).
> My experiments were fair in a sense that for both cases, with or without
> polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> way). The only difference was whether polling was enabled/disabled.
> 
> Results:
> 
> Netperf, 1 vm:
> The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> Number of exits/sec decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf
> (4086 MB/sec -> 5545 MB/sec).
> 
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> 31%.
> The same experiment with 3 vms running filebench showed similar numbers.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>

This really needs a more thorough benchmarking report, including
system data.  One good example for a related patch:
http://lwn.net/Articles/551179/
though for virtualization, we need data about the host as well, and if you
want to look at streaming benchmarks, you need to test different message
sizes and measure packet size.

For now, commenting on the patches assuming that will be forthcoming.

> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    6 +-
>  drivers/vhost/vhost.c |  245 +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   38 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>  	}
>  	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> -	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +			vqs[VHOST_NET_VQ_TX]);
> +	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +			vqs[VHOST_NET_VQ_RX]);
>  
>  	f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..665eeeb 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
>  	if (!vqs)
>  		goto err_vqs;
>  
> -	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> -	vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +	vhost_work_init(&vs->vs_completion_work, NULL,
> +						vhost_scsi_complete_cmd_work);
> +	vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>  	vs->vs_events_nr = 0;
>  	vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..fbe8174 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>  
>  enum {
>  	VHOST_MEMORY_MAX_NREGIONS = 64,

So how does one know whether the heuristic works?
We need some kind of counter here.
E.g.  sk_busy_loop uses
                        NET_ADD_STATS_BH(sock_net(sk),
                                         LINUX_MIB_BUSYPOLLRXPACKETS, rc);
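
As a rough illustration of the kind of counters meant here, a minimal
user-space sketch follows. The struct and helper names are hypothetical
(nothing like this exists in the patch or in vhost); the point is only
that counting polls, hits and mode transitions would let one judge
whether the heuristic pays off.

```c
#include <stdint.h>

/* Hypothetical per-virtqueue polling statistics; not part of the patch. */
struct vqpoll_stats {
	uint64_t polls;      /* times the avail index was checked */
	uint64_t hits;       /* checks that found new buffers */
	uint64_t enables;    /* transitions into polling mode */
	uint64_t idle_stops; /* transitions out of polling due to idleness */
};

/* Record one poll attempt and whether it found work. */
static void vqpoll_stats_poll(struct vqpoll_stats *s, int found_work)
{
	s->polls++;
	if (found_work)
		s->hits++;
}

/* Hit ratio in percent; 0 when nothing has been polled yet. */
static unsigned vqpoll_stats_hit_pct(const struct vqpoll_stats *s)
{
	return s->polls ? (unsigned)(s->hits * 100 / s->polls) : 0;
}
```

A low hit percentage combined with many idle_stops would suggest the
start threshold is too eager for the workload.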


> @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>  	return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn)
>  {
>  	INIT_LIST_HEAD(&work->node);
>  	work->fn = fn;
>  	init_waitqueue_head(&work->done);
>  	work->flushing = 0;
>  	work->queue_seq = work->done_seq = 0;
> +	work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev)
> +		     unsigned long mask, struct vhost_virtqueue *vq)
>  {
>  	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>  	init_poll_funcptr(&poll->table, vhost_poll_func);
>  	poll->mask = mask;
> -	poll->dev = dev;
> +	poll->dev = vq->dev;
>  	poll->wqh = NULL;
> -
> -	vhost_work_init(&poll->work, fn);
> +	vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode it tells the guest not to notify ("kick") us when its
> + * has made more work available on this virtqueue; Rather, we will continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when vhost
> + * notices that this virtqueue's usage pattern significantly changed in a way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (vq->vqpoll.enabled)
> +		return; /* already enabled, nothing to do */
> +	if (!vq->handle_kick)
> +		return; /* polling will be a waste of time if no callback! */
> +	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +		/* vq has guest notifications enabled. Disable them,
> +		   and instead add vq to the polling list */

Please fix up the multiline comment to match the kernel coding style.

> +		vhost_disable_notify(vq->dev, vq);
> +		list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> +	}
> +	vq->vqpoll.jiffies_last_kick = jiffies;
> +	__get_user(vq->avail_idx, &vq->avail->idx);
> +	vq->vqpoll.enabled = true;
> +
> +	/* Map userspace's vq->avail to the kernel's memory space. */
> +	if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +		&vq->vqpoll.avail_page) != 1) {
> +		/* TODO: can this happen, as we check access
> +		to vq->avail in advance? */

It can, since you don't hold the mm lock, so userspace can
unmap the page in this window. This matters all the more because you
didn't check the __get_user return code, so you don't even know that
it succeeded in the first place.
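
To make the shape of the fix concrete, here is a user-space model only,
not kernel code. The ops table and stub functions are hypothetical
stand-ins for __get_user() and get_user_pages_fast(); the sketch just
shows checking each fallible step and committing state only when all of
them succeed, instead of calling BUG().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the vqpoll state; field names echo the patch. */
struct vqpoll_model {
	bool enabled;
	uint16_t avail_idx;
	void *avail_mapped;
};

/* Hypothetical fallible steps standing in for __get_user() and
 * get_user_pages_fast(); each returns 0 or a negative errno.
 */
struct vq_ops {
	int (*read_avail_idx)(uint16_t *idx);
	int (*pin_avail)(void **mapped);
};

static int enable_vqpoll_checked(struct vqpoll_model *vq,
				 const struct vq_ops *ops)
{
	uint16_t idx;
	void *mapped;
	int r;

	if (vq->enabled)
		return 0; /* already enabled, nothing to do */

	r = ops->read_avail_idx(&idx);
	if (r)
		return r; /* stay in notification mode */

	r = ops->pin_avail(&mapped);
	if (r)
		return r; /* userspace may have unmapped the page */

	/* Commit only after every fallible step has succeeded. */
	vq->avail_idx = idx;
	vq->avail_mapped = mapped;
	vq->enabled = true;
	return 0;
}

/* Example stubs used to exercise both outcomes. */
static int stub_read_ok(uint16_t *idx) { *idx = 7; return 0; }
static int stub_pin_ok(void **m) { static int page; *m = &page; return 0; }
static int stub_pin_fail(void **m) { (void)m; return -EFAULT; }
```

On failure the caller would simply leave the virtqueue in notification
mode and possibly retry later.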

> +		BUG();
> +	}
> +	vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +		(unsigned long)kmap(vq->vqpoll.avail_page) |
> +		((unsigned long)vq->avail & ~PAGE_MASK));
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	if (!vq->vqpoll.enabled)
> +		return; /* already disabled, nothing to do */
> +
> +	vq->vqpoll.enabled = false;
> +
> +	if (!list_empty(&vq->vqpoll.link)) {
> +		/* vq is on the polling list, remove it from this list and
> +		 * instead enable guest notifications. */
> +		list_del_init(&vq->vqpoll.link);
> +		if (unlikely(vhost_enable_notify(vq->dev, vq))
> +			&& !vq->vqpoll.shutdown) {
> +			/* Race condition: guest wrote before we enabled
> +			 * notification, so we'll never get a notification for
> +			 * this work - so continue polling mode for a while. */
> +			vhost_disable_notify(vq->dev, vq);
> +			vq->vqpoll.enabled = true;
> +			vhost_enable_notify(vq->dev, vq);
> +			return;
> +		}
> +	}
> +
> +	if (vq->vqpoll.avail_mapped) {
> +		kunmap(vq->vqpoll.avail_page);
> +		put_page(vq->vqpoll.avail_page);
> +		vq->vqpoll.avail_mapped = 0;
> +	}
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
>  	vq->memory = NULL;
> +	INIT_LIST_HEAD(&vq->vqpoll.link);
> +	vq->vqpoll.enabled = false;
> +	vq->vqpoll.shutdown = false;
> +	vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> +{
> +	struct vhost_virtqueue *vq;
> +	u16 avail_idx;
> +
> +	if (list_empty(list))
> +		return NULL;
> +
> +	vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +	WARN_ON(!vq->vqpoll.enabled);
> +	list_move_tail(&vq->vqpoll.link, list);
> +
> +	/* See if there is any new work available from the guest. */
> +	/* TODO: can check the optional idx feature, and if we haven't
> +	* reached that idx yet, don't kick... */
> +	avail_idx = vq->vqpoll.avail_mapped->idx;
> +	if (avail_idx != vq->last_avail_idx)
> +		return vq;
> +
> +	if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +		/* We've been polling this virtqueue for a long time with no
> +		* results, so switch back to guest notification
> +		*/
> +		vhost_vq_disable_vqpoll(vq);
> +	}
> +	return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
>  		spin_unlock_irq(&dev->work_lock);
>  
>  		if (work) {
> +			struct vhost_virtqueue *vq = work->vq;
>  			__set_current_state(TASK_RUNNING);
>  			work->fn(work);
> +			/* Keep track of the work rate, for deciding when to
> +			 * enable polling */
> +			if (vq) {
> +				if (vq->vqpoll.jiffies_last_work != jiffies) {
> +					vq->vqpoll.jiffies_last_work = jiffies;
> +					vq->vqpoll.work_this_jiffy = 0;
> +				}
> +				vq->vqpoll.work_this_jiffy++;
> +			}
> +			/* If vq is in the round-robin list of virtqueues being
> +			 * constantly checked by this thread, move vq the end
> +			 * of the queue, because it had its fair chance now.
> +			 */
> +			if (vq && !list_empty(&vq->vqpoll.link)) {
> +				list_move_tail(&vq->vqpoll.link,
> +					&dev->vqpoll_list);
> +			}
> +			/* Otherwise, if this vq is looking for notifications
> +			 * but vq polling is not enabled for it, do it now.
> +			 */
> +			else if (poll_start_rate && vq && vq->handle_kick &&
> +				!vq->vqpoll.enabled &&
> +				!vq->vqpoll.shutdown &&
> +				!(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> +				vq->vqpoll.work_this_jiffy >=
> +					poll_start_rate) {
> +				vhost_vq_enable_vqpoll(vq);
> +			}
> +		}
> +		/* Check one virtqueue from the round-robin list */
> +		if (!list_empty(&dev->vqpoll_list)) {
> +			struct vhost_virtqueue *vq;
> +
> +			vq = roundrobin_poll(&dev->vqpoll_list);
> +
> +			if (vq) {
> +				vq->handle_kick(&vq->poll.work);
> +				vq->vqpoll.jiffies_last_kick = jiffies;
> +			}
> +
> +			/* If our polling list isn't empty, ask to continue
> +			 * running this thread, don't yield.
> +			 */

This isn't friendly to other processes running on the same CPU,
or if we are trying to kill the process.
See sk_can_busy_loop.

I think you also want to insert cpu_relax somewhere.
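
Something along these lines, sketched as a user-space model rather than
kernel code. All names here are hypothetical: poll_env stands in for
need_resched(), signal_pending() and a clock, and relax_hint() is only a
placeholder for cpu_relax(). The loop gives up when there is work, when
someone else wants the CPU, or when a spin budget runs out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel state; not real kernel APIs. */
struct poll_env {
	bool need_resched;   /* models need_resched() */
	bool signal_pending; /* models signal_pending(current) */
	uint64_t now;        /* models a monotonic tick counter */
};

struct ring {
	volatile uint16_t avail_idx; /* producer index, written by the guest */
	uint16_t last_avail_idx;     /* consumer index, owned by the poller */
};

enum poll_result { POLL_WORK, POLL_YIELD, POLL_IDLE_TIMEOUT };

/* Placeholder for cpu_relax(): here only a compiler-level fence. */
static inline void relax_hint(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

static enum poll_result bounded_poll(struct ring *r, struct poll_env *env,
				     uint64_t budget)
{
	uint64_t start = env->now;

	for (;;) {
		if (r->avail_idx != r->last_avail_idx)
			return POLL_WORK;         /* new buffers to handle */
		if (env->need_resched || env->signal_pending)
			return POLL_YIELD;        /* let others run or exit */
		if (env->now - start >= budget)
			return POLL_IDLE_TIMEOUT; /* fall back to kicks */
		relax_hint();
		env->now++; /* model: each spin advances the clock one tick */
	}
}
```

The caller would schedule() on POLL_YIELD and re-enable guest
notifications on POLL_IDLE_TIMEOUT.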

> +			__set_current_state(TASK_RUNNING);
>  			if (need_resched())
>  				schedule();

> -		} else
> -			schedule();
> +		} else {
> +			if (work) {
> +				if (need_resched())
> +					schedule();
> +			} else
> +				schedule();
> +		}
>  
>  	}
>  	unuse_mm(dev->mm);
> @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->mm = NULL;
>  	spin_lock_init(&dev->work_lock);
>  	INIT_LIST_HEAD(&dev->work_list);
> +	INIT_LIST_HEAD(&dev->vqpoll_list);
>  	dev->worker = NULL;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
> @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> -					POLLIN, dev);
> +					POLLIN, vq);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
> @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
>  	struct vhost_attach_cgroups_struct attach;
>  
>  	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> +	vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
>  	vhost_work_queue(dev, &attach.work);
>  	vhost_work_flush(dev, &attach.work);
>  	return attach.ret;
> @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> + * mode for a given virtqueue which is itself being shut down. We ask the
> + * worker thread to do this rather than doing it directly, so that we don't
> + * race with the worker thread's use of the queue.
> + */
> +static void shutdown_vqpoll_work(struct vhost_work *work)
> +{
> +	work->vq->vqpoll.shutdown = true;
> +	vhost_vq_disable_vqpoll(work->vq);
> +	WARN_ON(work->vq->vqpoll.avail_mapped);
> +}
> +
> +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_work work;
> +
> +	vhost_work_init(&work, vq, shutdown_vqpoll_work);
> +	vhost_work_queue(vq->dev, &work);
> +	vhost_work_flush(vq->dev, &work);
> +}
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  			eventfd_ctx_put(dev->vqs[i]->call_ctx);
>  		if (dev->vqs[i]->call)
>  			fput(dev->vqs[i]->call);
> +		shutdown_vqpoll(dev->vqs[i]);
>  		vhost_vq_reset(dev, dev->vqs[i]);
>  	}
>  	vhost_dev_free_iovecs(dev);
> @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	u16 avail_idx;
>  	int r;
>  
> +	/* In polling mode, when the backend (e.g., net.c) asks to enable
> +	 * notifications, we don't enable guest notifications. Instead, start
> +	 * polling on this vq by adding it to the round-robin list.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (list_empty(&vq->vqpoll.link)) {
> +			list_add_tail(&vq->vqpoll.link,
> +				&vq->dev->vqpoll_list);
> +			vq->vqpoll.jiffies_last_kick = jiffies;
> +		}
> +		return false;
> +	}
> +
>  	if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>  		return false;
>  	vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
>  	int r;
>  
> +	/* If this virtqueue is vqpoll.enabled, and on the polling list, it
> +	 * will generate notifications even if the guest is asked not to send
> +	 * them. So we must remove it from the round-robin polling list.
> +	 * Note that vqpoll.enabled remains set.
> +	 */
> +	if (vq->vqpoll.enabled) {
> +		if (!list_empty(&vq->vqpoll.link))
> +			list_del_init(&vq->vqpoll.link);
> +		return;
> +	}
> +
>  	if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>  		return;
>  	vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654..11aaaf4 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -24,6 +24,7 @@ struct vhost_work {
>  	int			  flushing;
>  	unsigned		  queue_seq;
>  	unsigned		  done_seq;
> +	struct vhost_virtqueue    *vq;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -37,11 +38,12 @@ struct vhost_poll {
>  	struct vhost_dev	 *dev;
>  };
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> +							vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -		     unsigned long mask, struct vhost_dev *dev);
> +		     unsigned long mask, struct vhost_virtqueue  *vq);
>  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
>  void vhost_poll_stop(struct vhost_poll *poll);
>  void vhost_poll_flush(struct vhost_poll *poll);
> @@ -54,8 +56,6 @@ struct vhost_log {
>  	u64 len;
>  };
>  
> -struct vhost_virtqueue;
> -
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>  	struct vhost_dev *dev;
> @@ -110,6 +110,35 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	struct {
> +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> +       * that instead of using guest notifications (kicks) to
> +       * discover new work, we prefer to continuously poll this
> +       * virtqueue in the worker thread.
> +       * If !enabled, the rest of the fields below are undefined.
> +       */
> +		bool enabled;
> +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> +       * actually being polled: The backend (e.g., net.c) may
> +       * temporarily disable it using vhost_disable/enable_notify().
> +       * vqpoll.link is used to maintain the thread's round-robin
> +       * list of virtqueues that actually need to be polled.
> +       * Note list_empty(link) means this virtqueue isn't polled.
> +       */
> +		struct list_head link;
> +      /* If this flag is true, the virtqueue is being shut down,
> +       * so vqpoll should not be re-enabled.
> +       */
> +		bool shutdown;
> +      /* Various counters used to decide when to enter polling mode
> +       * or leave it and return to notification mode.
> +       */

Please align comments with fields.

> +		unsigned long jiffies_last_kick;
> +		unsigned long jiffies_last_work;
> +		int work_this_jiffy;
> +		struct page *avail_page;
> +		volatile struct vring_avail *avail_mapped;
> +	} vqpoll;
>  };
>  
>  struct vhost_dev {
> @@ -123,6 +152,7 @@ struct vhost_dev {
>  	spinlock_t work_lock;
>  	struct list_head work_list;
>  	struct task_struct *worker;
> +	struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> -- 
> 1.7.9.5

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-19  8:36                   ` Razya Ladelsky
@ 2014-08-20 11:05                     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-08-20 11:05 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, David Miller, Eran Raichstein,
	Joel Nider, kvm, kvm-owner, linux-kernel, netdev, virtualization,
	Yossi Kuperman1

On Tue, Aug 19, 2014 at 11:36:31AM +0300, Razya Ladelsky wrote:
> > That was just one example. There are many other possibilities.  Either
> > actually make the systems load all host CPUs equally, or divide
> > throughput by host CPU.
> > 
> 
> The polling patch adds this capability to vhost, reducing costly exit 
> overhead when the vm is loaded.
> 
> In order to load the vm I ran netperf  with msg size of 256:
> 
> Without polling:  2480 Mbits/sec,  utilization: vm - 100%   vhost - 64% 
> With Polling: 4160 Mbits/sec,  utilization: vm - 100%   vhost - 100% 
> 
> Therefore, throughput/cpu without polling is 15.1, and 20.8 with polling.
> 

Can you please present results in a form that makes
it possible to see the effect on various configurations
and workloads?

Here's one example where this was done:
https://lkml.org/lkml/2014/8/14/495

You really should also provide data about your host
configuration (missing in the above link).
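
For reference, the throughput/cpu figures quoted above look like
throughput divided by the sum of the vm and vhost utilization
percentages. That reading is an assumption, since the formula was not
stated, but it reproduces both numbers. A small sketch (the helper name
is illustrative):

```c
/* Throughput per percentage point of CPU, summing vm and vhost
 * utilization. Assumed formula; it matches the quoted 15.1 and 20.8.
 */
static double throughput_per_cpu(double mbits_per_sec, double vm_util_pct,
				 double vhost_util_pct)
{
	return mbits_per_sec / (vm_util_pct + vhost_util_pct);
}
```

With the quoted inputs this gives 2480 / 164 and 4160 / 200, so the
per-cpu gain is roughly 1.38x while the raw throughput gain is roughly
1.68x. Presenting both across message sizes would make the trade-off
visible.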

> My intention was to load vhost as close as possible to 100% utilization 
> without polling, in order to compare it to the polling utilization case 
> (where vhost is always 100%). 
> The best use case, of course, would be when the shared vhost thread work 
> (TBD) is integrated and then vhost will actually be using its polling 
> cycles to handle requests of multiple devices (even from multiple vms).
> 
> Thanks,
> Razya


-- 
MST

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-20  8:41     ` Christian Borntraeger
@ 2014-08-21 13:53       ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-21 13:53 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, mst, netdev, virtualization, Yossi Kuperman1

Christian Borntraeger <borntraeger@de.ibm.com> wrote on 20/08/2014 11:41:32 AM:


> > 
> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits
> > was reduced by 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> 
> Gave it a quick try on s390/kvm. As expected it makes no difference 
> for big streaming workload like iperf.
> uperf with a 1-1 round robin got indeed faster by about 30%.
> The high CPU consumption is something that bothers me though, as 
> virtualized systems tend to be full.
> 
> 

Thanks for confirming the results!
The best way to use this patch would be along with a shared vhost thread
for multiple devices/vms, as described in:
http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument
This work assumes a dedicated I/O core where the vhost thread serves
multiple vms, which makes the high cpu utilization less of a concern.



> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> 
> This seems ridiculously high. Even one jiffy is an eternity, so
> setting it to 1 as a default would reduce the CPU overhead for most cases.
> If we don't have a packet in one millisecond, we can surely go back
> to the kick approach, I think.
> 
> Christian
> 

Good point, will reduce it and recheck.
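
For scale: how long one jiffy lasts depends on the kernel's HZ setting,
so "one millisecond" holds only at HZ=1000. A small user-space sketch of
the conversion, where the helper name is illustrative and not from the
patch:

```c
#include <stdint.h>

/* Convert a jiffy count to milliseconds for a given tick rate (HZ).
 * Plain arithmetic for illustration, not the kernel's own helper.
 */
static uint64_t jiffies_to_ms(uint64_t jiffies, unsigned hz)
{
	return jiffies * 1000u / hz;
}
```

So the current default of 3*HZ is always 3000 ms, while a default of 1
would mean 1 ms at HZ=1000, 4 ms at HZ=250 and 10 ms at HZ=100.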
Thank you,
Razya


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-20 10:57     ` Michael S. Tsirkin
@ 2014-08-21 14:23       ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-21 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, netdev, virtualization, Yossi Kuperman1

"Michael S. Tsirkin" <mst@redhat.com> wrote on 20/08/2014 01:57:10 PM:

> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits
> > was reduced by 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> 
> This really needs a more thorough benchmarking report, including
> system data.  One good example for a related patch:
> http://lwn.net/Articles/551179/
> though for virtualization, we need data about host as well, and if you
> want to look at streaming benchmarks, you need to test different message
> sizes and measure packet size.
>

Hi Michael,
I have already tried running netperf with several message sizes:
64,128,256,512,600,800...
But the results are inconsistent even in the baseline/unpatched
configuration.
For smaller msg sizes, I get consistent numbers. However, at some point,
when I increase the msg size, I get unstable results. For example, for a
512B msg, I get two scenarios:
vm utilization 100%, vhost utilization 75%, throughput ~6300
vm utilization 80%, vhost utilization 13%, throughput ~9400 (line rate)

I don't know why vhost is behaving that way for certain message sizes.
Do you have any insight into why this is happening?
Thank you,
Razya
 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH] vhost: Add polling mode
  2014-08-21 14:23       ` Razya Ladelsky
@ 2014-08-21 14:29         ` David Laight
  -1 siblings, 0 replies; 55+ messages in thread
From: David Laight @ 2014-08-21 14:29 UTC (permalink / raw)
  To: 'Razya Ladelsky', Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, netdev, virtualization, Yossi Kuperman1

From: Razya Ladelsky
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 20/08/2014 01:57:10 PM:
> 
> > > Results:
> > >
> > > Netperf, 1 vm:
> > > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > > Number of exits/sec decreased 6x.
> > > The same improvement was shown when I tested with 3 vms running netperf
> > > (4086 MB/sec -> 5545 MB/sec).
> > >
> > > filebench, 1 vm:
> > > ops/sec improved by 13% with the polling patch. Number of exits
> > > was reduced by 31%.
> > > The same experiment with 3 vms running filebench showed similar numbers.
> > >
> > > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> >
> > > This really needs a more thorough benchmarking report, including
> > system data.  One good example for a related patch:
> > http://lwn.net/Articles/551179/
> > though for virtualization, we need data about host as well, and if you
> > want to look at streaming benchmarks, you need to test different message
> > sizes and measure packet size.
> >
> 
> Hi Michael,
> I have already tried running netperf with several message sizes:
> 64,128,256,512,600,800...
> But the results are inconsistent even in the baseline/unpatched
> configuration.
> For smaller msg sizes, I get consistent numbers. However, at some point,
> when I increase the msg size
> I get unstable results. For example, for a 512B msg, I get two scenarios:
> vm utilization 100%, vhost utilization 75%, throughput ~6300
> vm utilization 80%, vhost utilization 13%, throughput ~9400 (line rate)
> 
> I don't know why vhost is behaving that way for certain message sizes.
> Do you have any insight into why this is happening?

Have you tried looking at the actual Ethernet packet sizes?
The traffic may well jump between using small packets (the size of the
writes) and full-sized ones.

If you are trying to measure Ethernet packet 'cost' you need to use UDP.
However, that probably uses different code paths.

	David




^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-21 13:53       ` Razya Ladelsky
  (?)
@ 2014-08-22  9:30       ` Zhang Haoyu
  -1 siblings, 0 replies; 55+ messages in thread
From: Zhang Haoyu @ 2014-08-22  9:30 UTC (permalink / raw)
  To: Razya Ladelsky, Christian Borntraeger, mashirle, Jason Wang,
	Michael S.Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, mst, netdev, virtualization, Yossi Kuperman1

>> > 
>> > Results:
>> > 
>> > Netperf, 1 vm:
>> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
>> > Number of exits/sec decreased 6x.
>> > The same improvement was shown when I tested with 3 vms running netperf
>> > (4086 MB/sec -> 5545 MB/sec).
>> > 
>> > filebench, 1 vm:
>> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
>> > 31%.
>> > The same experiment with 3 vms running filebench showed similar numbers.
>> > 
>> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
>> 
>> Gave it a quick try on s390/kvm. As expected it makes no difference 
>> for big streaming workload like iperf.
>> uperf with a 1-1 round robin got indeed faster by about 30%.
>> The high CPU consumption is something that bothers me though, as 
>> virtualized systems tend to be full.
>> 
>> 
>
>Thanks for confirming the results!
>The best way to use this patch would be along with a shared vhost thread
>for multiple devices/vms, as described in:
>http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument
>This work assumes having a dedicated I/O core where the vhost thread
>serves multiple vms, which makes the high cpu utilization less of a concern.
>
Hi, Razya, Shirley
I am going to test the combination of
"several (depending on the total number of CPUs on the host, e.g., total_number * 1/3) vhost threads serving all VMs" and "vhost: add polling mode".
I now have the patch "http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88682/focus=88723" posted by Shirley;
is there any update to this patch?

Also, I want to make a small change to this patch: create total_cpu_number * 1/N (N={3,4}) vhost threads instead of a per-cpu vhost thread to serve all VMs.
Any ideas?

Thanks,
Zhang Haoyu
>
>
>> > +static int poll_start_rate = 0;
>> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
>> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
>> > +
>> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
>> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
>> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>> 
>> This seems ridiculously high. Even one jiffy is an eternity, so
>> setting it to 1 as a default would reduce the CPU overhead for most cases.
>> If we don't have a packet in one millisecond, we can surely go back
>> to the kick approach, I think.
>> 
>> Christian
>> 
>
>Good point, will reduce it and recheck.
>Thank you,
>Razya
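
For reference on the timeout discussed above: poll_stop_idle is expressed in
jiffies, so 3*HZ is always 3 seconds, while a single jiffy is one millisecond
only when HZ is 1000. The following userspace sketch shows the conversion;
jiffies_to_usec_sketch() is a hypothetical helper written for illustration
(it is not the kernel's own conversion routine), and the HZ values used with
it are assumed examples.

```c
#include <stdint.h>

/* Hypothetical userspace helper: convert a jiffy count to microseconds
 * for a given tick rate `hz` (ticks per second). Rounded down. */
static uint64_t jiffies_to_usec_sketch(uint64_t jiffies, uint32_t hz)
{
	if (hz == 0)
		return 0;
	return jiffies * 1000000ULL / hz;
}
```

With HZ=1000, jiffies_to_usec_sketch(1, 1000) gives 1000 us, matching the
"one millisecond" remark; with HZ=100 the same single jiffy is 10000 us.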


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-21 13:53       ` Razya Ladelsky
  (?)
  (?)
@ 2014-08-22 10:01       ` Zhang Haoyu
  -1 siblings, 0 replies; 55+ messages in thread
From: Zhang Haoyu @ 2014-08-22 10:01 UTC (permalink / raw)
  To: Zhang Haoyu, Razya Ladelsky, Christian Borntraeger, mashirle,
	Jason Wang, Michael S.Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, mst, netdev, virtualization, Yossi Kuperman1

>>> > 
>>> > Results:
>>> > 
>>> > Netperf, 1 vm:
>>> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
>>> > Number of exits/sec decreased 6x.
>>> > The same improvement was shown when I tested with 3 vms running netperf
>>> > (4086 MB/sec -> 5545 MB/sec).
>>> > 
>>> > filebench, 1 vm:
>>> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
>>> > 31%.
>>> > The same experiment with 3 vms running filebench showed similar numbers.
>>> > 
>>> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
>>> 
>>> Gave it a quick try on s390/kvm. As expected it makes no difference 
>>> for big streaming workload like iperf.
>>> uperf with a 1-1 round robin got indeed faster by about 30%.
>>> The high CPU consumption is something that bothers me though, as 
>>> virtualized systems tend to be full.
>>> 
>>> 
>>
>>Thanks for confirming the results!
>>The best way to use this patch would be along with a shared vhost thread
>>for multiple devices/vms, as described in:
>>http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/479e3578ed05bfac85257b4200427735!OpenDocument
>>This work assumes having a dedicated I/O core where the vhost thread
>>serves multiple vms, which makes the high cpu utilization less of a concern.
>>
>Hi, Razya, Shirley
>I am going to test the combination of
>"several (depending on the total number of CPUs on the host, e.g., total_number * 1/3) vhost threads serving all VMs" and "vhost: add polling mode".
>I now have the patch "http://thread.gmane.org/gmane.comp.emulators.kvm.devel/88682/focus=88723" posted by Shirley;
>is there any update to this patch?
>
>Also, I want to make a small change to this patch: create total_cpu_number * 1/N (N={3,4}) vhost threads instead of a per-cpu vhost thread to serve all VMs.
This is similar to the Xen netback threads, whose number equals num_online_cpus on Dom0,
but for a KVM host I think a per-cpu vhost thread is too many.
>Any ideas?
>
>Thanks,
>Zhang Haoyu
>>
>>
>>> > +static int poll_start_rate = 0;
>>> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
>>> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
>>> > +
>>> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
>>> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
>>> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>>> 
>>> This seems ridiculously high. Even one jiffy is an eternity, so
>>> setting it to 1 as a default would reduce the CPU overhead for most cases.
>>> If we don't have a packet in one millisecond, we can surely go back
>>> to the kick approach, I think.
>>> 
>>> Christian
>>> 
>>
>>Good point, will reduce it and recheck.
>>Thank you,
>>Razya
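
The thread-count proposal above (total_cpu_number * 1/N with N = 3 or 4,
instead of one vhost thread per CPU) comes down to simple arithmetic. A
minimal sketch follows; shared_vhost_threads() is a hypothetical name, and
both the rounding down and the floor of one thread are assumptions made here,
not something the proposal specifies.

```c
/* Hypothetical helper: number of shared vhost worker threads when creating
 * one thread per `divisor` host CPUs. Rounds down, but never returns fewer
 * than one thread (both choices are assumptions of this sketch). */
static unsigned int shared_vhost_threads(unsigned int ncpus,
					 unsigned int divisor)
{
	unsigned int n;

	if (divisor == 0)
		return 1;
	n = ncpus / divisor;
	return n ? n : 1;
}
```

For example, a 12-CPU host with N = 3 would get 4 shared threads, compared
with 12 under a per-CPU scheme.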


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH] vhost: Add polling mode
  2014-08-21 14:29         ` David Laight
@ 2014-08-24 12:26           ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-08-24 12:26 UTC (permalink / raw)
  To: David Laight
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	linux-kernel, Michael S. Tsirkin, netdev, virtualization,
	Yossi Kuperman1

David Laight <David.Laight@ACULAB.COM> wrote on 21/08/2014 05:29:41 PM:

> From: David Laight <David.Laight@ACULAB.COM>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, "Michael S. Tsirkin" 
<mst@redhat.com>
> Cc: "abel.gordon@gmail.com" <abel.gordon@gmail.com>, Alex Glikson/
> Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/
> IBM@IBMIL, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "linux-
> kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, 
> "netdev@vger.kernel.org" <netdev@vger.kernel.org>, 
> "virtualization@lists.linux-foundation.org" 
> <virtualization@lists.linux-foundation.org>, Yossi 
Kuperman1/Haifa/IBM@IBMIL
> Date: 21/08/2014 05:31 PM
> Subject: RE: [PATCH] vhost: Add polling mode
> 
> From: Razya Ladelsky
> > "Michael S. Tsirkin" <mst@redhat.com> wrote on 20/08/2014 01:57:10 PM:
> > 
> > > > Results:
> > > >
> > > > Netperf, 1 vm:
> > > > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > > > Number of exits/sec decreased 6x.
> > > > The same improvement was shown when I tested with 3 vms running netperf
> > > > (4086 MB/sec -> 5545 MB/sec).
> > > >
> > > > filebench, 1 vm:
> > > > ops/sec improved by 13% with the polling patch. Number of exits
> > > > was reduced by 31%.
> > > > The same experiment with 3 vms running filebench showed similar numbers.
> > > >
> > > > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> > >
> > > This really needs a more thorough benchmarking report, including
> > > system data.  One good example for a related patch:
> > > http://lwn.net/Articles/551179/
> > > though for virtualization, we need data about host as well, and if you
> > > want to look at streaming benchmarks, you need to test different message
> > > sizes and measure packet size.
> > >
> > 
> > Hi Michael,
> > I have already tried running netperf with several message sizes:
> > 64,128,256,512,600,800...
> > But the results are inconsistent even in the baseline/unpatched
> > configuration.
> > For smaller msg sizes, I get consistent numbers. However, at some point,
> > when I increase the msg size
> > I get unstable results. For example, for a 512B msg, I get two scenarios:
> > vm utilization 100%, vhost utilization 75%, throughput ~6300
> > vm utilization 80%, vhost utilization 13%, throughput ~9400 (line rate)
> > 
> > I don't know why vhost is behaving that way for certain message sizes.
> > Do you have any insight to why this is happening?
> 
> Have you tried looking at the actual ethernet packet sizes.
> It may well jump between using small packets (the size of the writes)
> and full sized ones.

I will check it,
Thanks,
Razya

> 
> If you are trying to measure ethernet packet 'cost' you need to use UDP.
> However that probably uses different code paths.
> 
>    David
> 
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-10 19:45     ` Michael S. Tsirkin
                       ` (2 preceding siblings ...)
  (?)
@ 2016-09-04  8:45     ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2016-09-04  8:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex_Glikson/Haifa/IBM%IBMIL,
	Eran_Raichstein/Haifa/IBM%IBMIL, Joel_Nider/Haifa/IBM%IBMIL, kvm,
	linux-kernel, netdev, virtualization,
	Yossi_Kuperman1/Haifa/IBM%IBMIL

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/08/2014 10:45:59 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, 
> Cc: kvm@vger.kernel.org, Alex Glikson/Haifa/IBM@IBMIL, Eran 
> Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel 
> Nider/Haifa/IBM@IBMIL, abel.gordon@gmail.com, linux-
> kernel@vger.kernel.org, netdev@vger.kernel.org, 
> virtualization@lists.linux-foundation.org
> Date: 10/08/2014 10:45 PM
> Subject: Re: [PATCH] vhost: Add polling mode
> 
> On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> > From: Razya Ladelsky <razya@il.ibm.com>
> > Date: Thu, 31 Jul 2014 09:47:20 +0300
> > Subject: [PATCH] vhost: Add polling mode
> > 
> > When vhost is waiting for buffers from the guest driver (e.g., more packets to
> > send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> > guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> > (and possibly userspace involvement in translating this PIO exit into a file
> > descriptor event), all of which hurts performance.
> > 
> > If the system is under-utilized (has cpu time to spare), vhost can continuously
> > poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled via a
> > kernel module parameter, "poll_start_rate".
> > 
> > When polling is active for a virtqueue, the guest is asked to disable
> > notification (kicks), and the worker thread continuously checks for new buffers.
> > When it does discover new buffers, it simulates a "kick" by invoking the
> > underlying backend driver (such as vhost-net), which thinks it got a real kick
> > from the guest, and acts accordingly. If the underlying driver asks not to be
> > kicked, we disable polling on this virtqueue.
> > 
> > We start polling on a virtqueue when we notice it has work to do. Polling on
> > this virtqueue is later disabled after 3 seconds of polling turning up no new
> > work, as in this case we are better off returning to the exit-based notification
> > mechanism. The default timeout of 3 seconds can be changed with the
> > "poll_stop_idle" kernel module parameter.
> > 
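
The start/stop rule described in the paragraph above can be modeled in
userspace roughly as follows. This is a sketch and not the patch's code:
struct vqpoll_model, after(), model_work_done() and model_poll_idle() are
illustrative names, and the idle check uses a wraparound-safe comparison in
the style of the kernel's time_after(), whereas the patch compares jiffies
values directly.

```c
#include <stdbool.h>

/* Userspace model of the per-virtqueue polling state (illustrative only). */
struct vqpoll_model {
	unsigned long jiffies_last_work; /* tick of the most recent work item */
	unsigned long jiffies_last_kick; /* tick when polling last found work */
	unsigned int work_this_jiffy;    /* work items seen in the current tick */
	bool enabled;                    /* true while polling mode is on */
};

/* Wraparound-safe "a is later than b" for free-running tick counters. */
static bool after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Account one work item at tick `now`; turn polling on once the number of
 * items within a single tick reaches poll_start_rate (0 means never). */
static void model_work_done(struct vqpoll_model *m, unsigned long now,
			    unsigned int poll_start_rate)
{
	if (m->jiffies_last_work != now) {
		m->jiffies_last_work = now;
		m->work_this_jiffy = 0;
	}
	m->work_this_jiffy++;
	if (poll_start_rate && !m->enabled &&
	    m->work_this_jiffy >= poll_start_rate) {
		m->enabled = true;
		m->jiffies_last_kick = now;
	}
}

/* A poll at tick `now` found nothing; turn polling off once more than
 * poll_stop_idle ticks have passed since polling last found work. */
static void model_poll_idle(struct vqpoll_model *m, unsigned long now,
			    unsigned long poll_stop_idle)
{
	if (m->enabled && after(now, m->jiffies_last_kick + poll_stop_idle))
		m->enabled = false;
}
```

In this model, a poll_start_rate of 2 enables polling on the second work item
that arrives within the same tick, and polling stays on until an idle poll
observes that more than poll_stop_idle ticks have elapsed.
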
> > This polling approach makes a lot of sense for new HW with posted-interrupts for
> > which we have exitless host-to-guest notifications. But even with support for
> > posted interrupts, guest-to-host communication still causes exits. Polling adds
> > the missing part.
> > 
> > When systems are overloaded, there won't be enough cpu time for the various
> > vhost threads to poll their guests' devices. For these scenarios, we plan to add
> > support for vhost threads that can be shared by multiple devices, even of
> > multiple vms.
> > Our ultimate goal is to implement the I/O acceleration features described in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> > 
> > I ran some experiments with TCP stream netperf and filebench (having 2 threads
> > performing random reads) benchmarks on an IBM System x3650 M4.
> > I have two machines, A and B. A hosts the vms, B runs the netserver.
> > The vms (on A) run netperf, its destination server is running on B.
> > All runs loaded the guests in a way that they were (cpu) saturated. For example,
> > I ran netperf with 64B messages, which is heavily loading the vm (which is why
> > its throughput is low).
> > The idea was to get it 100% loaded, so we can see that the polling is getting it
> > to produce higher throughput.
> 
> And, did your tests actually produce 100% load on both host CPUs?
> 

The vm indeed utilized 100% cpu, whether polling was enabled or not.
The vhost thread utilized less than 100% (of the other cpu) when polling was
disabled. Enabling polling increased its utilization to 100% (in which case
both cpus were 100% utilized).
 

> > The system had two cores per guest, as to allow for both the vcpu and the vhost
> > thread to run concurrently for maximum throughput (but I didn't pin the threads
> > to specific cores).
> > My experiments were fair in a sense that for both cases, with or without
> > polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> > way). The only difference was whether polling was enabled/disabled.
> > 
> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> > 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> > ---
> >  drivers/vhost/net.c   |    6 +-
> >  drivers/vhost/scsi.c  |    6 +-
> >  drivers/vhost/vhost.c |  245 ++++++++++++++++++++++++++++++++++++
> +++++++++++--
> >  drivers/vhost/vhost.h |   38 +++++++-
> >  4 files changed, 277 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 971a760..558aecb 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >     }
> >     vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
> > 
> > -   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> > -   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> > +   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> > +         vqs[VHOST_NET_VQ_TX]);
> > +   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> > +         vqs[VHOST_NET_VQ_RX]);
> > 
> >     f->private_data = n;
> > 
> > diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> > index 4f4ffa4..665eeeb 100644
> > --- a/drivers/vhost/scsi.c
> > +++ b/drivers/vhost/scsi.c
> > @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> >     if (!vqs)
> >        goto err_vqs;
> > 
> > -   vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> > -   vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> > -
> > +   vhost_work_init(&vs->vs_completion_work, NULL,
> > +                  vhost_scsi_complete_cmd_work);
> > +   vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
> >     vs->vs_events_nr = 0;
> >     vs->vs_events_missed = false;
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c90f437..fbe8174 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -24,9 +24,17 @@
> >  #include <linux/slab.h>
> >  #include <linux/kthread.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >  #include <linux/module.h>
> > 
> >  #include "vhost.h"
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> > 
> >  enum {
> >     VHOST_MEMORY_MAX_NREGIONS = 64,
> > @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> >     return 0;
> >  }
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> > +                     vhost_work_fn_t fn)
> >  {
> >     INIT_LIST_HEAD(&work->node);
> >     work->fn = fn;
> >     init_waitqueue_head(&work->done);
> >     work->flushing = 0;
> >     work->queue_seq = work->done_seq = 0;
> > +   work->vq = vq;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_work_init);
> > 
> >  /* Init poll structure */
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -           unsigned long mask, struct vhost_dev *dev)
> > +           unsigned long mask, struct vhost_virtqueue *vq)
> >  {
> >     init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> >     init_poll_funcptr(&poll->table, vhost_poll_func);
> >     poll->mask = mask;
> > -   poll->dev = dev;
> > +   poll->dev = vq->dev;
> >     poll->wqh = NULL;
> > -
> > -   vhost_work_init(&poll->work, fn);
> > +   vhost_work_init(&poll->work, vq, fn);
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_init);
> > 
> > @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_queue);
> > 
> > +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> > + *
> > + * Enabling this mode tells the guest not to notify ("kick") us when it
> > + * has made more work available on this virtqueue; rather, we will continuously
> > + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> > + * the worker thread polls them all, e.g., in a round-robin fashion.
> > + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> > + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> > + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> > + *
> > + * It is assumed that these functions are called relatively rarely, when vhost
> > + * notices that this virtqueue's usage pattern significantly changed in a way
> > + * that makes polling more efficient than notification, or vice versa.
> > + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> > + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> > + * reclaimed.
> > + */
> > +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   if (vq->vqpoll.enabled)
> > +      return; /* already enabled, nothing to do */
> > +   if (!vq->handle_kick)
> > +      return; /* polling will be a waste of time if no callback! */
> > +   if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> > +      /* vq has guest notifications enabled. Disable them,
> > +         and instead add vq to the polling list */
> > +      vhost_disable_notify(vq->dev, vq);
> > +      list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> > +   }
> > +   vq->vqpoll.jiffies_last_kick = jiffies;
> > +   __get_user(vq->avail_idx, &vq->avail->idx);
> > +   vq->vqpoll.enabled = true;
> > +
> > +   /* Map userspace's vq->avail to the kernel's memory space. */
> > +   if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> > +      &vq->vqpoll.avail_page) != 1) {
> > +      /* TODO: can this happen, as we check access
> > +      to vq->avail in advance? */
> > +      BUG();
> > +   }
> > +   vq->vqpoll.avail_mapped = (struct vring_avail *) (
> > +      (unsigned long)kmap(vq->vqpoll.avail_page) |
> > +      ((unsigned long)vq->avail & ~PAGE_MASK));
> > +}
> > +
> > +/*
> > + * This function doesn't always succeed in changing the mode. Sometimes
> > + * a temporary race condition prevents turning on guest notifications, so
> > + * vq should be polled next time again.
> > + */
> > +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   if (!vq->vqpoll.enabled)
> > +      return; /* already disabled, nothing to do */
> > +
> > +   vq->vqpoll.enabled = false;
> > +
> > +   if (!list_empty(&vq->vqpoll.link)) {
> > +      /* vq is on the polling list, remove it from this list and
> > +       * instead enable guest notifications. */
> > +      list_del_init(&vq->vqpoll.link);
> > +      if (unlikely(vhost_enable_notify(vq->dev, vq))
> > +         && !vq->vqpoll.shutdown) {
> > +         /* Race condition: guest wrote before we enabled
> > +          * notification, so we'll never get a notification for
> > +          * this work - so continue polling mode for a while. */
> > +         vhost_disable_notify(vq->dev, vq);
> > +         vq->vqpoll.enabled = true;
> > +         vhost_enable_notify(vq->dev, vq);
> > +         return;
> > +      }
> > +   }
> > +
> > +   if (vq->vqpoll.avail_mapped) {
> > +      kunmap(vq->vqpoll.avail_page);
> > +      put_page(vq->vqpoll.avail_page);
> > +      vq->vqpoll.avail_mapped = 0;
> > +   }
> > +}
> > +
> >  static void vhost_vq_reset(struct vhost_dev *dev,
> >              struct vhost_virtqueue *vq)
> >  {
> > @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >     vq->call = NULL;
> >     vq->log_ctx = NULL;
> >     vq->memory = NULL;
> > +   INIT_LIST_HEAD(&vq->vqpoll.link);
> > +   vq->vqpoll.enabled = false;
> > +   vq->vqpoll.shutdown = false;
> > +   vq->vqpoll.avail_mapped = NULL;
> > +}
> > +
> > +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> > + * virtqueues which the caller should kick, or NULL in case none should be
> > + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> > + * been polled for too long without success.
> > + *
> > + * This current implementation (the "round-robin" implementation) only
> > + * polls the first vq in the list, returning it or NULL as appropriate, and
> > + * moves this vq to the end of the list, so next time a different one is
> > + * polled.
> > + */
> > +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> > +{
> > +   struct vhost_virtqueue *vq;
> > +   u16 avail_idx;
> > +
> > +   if (list_empty(list))
> > +      return NULL;
> > +
> > +   vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> > +   WARN_ON(!vq->vqpoll.enabled);
> > +   list_move_tail(&vq->vqpoll.link, list);
> > +
> > +   /* See if there is any new work available from the guest. */
> > +   /* TODO: can check the optional idx feature, and if we haven't
> > +   * reached that idx yet, don't kick... */
> > +   avail_idx = vq->vqpoll.avail_mapped->idx;
> > +   if (avail_idx != vq->last_avail_idx)
> > +      return vq;
> > +
> > +   if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> > +      /* We've been polling this virtqueue for a long time with no
> > +      * results, so switch back to guest notification
> > +      */
> > +      vhost_vq_disable_vqpoll(vq);
> > +   }
> > +   return NULL;
> >  }
> > 
> >  static int vhost_worker(void *data)
> > @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
> >        spin_unlock_irq(&dev->work_lock);
> > 
> >        if (work) {
> > +         struct vhost_virtqueue *vq = work->vq;
> >           __set_current_state(TASK_RUNNING);
> >           work->fn(work);
> > +         /* Keep track of the work rate, for deciding when to
> > +          * enable polling */
> > +         if (vq) {
> > +            if (vq->vqpoll.jiffies_last_work != jiffies) {
> > +               vq->vqpoll.jiffies_last_work = jiffies;
> > +               vq->vqpoll.work_this_jiffy = 0;
> > +            }
> > +            vq->vqpoll.work_this_jiffy++;
> > +         }
> > +         /* If vq is in the round-robin list of virtqueues being
> > +          * constantly checked by this thread, move vq the end
> > +          * of the queue, because it had its fair chance now.
> > +          */
> > +         if (vq && !list_empty(&vq->vqpoll.link)) {
> > +            list_move_tail(&vq->vqpoll.link,
> > +               &dev->vqpoll_list);
> > +         }
> > +         /* Otherwise, if this vq is looking for notifications
> > +          * but vq polling is not enabled for it, do it now.
> > +          */
> > +         else if (poll_start_rate && vq && vq->handle_kick &&
> > +            !vq->vqpoll.enabled &&
> > +            !vq->vqpoll.shutdown &&
> > +            !(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> > +            vq->vqpoll.work_this_jiffy >=
> > +               poll_start_rate) {
> > +            vhost_vq_enable_vqpoll(vq);
> > +         }
> > +      }
> > +      /* Check one virtqueue from the round-robin list */
> > +      if (!list_empty(&dev->vqpoll_list)) {
> > +         struct vhost_virtqueue *vq;
> > +
> > +         vq = roundrobin_poll(&dev->vqpoll_list);
> > +
> > +         if (vq) {
> > +            vq->handle_kick(&vq->poll.work);
> > +            vq->vqpoll.jiffies_last_kick = jiffies;
> > +         }
> > +
> > +         /* If our polling list isn't empty, ask to continue
> > +          * running this thread, don't yield.
> > +          */
> > +         __set_current_state(TASK_RUNNING);
> >           if (need_resched())
> >              schedule();
> > -      } else
> > -         schedule();
> > +      } else {
> > +         if (work) {
> > +            if (need_resched())
> > +               schedule();
> > +         } else
> > +            schedule();
> > +      }
> > 
> >     }
> >     unuse_mm(dev->mm);
> > @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >     dev->mm = NULL;
> >     spin_lock_init(&dev->work_lock);
> >     INIT_LIST_HEAD(&dev->work_list);
> > +   INIT_LIST_HEAD(&dev->vqpoll_list);
> >     dev->worker = NULL;
> > 
> >     for (i = 0; i < dev->nvqs; ++i) {
> > @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >        vhost_vq_reset(dev, vq);
> >        if (vq->handle_kick)
> >           vhost_poll_init(&vq->poll, vq->handle_kick,
> > -               POLLIN, dev);
> > +               POLLIN, vq);
> >     }
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_init);
> > @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
> >     struct vhost_attach_cgroups_struct attach;
> > 
> >     attach.owner = current;
> > -   vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> > +   vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
> >     vhost_work_queue(dev, &attach.work);
> >     vhost_work_flush(dev, &attach.work);
> >     return attach.ret;
> > @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_stop);
> > 
> > +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> > + * mode for a given virtqueue which is itself being shut down. We ask the
> > + * worker thread to do this rather than doing it directly, so that we don't
> > + * race with the worker thread's use of the queue.
> > + */
> > +static void shutdown_vqpoll_work(struct vhost_work *work)
> > +{
> > +   work->vq->vqpoll.shutdown = true;
> > +   vhost_vq_disable_vqpoll(work->vq);
> > +   WARN_ON(work->vq->vqpoll.avail_mapped);
> > +}
> > +
> > +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   struct vhost_work work;
> > +
> > +   vhost_work_init(&work, vq, shutdown_vqpoll_work);
> > +   vhost_work_queue(vq->dev, &work);
> > +   vhost_work_flush(vq->dev, &work);
> > +}
> >  /* Caller should have device mutex if and only if locked is set */
> >  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> >  {
> > @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> >           eventfd_ctx_put(dev->vqs[i]->call_ctx);
> >        if (dev->vqs[i]->call)
> >           fput(dev->vqs[i]->call);
> > +      shutdown_vqpoll(dev->vqs[i]);
> >        vhost_vq_reset(dev, dev->vqs[i]);
> >     }
> >     vhost_dev_free_iovecs(dev);
> > @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >     u16 avail_idx;
> >     int r;
> > 
> > +   /* In polling mode, when the backend (e.g., net.c) asks to enable
> > +    * notifications, we don't enable guest notifications. Instead, start
> > +    * polling on this vq by adding it to the round-robin list.
> > +    */
> > +   if (vq->vqpoll.enabled) {
> > +      if (list_empty(&vq->vqpoll.link)) {
> > +         list_add_tail(&vq->vqpoll.link,
> > +            &vq->dev->vqpoll_list);
> > +         vq->vqpoll.jiffies_last_kick = jiffies;
> > +      }
> > +      return false;
> > +   }
> > +
> >     if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> >        return false;
> >     vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> > @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >  {
> >     int r;
> > 
> > +   /* If this virtqueue is vqpoll.enabled, and on the polling list, it
> > +    * will generate notifications even if the guest is asked not to send
> > +    * them. So we must remove it from the round-robin polling list.
> > +    * Note that vqpoll.enabled remains set.
> > +    */
> > +   if (vq->vqpoll.enabled) {
> > +      if (!list_empty(&vq->vqpoll.link))
> > +         list_del_init(&vq->vqpoll.link);
> > +      return;
> > +   }
> > +
> >     if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> >        return;
> >     vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 3eda654..11aaaf4 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -24,6 +24,7 @@ struct vhost_work {
> >     int           flushing;
> >     unsigned        queue_seq;
> >     unsigned        done_seq;
> > +   struct vhost_virtqueue    *vq;
> >  };
> > 
> >  /* Poll a file (eventfd or socket) */
> > @@ -37,11 +38,12 @@ struct vhost_poll {
> >     struct vhost_dev    *dev;
> >  };
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> > +                     vhost_work_fn_t fn);
> >  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> > 
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -           unsigned long mask, struct vhost_dev *dev);
> > +           unsigned long mask, struct vhost_virtqueue  *vq);
> >  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >  void vhost_poll_stop(struct vhost_poll *poll);
> >  void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -54,8 +56,6 @@ struct vhost_log {
> >     u64 len;
> >  };
> > 
> > -struct vhost_virtqueue;
> > -
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >     struct vhost_dev *dev;
> > @@ -110,6 +110,35 @@ struct vhost_virtqueue {
> >     /* Log write descriptors */
> >     void __user *log_base;
> >     struct vhost_log *log;
> > +   struct {
> > +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> > +       * that instead of using guest notifications (kicks) to
> > +       * discover new work, we prefer to continuously poll this
> > +       * virtqueue in the worker thread.
> > +       * If !enabled, the rest of the fields below are undefined.
> > +       */
> > +      bool enabled;
> > +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> > +       * actually being polled: The backend (e.g., net.c) may
> > +       * temporarily disable it using vhost_disable/enable_notify().
> > +       * vqpoll.link is used to maintain the thread's round-robin
> > +       * list of virtqueues that actually need to be polled.
> > +       * Note list_empty(link) means this virtqueue isn't polled.
> > +       */
> > +      struct list_head link;
> > +      /* If this flag is true, the virtqueue is being shut down,
> > +       * so vqpoll should not be re-enabled.
> > +       */
> > +      bool shutdown;
> > +      /* Various counters used to decide when to enter polling mode
> > +       * or leave it and return to notification mode.
> > +       */
> > +      unsigned long jiffies_last_kick;
> > +      unsigned long jiffies_last_work;
> > +      int work_this_jiffy;
> > +      struct page *avail_page;
> > +      volatile struct vring_avail *avail_mapped;
> > +   } vqpoll;
> >  };
> > 
> >  struct vhost_dev {
> > @@ -123,6 +152,7 @@ struct vhost_dev {
> >     spinlock_t work_lock;
> >     struct list_head work_list;
> >     struct task_struct *worker;
> > +   struct list_head vqpoll_list;
> >  };
> > 
> >  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> > -- 
> > 1.7.9.5
> 
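
For readers following the quoted vhost_worker() hunk: the start heuristic
counts the work items handled within the current jiffy and enables polling
once that count reaches poll_start_rate. Below is a minimal user-space sketch
of just that counter, assuming a plain tick value in place of jiffies.
struct rate_tracker and rate_tracker_record() are hypothetical names used for
illustration only and are not part of the patch; the other conditions the
patch checks (handle_kick, vqpoll.shutdown, VRING_USED_F_NO_NOTIFY) are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the vqpoll rate counters. */
struct rate_tracker {
	unsigned long last_work_tick; /* mirrors vqpoll.jiffies_last_work */
	int work_this_tick;           /* mirrors vqpoll.work_this_jiffy */
};

/*
 * Record one handled work item at time `now` (in ticks).
 * Returns true when the number of items seen in the current tick has
 * reached start_rate. A start_rate of 0 means "never start polling".
 */
static bool rate_tracker_record(struct rate_tracker *t, unsigned long now,
				int start_rate)
{
	assert(t != NULL);

	/* A new tick has begun: restart the per-tick count. */
	if (t->last_work_tick != now) {
		t->last_work_tick = now;
		t->work_this_tick = 0;
	}
	t->work_this_tick++;

	return start_rate && t->work_this_tick >= start_rate;
}
```

With start_rate set to 3, the third item recorded within the same tick makes
the function return true; the count starts over as soon as the tick changes.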

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-08-10 19:45     ` Michael S. Tsirkin
  (?)
  (?)
@ 2016-09-04  8:45     ` Razya Ladelsky
  -1 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2016-09-04  8:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Yossi_Kuperman1/Haifa/IBM%IBMIL, kvm, Joel_Nider/Haifa/IBM%IBMIL,
	netdev, linux-kernel, abel.gordon, Alex_Glikson/Haifa/IBM%IBMIL,
	Eran_Raichstein/Haifa/IBM%IBMIL, virtualization

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/08/2014 10:45:59 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, 
> Cc: kvm@vger.kernel.org, Alex Glikson/Haifa/IBM@IBMIL, Eran 
> Raichstein/Haifa/IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Joel 
> Nider/Haifa/IBM@IBMIL, abel.gordon@gmail.com, linux-
> kernel@vger.kernel.org, netdev@vger.kernel.org, 
> virtualization@lists.linux-foundation.org
> Date: 10/08/2014 10:45 PM
> Subject: Re: [PATCH] vhost: Add polling mode
> 
> On Sun, Aug 10, 2014 at 11:30:35AM +0300, Razya Ladelsky wrote:
> > From: Razya Ladelsky <razya@il.ibm.com>
> > Date: Thu, 31 Jul 2014 09:47:20 +0300
> > Subject: [PATCH] vhost: Add polling mode
> > 
> > When vhost is waiting for buffers from the guest driver (e.g., more packets to
> > send in vhost-net's transmit queue), it normally goes to sleep and waits for the
> > guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit
> > (and possibly userspace involvement in translating this PIO exit into a file
> > descriptor event), all of which hurts performance.
> > 
> > If the system is under-utilized (has cpu time to spare), vhost can continuously
> > poll the virtqueues for new buffers, and avoid asking the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled via a
> > kernel module parameter, "poll_start_rate".
> > 
> > When polling is active for a virtqueue, the guest is asked to disable
> > notification (kicks), and the worker thread continuously checks for new buffers.
> > When it does discover new buffers, it simulates a "kick" by invoking the
> > underlying backend driver (such as vhost-net), which thinks it got a real kick
> > from the guest, and acts accordingly. If the underlying driver asks not to be
> > kicked, we disable polling on this virtqueue.
> > 
> > We start polling on a virtqueue when we notice it has work to do. Polling on
> > this virtqueue is later disabled after 3 seconds of polling turning up no new
> > work, as in this case we are better off returning to the exit-based notification
> > mechanism. The default timeout of 3 seconds can be changed with the
> > "poll_stop_idle" kernel module parameter.
> > 
> > This polling approach makes a lot of sense for new HW with posted-interrupts for
> > which we have exitless host-to-guest notifications. But even with support for
> > posted interrupts, guest-to-host communication still causes exits. Polling adds
> > the missing part.
> > 
> > When systems are overloaded, there won't be enough cpu time for the various
> > vhost threads to poll their guests' devices. For these scenarios, we plan to add
> > support for vhost threads that can be shared by multiple devices, even of
> > multiple vms.
> > Our ultimate goal is to implement the I/O acceleration features described in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> > 
> > I ran some experiments with TCP stream netperf and filebench (having 2 threads
> > performing random reads) benchmarks on an IBM System x3650 M4.
> > I have two machines, A and B. A hosts the vms, B runs the netserver.
> > The vms (on A) run netperf, its destination server is running on B.
> > All runs loaded the guests in a way that they were (cpu) saturated. For example,
> > I ran netperf with 64B messages, which is heavily loading the vm (which is why
> > its throughput is low).
> > The idea was to get it 100% loaded, so we can see that the polling is getting it
> > to produce higher throughput.
> 
> And, did your tests actually produce 100% load on both host CPUs?
> 

The vm indeed utilized 100% cpu, whether polling was enabled or not.
The vhost thread utilized less than 100% (of the other cpu) when polling
was disabled.
Enabling polling increased its utilization to 100% (in which case both
cpus were 100% utilized).

> > The system had two cores per guest, as to allow for both the vcpu and the vhost
> > thread to run concurrently for maximum throughput (but I didn't pin the threads
> > to specific cores).
> > My experiments were fair in a sense that for both cases, with or without
> > polling, I run both threads, vcpu and vhost, on 2 cores (set their affinity that
> > way). The only difference was whether polling was enabled/disabled.
> > 
> > Results:
> > 
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33% (1516 MB/sec -> 2046 MB/sec).
> > Number of exits/sec decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf
> > (4086 MB/sec -> 5545 MB/sec).
> > 
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits was reduced by
> > 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> > ---
> >  drivers/vhost/net.c   |    6 +-
> >  drivers/vhost/scsi.c  |    6 +-
> >  drivers/vhost/vhost.c |  245 ++++++++++++++++++++++++++++++++++++
> +++++++++++--
> >  drivers/vhost/vhost.h |   38 +++++++-
> >  4 files changed, 277 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 971a760..558aecb 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
> >     }
> >     vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
> > 
> > -   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
> > -   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
> > +   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> > +         vqs[VHOST_NET_VQ_TX]);
> > +   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> > +         vqs[VHOST_NET_VQ_RX]);
> > 
> >     f->private_data = n;
> > 
> > diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> > index 4f4ffa4..665eeeb 100644
> > --- a/drivers/vhost/scsi.c
> > +++ b/drivers/vhost/scsi.c
> > @@ -1528,9 +1528,9 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
> >     if (!vqs)
> >        goto err_vqs;
> > 
> > -   vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
> > -   vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> > -
> > +   vhost_work_init(&vs->vs_completion_work, NULL,
> > +                  vhost_scsi_complete_cmd_work);
> > +   vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
> >     vs->vs_events_nr = 0;
> >     vs->vs_events_missed = false;
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c90f437..fbe8174 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -24,9 +24,17 @@
> >  #include <linux/slab.h>
> >  #include <linux/kthread.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >  #include <linux/module.h>
> > 
> >  #include "vhost.h"
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> > 
> >  enum {
> >     VHOST_MEMORY_MAX_NREGIONS = 64,
> > @@ -58,27 +66,28 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> >     return 0;
> >  }
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> > +                     vhost_work_fn_t fn)
> >  {
> >     INIT_LIST_HEAD(&work->node);
> >     work->fn = fn;
> >     init_waitqueue_head(&work->done);
> >     work->flushing = 0;
> >     work->queue_seq = work->done_seq = 0;
> > +   work->vq = vq;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_work_init);
> > 
> >  /* Init poll structure */
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -           unsigned long mask, struct vhost_dev *dev)
> > +           unsigned long mask, struct vhost_virtqueue *vq)
> >  {
> >     init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> >     init_poll_funcptr(&poll->table, vhost_poll_func);
> >     poll->mask = mask;
> > -   poll->dev = dev;
> > +   poll->dev = vq->dev;
> >     poll->wqh = NULL;
> > -
> > -   vhost_work_init(&poll->work, fn);
> > +   vhost_work_init(&poll->work, vq, fn);
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_init);
> > 
> > @@ -174,6 +183,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_queue);
> > 
> > +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> > + *
> > + * Enabling this mode tells the guest not to notify ("kick") us when it
> > + * has made more work available on this virtqueue; rather, we will continuously
> > + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> > + * the worker thread polls them all, e.g., in a round-robin fashion.
> > + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> > + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> > + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> > + *
> > + * It is assumed that these functions are called relatively rarely, when vhost
> > + * notices that this virtqueue's usage pattern significantly changed in a way
> > + * that makes polling more efficient than notification, or vice versa.
> > + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> > + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> > + * reclaimed.
> > + */
> > +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   if (vq->vqpoll.enabled)
> > +      return; /* already enabled, nothing to do */
> > +   if (!vq->handle_kick)
> > +      return; /* polling will be a waste of time if no callback! */
> > +   if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> > +      /* vq has guest notifications enabled. Disable them,
> > +         and instead add vq to the polling list */
> > +      vhost_disable_notify(vq->dev, vq);
> > +      list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> > +   }
> > +   vq->vqpoll.jiffies_last_kick = jiffies;
> > +   __get_user(vq->avail_idx, &vq->avail->idx);
> > +   vq->vqpoll.enabled = true;
> > +
> > +   /* Map userspace's vq->avail to the kernel's memory space. */
> > +   if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> > +      &vq->vqpoll.avail_page) != 1) {
> > +      /* TODO: can this happen, as we check access
> > +      to vq->avail in advance? */
> > +      BUG();
> > +   }
> > +   vq->vqpoll.avail_mapped = (struct vring_avail *) (
> > +      (unsigned long)kmap(vq->vqpoll.avail_page) |
> > +      ((unsigned long)vq->avail & ~PAGE_MASK));
> > +}
> > +
> > +/*
> > + * This function doesn't always succeed in changing the mode. Sometimes
> > + * a temporary race condition prevents turning on guest notifications, so
> > + * vq should be polled next time again.
> > + */
> > +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   if (!vq->vqpoll.enabled)
> > +      return; /* already disabled, nothing to do */
> > +
> > +   vq->vqpoll.enabled = false;
> > +
> > +   if (!list_empty(&vq->vqpoll.link)) {
> > +      /* vq is on the polling list, remove it from this list and
> > +       * instead enable guest notifications. */
> > +      list_del_init(&vq->vqpoll.link);
> > +      if (unlikely(vhost_enable_notify(vq->dev, vq))
> > +         && !vq->vqpoll.shutdown) {
> > +         /* Race condition: guest wrote before we enabled
> > +          * notification, so we'll never get a notification for
> > +          * this work - so continue polling mode for a while. */
> > +         vhost_disable_notify(vq->dev, vq);
> > +         vq->vqpoll.enabled = true;
> > +         vhost_enable_notify(vq->dev, vq);
> > +         return;
> > +      }
> > +   }
> > +
> > +   if (vq->vqpoll.avail_mapped) {
> > +      kunmap(vq->vqpoll.avail_page);
> > +      put_page(vq->vqpoll.avail_page);
> > +      vq->vqpoll.avail_mapped = 0;
> > +   }
> > +}
> > +
> >  static void vhost_vq_reset(struct vhost_dev *dev,
> >              struct vhost_virtqueue *vq)
> >  {
> > @@ -199,6 +288,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >     vq->call = NULL;
> >     vq->log_ctx = NULL;
> >     vq->memory = NULL;
> > +   INIT_LIST_HEAD(&vq->vqpoll.link);
> > +   vq->vqpoll.enabled = false;
> > +   vq->vqpoll.shutdown = false;
> > +   vq->vqpoll.avail_mapped = NULL;
> > +}
> > +
> > +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> > + * virtqueues which the caller should kick, or NULL in case none should be
> > + * kicked. roundrobin_poll() also disables polling on a virtqueue which has
> > + * been polled for too long without success.
> > + *
> > + * This current implementation (the "round-robin" implementation) only
> > + * polls the first vq in the list, returning it or NULL as appropriate, and
> > + * moves this vq to the end of the list, so next time a different one is
> > + * polled.
> > + */
> > +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list)
> > +{
> > +   struct vhost_virtqueue *vq;
> > +   u16 avail_idx;
> > +
> > +   if (list_empty(list))
> > +      return NULL;
> > +
> > +   vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> > +   WARN_ON(!vq->vqpoll.enabled);
> > +   list_move_tail(&vq->vqpoll.link, list);
> > +
> > +   /* See if there is any new work available from the guest. */
> > +   /* TODO: can check the optional idx feature, and if we haven't
> > +   * reached that idx yet, don't kick... */
> > +   avail_idx = vq->vqpoll.avail_mapped->idx;
> > +   if (avail_idx != vq->last_avail_idx)
> > +      return vq;
> > +
> > +   if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> > +      /* We've been polling this virtqueue for a long time with no
> > +      * results, so switch back to guest notification
> > +      */
> > +      vhost_vq_disable_vqpoll(vq);
> > +   }
> > +   return NULL;
> >  }
> > 
> >  static int vhost_worker(void *data)
> > @@ -237,12 +368,62 @@ static int vhost_worker(void *data)
> >        spin_unlock_irq(&dev->work_lock);
> > 
> >        if (work) {
> > +         struct vhost_virtqueue *vq = work->vq;
> >           __set_current_state(TASK_RUNNING);
> >           work->fn(work);
> > +         /* Keep track of the work rate, for deciding when to
> > +          * enable polling */
> > +         if (vq) {
> > +            if (vq->vqpoll.jiffies_last_work != jiffies) {
> > +               vq->vqpoll.jiffies_last_work = jiffies;
> > +               vq->vqpoll.work_this_jiffy = 0;
> > +            }
> > +            vq->vqpoll.work_this_jiffy++;
> > +         }
> > +         /* If vq is in the round-robin list of virtqueues being
> > +          * constantly checked by this thread, move vq to the end
> > +          * of the queue, because it had its fair chance now.
> > +          */
> > +         if (vq && !list_empty(&vq->vqpoll.link)) {
> > +            list_move_tail(&vq->vqpoll.link,
> > +               &dev->vqpoll_list);
> > +         }
> > +         /* Otherwise, if this vq is looking for notifications
> > +          * but vq polling is not enabled for it, do it now.
> > +          */
> > +         else if (poll_start_rate && vq && vq->handle_kick &&
> > +            !vq->vqpoll.enabled &&
> > +            !vq->vqpoll.shutdown &&
> > +            !(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
> > +            vq->vqpoll.work_this_jiffy >=
> > +               poll_start_rate) {
> > +            vhost_vq_enable_vqpoll(vq);
> > +         }
> > +      }
> > +      /* Check one virtqueue from the round-robin list */
> > +      if (!list_empty(&dev->vqpoll_list)) {
> > +         struct vhost_virtqueue *vq;
> > +
> > +         vq = roundrobin_poll(&dev->vqpoll_list);
> > +
> > +         if (vq) {
> > +            vq->handle_kick(&vq->poll.work);
> > +            vq->vqpoll.jiffies_last_kick = jiffies;
> > +         }
> > +
> > +         /* If our polling list isn't empty, ask to continue
> > +          * running this thread, don't yield.
> > +          */
> > +         __set_current_state(TASK_RUNNING);
> >           if (need_resched())
> >              schedule();
> > -      } else
> > -         schedule();
> > +      } else {
> > +         if (work) {
> > +            if (need_resched())
> > +               schedule();
> > +         } else
> > +            schedule();
> > +      }
> > 
> >     }
> >     unuse_mm(dev->mm);
> > @@ -306,6 +487,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >     dev->mm = NULL;
> >     spin_lock_init(&dev->work_lock);
> >     INIT_LIST_HEAD(&dev->work_list);
> > +   INIT_LIST_HEAD(&dev->vqpoll_list);
> >     dev->worker = NULL;
> > 
> >     for (i = 0; i < dev->nvqs; ++i) {
> > @@ -318,7 +500,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >        vhost_vq_reset(dev, vq);
> >        if (vq->handle_kick)
> >           vhost_poll_init(&vq->poll, vq->handle_kick,
> > -               POLLIN, dev);
> > +               POLLIN, vq);
> >     }
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_init);
> > @@ -350,7 +532,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
> >     struct vhost_attach_cgroups_struct attach;
> > 
> >     attach.owner = current;
> > -   vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> > +   vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
> >     vhost_work_queue(dev, &attach.work);
> >     vhost_work_flush(dev, &attach.work);
> >     return attach.ret;
> > @@ -444,6 +626,26 @@ void vhost_dev_stop(struct vhost_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_stop);
> > 
> > +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
> > + * mode for a given virtqueue which is itself being shut down. We ask the
> > + * worker thread to do this rather than doing it directly, so that we don't
> > + * race with the worker thread's use of the queue.
> > + */
> > +static void shutdown_vqpoll_work(struct vhost_work *work)
> > +{
> > +   work->vq->vqpoll.shutdown = true;
> > +   vhost_vq_disable_vqpoll(work->vq);
> > +   WARN_ON(work->vq->vqpoll.avail_mapped);
> > +}
> > +
> > +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +   struct vhost_work work;
> > +
> > +   vhost_work_init(&work, vq, shutdown_vqpoll_work);
> > +   vhost_work_queue(vq->dev, &work);
> > +   vhost_work_flush(vq->dev, &work);
> > +}
> >  /* Caller should have device mutex if and only if locked is set */
> >  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> >  {
> > @@ -460,6 +662,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> >           eventfd_ctx_put(dev->vqs[i]->call_ctx);
> >        if (dev->vqs[i]->call)
> >           fput(dev->vqs[i]->call);
> > +      shutdown_vqpoll(dev->vqs[i]);
> >        vhost_vq_reset(dev, dev->vqs[i]);
> >     }
> >     vhost_dev_free_iovecs(dev);
> > @@ -1491,6 +1694,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >     u16 avail_idx;
> >     int r;
> > 
> > +   /* In polling mode, when the backend (e.g., net.c) asks to enable
> > +    * notifications, we don't enable guest notifications. Instead, start
> > +    * polling on this vq by adding it to the round-robin list.
> > +    */
> > +   if (vq->vqpoll.enabled) {
> > +      if (list_empty(&vq->vqpoll.link)) {
> > +         list_add_tail(&vq->vqpoll.link,
> > +            &vq->dev->vqpoll_list);
> > +         vq->vqpoll.jiffies_last_kick = jiffies;
> > +      }
> > +      return false;
> > +   }
> > +
> >     if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> >        return false;
> >     vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> > @@ -1528,6 +1744,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >  {
> >     int r;
> > 
> > +   /* If this virtqueue is vqpoll.enabled, and on the polling list, it
> > +    * will generate notifications even if the guest is asked not to send
> > +    * them. So we must remove it from the round-robin polling list.
> > +    * Note that vqpoll.enabled remains set.
> > +    */
> > +   if (vq->vqpoll.enabled) {
> > +      if (!list_empty(&vq->vqpoll.link))
> > +         list_del_init(&vq->vqpoll.link);
> > +      return;
> > +   }
> > +
> >     if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> >        return;
> >     vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 3eda654..11aaaf4 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -24,6 +24,7 @@ struct vhost_work {
> >     int           flushing;
> >     unsigned        queue_seq;
> >     unsigned        done_seq;
> > +   struct vhost_virtqueue    *vq;
> >  };
> > 
> >  /* Poll a file (eventfd or socket) */
> > @@ -37,11 +38,12 @@ struct vhost_poll {
> >     struct vhost_dev    *dev;
> >  };
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq,
> > +                     vhost_work_fn_t fn);
> >  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> > 
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -           unsigned long mask, struct vhost_dev *dev);
> > +           unsigned long mask, struct vhost_virtqueue  *vq);
> >  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >  void vhost_poll_stop(struct vhost_poll *poll);
> >  void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -54,8 +56,6 @@ struct vhost_log {
> >     u64 len;
> >  };
> > 
> > -struct vhost_virtqueue;
> > -
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >     struct vhost_dev *dev;
> > @@ -110,6 +110,35 @@ struct vhost_virtqueue {
> >     /* Log write descriptors */
> >     void __user *log_base;
> >     struct vhost_log *log;
> > +   struct {
> > +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> > +       * that instead of using guest notifications (kicks) to
> > +       * discover new work, we prefer to continuously poll this
> > +       * virtqueue in the worker thread.
> > +       * If !enabled, the rest of the fields below are undefined.
> > +       */
> > +      bool enabled;
> > +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> > +       * actually being polled: The backend (e.g., net.c) may
> > +       * temporarily disable it using vhost_disable/enable_notify().
> > +       * vqpoll.link is used to maintain the thread's round-robin
> > +       * list of virtqueues that actually need to be polled.
> > +       * Note list_empty(link) means this virtqueue isn't polled.
> > +       */
> > +      struct list_head link;
> > +      /* If this flag is true, the virtqueue is being shut down,
> > +       * so vqpoll should not be re-enabled.
> > +       */
> > +      bool shutdown;
> > +      /* Various counters used to decide when to enter polling mode
> > +       * or leave it and return to notification mode.
> > +       */
> > +      unsigned long jiffies_last_kick;
> > +      unsigned long jiffies_last_work;
> > +      int work_this_jiffy;
> > +      struct page *avail_page;
> > +      volatile struct vring_avail *avail_mapped;
> > +   } vqpoll;
> >  };
> > 
> >  struct vhost_dev {
> > @@ -123,6 +152,7 @@ struct vhost_dev {
> >     spinlock_t work_lock;
> >     struct list_head work_list;
> >     struct task_struct *worker;
> > +   struct list_head vqpoll_list;
> >  };
> > 
> >  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> > -- 
> > 1.7.9.5
> 
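
For readers following the vqpoll counters quoted above, here is a minimal
userspace sketch of how they could drive the switch between notification
mode and polling mode. It is only a model under assumptions, not the patch
itself: struct vqpoll_model, tick_after(), vqpoll_note_work() and
vqpoll_note_idle() are hypothetical names, and the idle check uses a
wraparound-safe signed difference (the idea behind the kernel's
time_after()) where the patch compares jiffies with a plain '>'.

```c
#include <stdbool.h>

/* Userspace model of the per-virtqueue counters in the vqpoll struct. */
struct vqpoll_model {
	bool enabled;                    /* true while in polling mode */
	unsigned long jiffies_last_kick; /* last time polling found work */
	unsigned long jiffies_last_work; /* jiffy of the last work item */
	int work_this_jiffy;             /* work items seen in that jiffy */
};

/* Wraparound-safe "a is later than b" for free-running tick counters. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Account one completed work item at time `now`. Polling starts once the
 * number of items in the current jiffy reaches poll_start_rate; a rate of
 * 0 means polling is never started. */
void vqpoll_note_work(struct vqpoll_model *p, unsigned long now,
		      int poll_start_rate)
{
	if (p->jiffies_last_work != now) {
		p->jiffies_last_work = now;
		p->work_this_jiffy = 0;
	}
	p->work_this_jiffy++;

	if (poll_start_rate && !p->enabled &&
	    p->work_this_jiffy >= poll_start_rate) {
		p->enabled = true;
		p->jiffies_last_kick = now;
	}
}

/* Called when a poll at time `now` finds no new buffers. Polling stops once
 * more than poll_stop_idle jiffies have passed since the last kick. */
void vqpoll_note_idle(struct vqpoll_model *p, unsigned long now,
		      unsigned long poll_stop_idle)
{
	if (p->enabled &&
	    tick_after(now, p->jiffies_last_kick + poll_stop_idle))
		p->enabled = false;
}
```

With poll_start_rate = 2, two work items in the same jiffy turn polling on;
a poll that finds nothing more than poll_stop_idle jiffies after the last
kick turns it off again.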

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29 12:40         ` Michael S. Tsirkin
@ 2014-07-30  6:32           ` Razya Ladelsky
  0 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-30  6:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	kvm-owner, Yossi Kuperman1

kvm-owner@vger.kernel.org wrote on 29/07/2014 03:40:18 PM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, 
> Cc: abel.gordon@gmail.com, Alex Glikson/Haifa/IBM@IBMIL, Eran 
> Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> kvm@vger.kernel.org, kvm-owner@vger.kernel.org, Yossi Kuperman1/
> Haifa/IBM@IBMIL
> Date: 29/07/2014 03:40 PM
> Subject: Re: [PATCH] vhost: Add polling mode
> Sent by: kvm-owner@vger.kernel.org
> 
> On Tue, Jul 29, 2014 at 03:23:59PM +0300, Razya Ladelsky wrote:
> > > 
> > > Hmm there aren't a lot of numbers there :(. Speed increased by 33% but
> > > by how much?  E.g. maybe you are getting from 1Mbyte/sec to 1.3,
> > > if so it's hard to get excited about it. 
> > 
> > Netperf 1 VM: 1516 MB/sec -> 2046 MB/sec
> > and for 3 VMs: 4086 MB/sec -> 5545 MB/sec
> 
> What do you mean by 1 VM? Streaming TCP host to vm?
> Also, your throughput is somewhat low, it's worth seeing
> why you can't hit higher speeds.
> 

My configuration is this:
I have two machines, A and B.
A hosts the VMs, B runs the netserver.
One VM (on A) runs netperf, and its destination server runs on B.

I ran netperf with 64B messages, which loads the VM heavily; that is why
its throughput is low.
The idea was to get the VM 100% loaded, so that we can see polling
producing higher throughput.
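
To put the quoted numbers in relative terms, here is a small sketch that
computes the gain for both runs. gain_percent() and print_reported_gains()
are hypothetical helpers; the inputs are the netperf figures quoted above.

```c
#include <stdio.h>

/* Relative throughput gain, in percent, going from `before` to `after`.
 * The units cancel out, so any consistent throughput unit works. */
double gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the gain for the two netperf results quoted in this thread. */
void print_reported_gains(void)
{
	printf("1 VM:  %.1f%%\n", gain_percent(1516.0, 2046.0));
	printf("3 VMs: %.1f%%\n", gain_percent(4086.0, 5545.0));
}
```

Both runs come out at roughly 35%.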



> > > Some questions that come to
> > > mind: what was the message size? I would expect several measurements
> > > with different values.  How did host CPU utilization change?
> > > 
> > 
> > message size  was 64B in order to get the VM to be cpu saturated. 
> > so vcpu had 99% cpu and vhost 38%, with the polling patch both had 99%.
> 
> Hmm so a net loss in throughput/CPU.
> 

Actually, my experiments were fair in the sense that in both cases,
with or without polling, I ran both threads, vcpu and vhost, on 2 cores
(I set their affinity that way).
The only difference was whether polling was enabled or disabled.


> > 
> > 
> > > What about latency? As we are competing with guest for host CPU,
> > > would worst-case or average latency suffer?
> > > 
> > 
> > Polling indeed doesn't make a lot of sense if there aren't enough 
> > available cores.
> > In these cases polling should not be used.
> > 
> > Thank you,
> > Razya
> 
> OK but scheduler might run vm and vhost on the same cpu
> even if cores are available.
> This needs to be detected somehow and polling disabled.
> 
> 
> > 
> > 
> > > Thanks,
> > > 
> > > -- 
> > > MST
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29 12:23       ` Razya Ladelsky
@ 2014-07-29 12:40         ` Michael S. Tsirkin
  2014-07-30  6:32           ` Razya Ladelsky
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-07-29 12:40 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	kvm-owner, Yossi Kuperman1

On Tue, Jul 29, 2014 at 03:23:59PM +0300, Razya Ladelsky wrote:
> > 
> > Hmm there aren't a lot of numbers there :(. Speed increased by 33% but
> > by how much?  E.g. maybe you are getting from 1Mbyte/sec to 1.3,
> > if so it's hard to get excited about it. 
> 
> Netperf 1 VM: 1516 MB/sec -> 2046 MB/sec
> and for 3 VMs: 4086 MB/sec -> 5545 MB/sec

What do you mean by 1 VM? Streaming TCP host to vm?
Also, your throughput is somewhat low, it's worth seeing
why you can't hit higher speeds.

> > Some questions that come to
> > mind: what was the message size? I would expect several measurements
> > with different values.  How did host CPU utilization change?
> > 
> 
> message size  was 64B in order to get the VM to be cpu saturated. 
> so vcpu had 99% cpu and vhost 38%, with the polling patch both had 99%.

Hmm so a net loss in throughput/CPU.
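
As a rough check of that remark, here is a sketch that divides throughput by
the combined CPU use of the two threads. It assumes the quoted "99% and 38%"
refers to the vcpu thread and the vhost thread respectively, and uses the
1-VM netperf figures from above; throughput_per_cpu() is a hypothetical
helper.

```c
/* Throughput obtained per percentage point of host CPU consumed by the
 * vcpu thread and the vhost thread together. */
double throughput_per_cpu(double mb_per_sec, double vcpu_cpu_pct,
			  double vhost_cpu_pct)
{
	return mb_per_sec / (vcpu_cpu_pct + vhost_cpu_pct);
}
```

Under these assumptions, throughput_per_cpu(1516, 99, 38) is about 11.1
MB/sec per CPU percent without polling, and throughput_per_cpu(2046, 99, 99)
is about 10.3 with polling.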

> 
> 
> > What about latency? As we are competing with guest for host CPU,
> > would worst-case or average latency suffer?
> > 
> 
> Polling indeed doesn't make a lot of sense if there aren't enough 
> available cores.
> In these cases polling should not be used.
> 
> Thank you,
> Razya

OK but scheduler might run vm and vhost on the same cpu
even if cores are available.
This needs to be detected somehow and polling disabled.
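
One way to at least observe this from userspace, sketched under assumptions:
field 39 ("processor") of /proc/<pid>/task/<tid>/stat reports the CPU a
thread last ran on, so comparing it for the vhost thread and the vcpu thread
shows whether they are sharing a CPU. last_cpu_from_stat() and
threads_share_cpu() are hypothetical helpers that parse a stat line already
read into memory; this is not something the patch does, and an in-kernel
check would look different.

```c
#include <stdlib.h>
#include <string.h>

/* Extract field 39 ("processor", the CPU the task last ran on) from one
 * line in /proc/<pid>/stat format. Returns -1 if the line is too short or
 * malformed. */
int last_cpu_from_stat(const char *stat_line)
{
	/* The command name (field 2) is wrapped in parentheses and may itself
	 * contain spaces, so start counting after the last ')'. */
	const char *p = strrchr(stat_line, ')');
	int field = 2;

	if (!p)
		return -1;
	p++;

	while (*p) {
		while (*p == ' ')
			p++;
		if (!*p)
			break;
		field++;
		if (field == 39)
			return (int)strtol(p, NULL, 10);
		while (*p && *p != ' ')
			p++;
	}
	return -1;
}

/* Two threads are treated as sharing a CPU if both last ran on the same
 * one. Returns 1 if so, 0 otherwise (including on parse errors). */
int threads_share_cpu(const char *vhost_stat, const char *vcpu_stat)
{
	int a = last_cpu_from_stat(vhost_stat);
	int b = last_cpu_from_stat(vcpu_stat);

	return a >= 0 && a == b;
}
```

This is only a snapshot: the scheduler can migrate either thread right after
the files are read, so a real policy would have to sample repeatedly.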


> 
> 
> > Thanks,
> > 
> > -- 
> > MST
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29 10:44     ` Michael S. Tsirkin
@ 2014-07-29 12:23       ` Razya Ladelsky
  2014-07-29 12:40         ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-29 12:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	kvm-owner, Yossi Kuperman1

> 
> Hmm there aren't a lot of numbers there :(. Speed increased by 33% but
> by how much?  E.g. maybe you are getting from 1Mbyte/sec to 1.3,
> if so it's hard to get excited about it. 

Netperf 1 VM: 1516 MB/sec -> 2046 MB/sec
and for 3 VMs: 4086 MB/sec -> 5545 MB/sec

> Some questions that come to
> mind: what was the message size? I would expect several measurements
> with different values.  How did host CPU utilization change?
> 

message size  was 64B in order to get the VM to be cpu saturated. 
so vcpu had 99% cpu and vhost 38%, with the polling patch both had 99%.



> What about latency? As we are competing with guest for host CPU,
> would worst-case or average latency suffer?
> 

Polling indeed doesn't make a lot of sense if there aren't enough 
available cores.
In these cases polling should not be used.

Thank you,
Razya



> Thanks,
> 
> -- 
> MST
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29 10:30   ` Razya Ladelsky
@ 2014-07-29 10:44     ` Michael S. Tsirkin
  2014-07-29 12:23       ` Razya Ladelsky
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-07-29 10:44 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Yossi Kuperman1

On Tue, Jul 29, 2014 at 01:30:03PM +0300, Razya Ladelsky wrote:

[..] had to snip off the quoted text, it's mangled up to
     unreadability.
Please take a look at Documentation/email-clients.txt
to fix this.

> > This is an optimization patch, isn't it?
> > Could you please include some numbers showing its
> > effect?
> > 
> > 
> 
> Hi Michael,
> Sure. I included them in a reply to Jason Wang in this thread,
> Here it is:
> http://www.spinics.net/linux/lists/kvm/msg106049.html
> 

Hmm there aren't a lot of numbers there :(. Speed increased by 33% but
by how much?  E.g. maybe you are getting from 1Mbyte/sec to 1.3,
if so it's hard to get excited about it. Some questions that come to
mind: what was the message size? I would expect several measurements
with different values.  How did host CPU utilization change?

What about latency? As we are competing with guest for host CPU,
would worst-case or average latency suffer?

Thanks,

-- 
MST

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29  8:06 ` Michael S. Tsirkin
@ 2014-07-29 10:30   ` Razya Ladelsky
  2014-07-29 10:44     ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-29 10:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Yossi Kuperman1

"Michael S. Tsirkin" <mst@redhat.com> wrote on 29/07/2014 11:06:40 AM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, 
> Cc: kvm@vger.kernel.org, abel.gordon@gmail.com, Joel Nider/Haifa/
> IBM@IBMIL, Yossi Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/
> IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL
> Date: 29/07/2014 11:06 AM
> Subject: Re: [PATCH] vhost: Add polling mode
> 
> On Mon, Jul 21, 2014 at 04:23:44PM +0300, Razya Ladelsky wrote:
> > Hello All,
> > 
> > When vhost is waiting for buffers from the guest driver (e.g., more packets
> > to send in vhost-net's transmit queue), it normally goes to sleep and waits
> > for the guest to "kick" it. This kick involves a PIO in the guest, and
> > therefore an exit (and possibly userspace involvement in translating this PIO
> > exit into a file descriptor event), all of which hurts performance.
> > 
> > If the system is under-utilized (has cpu time to spare), vhost can 
> > continuously poll the virtqueues for new buffers, and avoid asking 
> > the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled 
> > via a kernel module parameter, "poll_start_rate".
> > 
> > When polling is active for a virtqueue, the guest is asked to
> > disable notification (kicks), and the worker thread continuously checks for
> > new buffers. When it does discover new buffers, it simulates a "kick" by
> > invoking the underlying backend driver (such as vhost-net), which thinks it
> > got a real kick from the guest, and acts accordingly. If the underlying
> > driver asks not to be kicked, we disable polling on this virtqueue.
> > 
> > We start polling on a virtqueue when we notice it has
> > work to do. Polling on this virtqueue is later disabled after 3 seconds of
> > polling turning up no new work, as in this case we are better off returning
> > to the exit-based notification mechanism. The default timeout of 3 seconds
> > can be changed with the "poll_stop_idle" kernel module parameter.
> > 
> > This polling approach makes a lot of sense for new HW with posted-interrupts
> > for which we have exitless host-to-guest notifications. But even with support
> > for posted interrupts, guest-to-host communication still causes exits.
> > Polling adds the missing part.
> > 
> > When systems are overloaded, there won't be enough cpu time for the various
> > vhost threads to poll their guests' devices. For these scenarios, we plan
> > to add support for vhost threads that can be shared by multiple devices,
> > even of multiple vms.
> > Our ultimate goal is to implement the I/O acceleration features described in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) 
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> > 
> > 
> > Comments are welcome, 
> > Thank you,
> > Razya
> > 
> > From: Razya Ladelsky <razya@il.ibm.com>
> > 
> > Add an optional polling mode to continuously poll the virtqueues
> > for new buffers, and avoid asking the guest to kick us.
> > 
> > Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> 
> This is an optimization patch, isn't it?
> Could you please include some numbers showing its
> effect?
> 
> 

Hi Michael,
Sure. I included them in a reply to Jason Wang in this thread,
Here it is:
http://www.spinics.net/linux/lists/kvm/msg106049.html




> > ---
> >  drivers/vhost/net.c   |    6 +-
> >  drivers/vhost/scsi.c  |    5 +-
> >  drivers/vhost/vhost.c |  247 
> > +++++++++++++++++++++++++++++++++++++++++++++++--
> >  drivers/vhost/vhost.h |   37 +++++++-
> >  4 files changed, 277 insertions(+), 18 deletions(-)
> 
> 
> Whitespace seems mangled to the point of making patch
> unreadable. Can you pls repost?
> 

Sure.

> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 971a760..558aecb 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, 
struct 
> > file *f)
> >         }
> >         vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
> > 
> > -       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, 
POLLOUT, 
> > dev);
> > -       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, 
POLLIN, 
> > dev);
> > +       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, 
POLLOUT,
> > +                       vqs[VHOST_NET_VQ_TX]);
> > +       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, 
POLLIN,
> > +                       vqs[VHOST_NET_VQ_RX]);
> > 
> >         f->private_data = n;
> > 
> > diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> > index 4f4ffa4..56f0233 100644
> > --- a/drivers/vhost/scsi.c
> > +++ b/drivers/vhost/scsi.c
> > @@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, 
> > struct file *f)
> >         if (!vqs)
> >                 goto err_vqs;
> > 
> > -       vhost_work_init(&vs->vs_completion_work, 
> > vhost_scsi_complete_cmd_work);
> > -       vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> > -
> > +       vhost_work_init(&vs->vs_completion_work, NULL, 
> > vhost_scsi_complete_cmd_work);
> > +       vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
> >         vs->vs_events_nr = 0;
> >         vs->vs_events_missed = false;
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c90f437..678d766 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -24,9 +24,17 @@
> >  #include <linux/slab.h>
> >  #include <linux/kthread.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >  #include <linux/module.h>
> > 
> >  #include "vhost.h"
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of 
virtqueue 
> > when rate of events is at least this number per jiffy. If 0, never 
start 
> > polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of 
virtqueue 
> > after this many jiffies of no work.");
> > 
> >  enum {
> >         VHOST_MEMORY_MAX_NREGIONS = 64,
> > @@ -58,27 +66,27 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
> > unsigned mode, int sync,
> >         return 0;
> >  }
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue 
*vq, 
> > vhost_work_fn_t fn)
> >  {
> >         INIT_LIST_HEAD(&work->node);
> >         work->fn = fn;
> >         init_waitqueue_head(&work->done);
> >         work->flushing = 0;
> >         work->queue_seq = work->done_seq = 0;
> > +       work->vq = vq;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_work_init);
> > 
> >  /* Init poll structure */
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -                    unsigned long mask, struct vhost_dev *dev)
> > +                    unsigned long mask, struct vhost_virtqueue *vq)
> >  {
> >         init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
> >         init_poll_funcptr(&poll->table, vhost_poll_func);
> >         poll->mask = mask;
> > -       poll->dev = dev;
> > +       poll->dev = vq->dev;
> >         poll->wqh = NULL;
> > -
> > -       vhost_work_init(&poll->work, fn);
> > +       vhost_work_init(&poll->work, vq, fn);
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_init);
> > 
> > @@ -174,6 +182,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_poll_queue);
> > 
> > +/* Enable or disable virtqueue polling (vqpoll.enabled) for a 
virtqueue.
> > + *
> > + * Enabling this mode it tells the guest not to notify ("kick") us 
when 
> > its
> > + * has made more work available on this virtqueue; Rather, we will 
> > continuously
> > + * poll this virtqueue in the worker thread. If multiple virtqueues 
are 
> > polled,
> > + * the worker thread polls them all, e.g., in a round-robin fashion.
> > + * Note that vqpoll.enabled doesn't always mean that this virtqueue 
is
> > + * actually being polled: The backend (e.g., net.c) may temporarily 
> > disable it
> > + * using vhost_disable/enable_notify(), while vqpoll.enabled is 
> > unchanged.
> > + *
> > + * It is assumed that these functions are called relatively rarely, 
when 
> > vhost
> > + * notices that this virtqueue's usage pattern significantly changed 
in a 
> > way
> > + * that makes polling more efficient than notification, or vice 
versa.
> > + * Also, we assume that vhost_vq_disable_vqpoll() is always called on 
vq
> > + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can 
be
> > + * reclaimed.
> > + */
> > +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +       if (vq->vqpoll.enabled)
> > +               return; /* already enabled, nothing to do */
> > +       if (!vq->handle_kick)
> > +               return; /* polling will be a waste of time if no 
callback! 
> > */
> > +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> > +               /* vq has guest notifications enabled. Disable them,
> > +                  and instead add vq to the polling list */
> > +               vhost_disable_notify(vq->dev, vq);
> > +               list_add_tail(&vq->vqpoll.link, 
&vq->dev->vqpoll_list);
> > +       }
> > +       vq->vqpoll.jiffies_last_kick = jiffies;
> > +       __get_user(vq->avail_idx, &vq->avail->idx); 
> > +       vq->vqpoll.enabled = true;
> > +
> > +       /* Map userspace's vq->avail to the kernel's memory space. */
> > +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> > +               &vq->vqpoll.avail_page) != 1) {
> > +               /* TODO: can this happen, as we check access
> > +               to vq->avail in advance? */
> > +               BUG();
> > +       }
> > +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
> > +               (unsigned long)kmap(vq->vqpoll.avail_page) |
> > +               ((unsigned long)vq->avail & ~PAGE_MASK));
> > +}
> > +
> > +/*
> > + * This function doesn't always succeed in changing the mode. 
Sometimes
> > + * a temporary race condition prevents turning on guest 
notifications, so
> > + * vq should be polled next time again.
> > + */
> > +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +       if (!vq->vqpoll.enabled) {
> > +               return; /* already disabled, nothing to do */
> > +       }
> > +       vq->vqpoll.enabled = false;
> > +
> > +       if (!list_empty(&vq->vqpoll.link)) {
> > +               /* vq is on the polling list, remove it from this list 
and
> > +                * instead enable guest notifications. */
> > +               list_del_init(&vq->vqpoll.link);
> > +               if (unlikely(vhost_enable_notify(vq->dev,vq))
> > +                       && !vq->vqpoll.shutdown) {
> > +                       /* Race condition: guest wrote before we 
enabled
> > +                        * notification, so we'll never get a 
notification 
> > for
> > +                        * this work - so continue polling mode for a 
> > while. */
> > +                       vhost_disable_notify(vq->dev, vq);
> > +                       vq->vqpoll.enabled = true;
> > +                       vhost_enable_notify(vq->dev, vq);
> > +                       return;
> > +               }
> > +       }
> > +
> > +       if (vq->vqpoll.avail_mapped) {
> > +               kunmap(vq->vqpoll.avail_page);
> > +               put_page(vq->vqpoll.avail_page);
> > +               vq->vqpoll.avail_mapped = 0;
> > +       }
> > +}
> > +
> >  static void vhost_vq_reset(struct vhost_dev *dev,
> >                            struct vhost_virtqueue *vq)
> >  {
> > @@ -199,6 +287,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> >         vq->call = NULL;
> >         vq->log_ctx = NULL;
> >         vq->memory = NULL;
> > +       INIT_LIST_HEAD(&vq->vqpoll.link);
> > +       vq->vqpoll.enabled = false;
> > +       vq->vqpoll.shutdown = false;
> > +       vq->vqpoll.avail_mapped = NULL;
> > +}
> > +
> > +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of 
the
> > + * virtqueues which the caller should kick, or NULL in case none 
should 
> > be
> > + * kicked. roundrobin_poll() also disables polling on a virtqueue 
which 
> > has
> > + * been polled for too long without success.
> > + *
> > + * This current implementation (the "round-robin" implementation) 
only
> > + * polls the first vq in the list, returning it or NULL as 
appropriate, 
> > and
> > + * moves this vq to the end of the list, so next time a different one 
is
> > + * polled.
> > + */
> > +static struct vhost_virtqueue *roundrobin_poll(struct list_head 
*list) {
> > +       struct vhost_virtqueue *vq;
> > +       u16 avail_idx;
> > +
> > +
> > +       if (list_empty(list))
> > +               return NULL;
> > +
> > +       vq = list_first_entry(list, struct vhost_virtqueue, 
vqpoll.link);
> > +       WARN_ON(!vq->vqpoll.enabled);
> > +       list_move_tail(&vq->vqpoll.link, list);
> > +
> > +       /* See if there is any new work available from the guest. */
> > +       /* TODO: can check the optional idx feature, and if we haven't
> > +        * reached that idx yet, don't kick... */
> > +       avail_idx = vq->vqpoll.avail_mapped->idx;
> > +       if (avail_idx != vq->last_avail_idx) {
> > +               return vq;
> > +       }
> > +       if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> > +               /* We've been polling this virtqueue for a long time 
with 
> > no
> > +                * results, so switch back to guest notification
> > +                */
> > +               vhost_vq_disable_vqpoll(vq);
> > +       }
> > +       return NULL;
> >  }
> > 
> >  static int vhost_worker(void *data)
> > @@ -237,12 +367,66 @@ static int vhost_worker(void *data)
> >                 spin_unlock_irq(&dev->work_lock);
> > 
> >                 if (work) {
> > +                       struct vhost_virtqueue *vq = work->vq;
> >                         __set_current_state(TASK_RUNNING);
> >                         work->fn(work);
> > +                       /* Keep track of the work rate, for deciding 
when 
> > to
> > +                        * enable polling */
> > +                       if (vq) {
> > +                               if (vq->vqpoll.jiffies_last_work != 
> > jiffies) {
> > +                                       vq->vqpoll.jiffies_last_work = 

> > jiffies;
> > +                                       vq->vqpoll.work_this_jiffy = 
0;
> > +                               }
> > +                               vq->vqpoll.work_this_jiffy++;
> > +                       }
> > +                       /* If vq is in the round-robin list of 
virtqueues 
> > being
> > +                        * constantly checked by this thread, move vq 
the 
> > end
> > +                        * of the queue, because it had its fair 
chance 
> > now.
> > +                        */
> > +                       if (vq && !list_empty(&vq->vqpoll.link)) {
> > +                               list_move_tail(&vq->vqpoll.link,
> > +                                       &dev->vqpoll_list);
> > +                       }
> > +                       /* Otherwise, if this vq is looking for 
> > notifications
> > +                        * but vq polling is not enabled for it, do it 

> > now.
> > +                        */
> > +                       else if (poll_start_rate && vq && 
vq->handle_kick 
> > &&
> > +                               !vq->vqpoll.enabled &&
> > +                               !vq->vqpoll.shutdown &&
> > +                               !(vq->used_flags & 
VRING_USED_F_NO_NOTIFY) 
> > &&
> > +                               vq->vqpoll.work_this_jiffy >=
> > +                                       poll_start_rate) {
> > +                               vhost_vq_enable_vqpoll(vq);
> > +                       }
> > +               }
> > +               /* Check one virtqueue from the round-robin list */
> > +               if (!list_empty(&dev->vqpoll_list)) {
> > +                       struct vhost_virtqueue *vq;
> > +
> > +                       vq = roundrobin_poll(&dev->vqpoll_list);
> > +
> > +                       if (vq) {
> > +                               vq->handle_kick(&vq->poll.work);
> > +                               vq->vqpoll.jiffies_last_kick=jiffies;
> > +                       }
> > +
> > +                       /* If our polling list isn't empty, ask to 
> > continue
> > +                        * running this thread, don't yield.
> > +                        */
> > +                       __set_current_state(TASK_RUNNING);
> >                         if (need_resched())
> > +                        schedule();
> > +
> > +               }
> > +               else {
> > +                       if (work)
> > +                       {
> > +                               if (need_resched())
> > +                                       schedule();
> > +                       }
> > +                       else
> >                                 schedule();
> > -               } else
> > -                       schedule();
> > +               }
> > 
> >         }
> >         unuse_mm(dev->mm);
> > @@ -306,6 +490,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >         dev->mm = NULL;
> >         spin_lock_init(&dev->work_lock);
> >         INIT_LIST_HEAD(&dev->work_list);
> > +   INIT_LIST_HEAD(&dev->vqpoll_list);
> >         dev->worker = NULL;
> > 
> >         for (i = 0; i < dev->nvqs; ++i) {
> > @@ -318,7 +503,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >                 vhost_vq_reset(dev, vq);
> >                 if (vq->handle_kick)
> >                         vhost_poll_init(&vq->poll, vq->handle_kick,
> > -                                       POLLIN, dev);
> > +                                       POLLIN, vq);
> >         }
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_init);
> > @@ -350,7 +535,7 @@ static int vhost_attach_cgroups(struct vhost_dev 
*dev)
> >         struct vhost_attach_cgroups_struct attach;
> > 
> >         attach.owner = current;
> > -       vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> > +       vhost_work_init(&attach.work, NULL, 
vhost_attach_cgroups_work);
> >         vhost_work_queue(dev, &attach.work);
> >         vhost_work_flush(dev, &attach.work);
> >         return attach.ret;
> > @@ -444,6 +629,25 @@ void vhost_dev_stop(struct vhost_dev *dev)
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_dev_stop);
> > 
> > +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue 
> > polling
> > + * mode for a given virtqueue which is itself being shut down. We ask 
the
> > + * worker thread to do this rather than doing it directly, so that we 

> > don't
> > + * race with the worker thread's use of the queue.
> > + */
> > +static void shutdown_vqpoll_work(struct vhost_work *work)
> > +{
> > +       work->vq->vqpoll.shutdown = true;
> > +       vhost_vq_disable_vqpoll(work->vq);
> > +       WARN_ON(work->vq->vqpoll.avail_mapped);
> > +}
> > +
> > +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +       struct vhost_work work;
> > +       vhost_work_init(&work, vq, shutdown_vqpoll_work);
> > +       vhost_work_queue(vq->dev, &work);
> > +       vhost_work_flush(vq->dev, &work);
> > +}
> >  /* Caller should have device mutex if and only if locked is set */
> >  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
> >  {
> > @@ -460,6 +664,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool 

> > locked)
> >                         eventfd_ctx_put(dev->vqs[i]->call_ctx);
> >                 if (dev->vqs[i]->call)
> >                         fput(dev->vqs[i]->call);
> > +               shutdown_vqpoll(dev->vqs[i]);
> >                 vhost_vq_reset(dev, dev->vqs[i]);
> >         }
> >         vhost_dev_free_iovecs(dev);
> > @@ -1491,6 +1696,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, 

> > struct vhost_virtqueue *vq)
> >         u16 avail_idx;
> >         int r;
> > 
> > +       /* In polling mode, when the backend (e.g., net.c) asks to 
enable
> > +        * notifications, we don't enable guest notifications. 
Instead, 
> > start
> > +        * polling on this vq by adding it to the round-robin list.
> > +        */
> > +       if (vq->vqpoll.enabled) {
> > +               if (list_empty(&vq->vqpoll.link)) {
> > +                       list_add_tail(&vq->vqpoll.link,
> > +                               &vq->dev->vqpoll_list);
> > +                       vq->vqpoll.jiffies_last_kick = jiffies;
> > +               }
> > +               return false;
> > +       }
> > +
> >         if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
> >                 return false;
> >         vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> > @@ -1528,6 +1746,17 @@ void vhost_disable_notify(struct vhost_dev 
*dev, 
> > struct vhost_virtqueue *vq)
> >  {
> >         int r;
> > 
> > +       /* If this virtqueue is vqpoll.enabled, and on the polling 
list, 
> > it
> > +        * will generate notifications even if the guest is asked not 
to 
> > send
> > +        * them. So we must remove it from the round-robin polling 
list.
> > +        * Note that vqpoll.enabled remains set.
> > +        */
> > +       if (vq->vqpoll.enabled) {
> > +               if(!list_empty(&vq->vqpoll.link))
> > +                       list_del_init(&vq->vqpoll.link);
> > +               return;
> > +       }
> > +
> >         if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
> >                 return;
> >         vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index 3eda654..feb16d6 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -24,6 +24,7 @@ struct vhost_work {
> >         int                       flushing;
> >         unsigned                  queue_seq;
> >         unsigned                  done_seq;
> > +       struct vhost_virtqueue    *vq;
> >  };
> > 
> >  /* Poll a file (eventfd or socket) */
> > @@ -37,11 +38,11 @@ struct vhost_poll {
> >         struct vhost_dev         *dev;
> >  };
> > 
> > -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> > +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue 
*vq, 
> > vhost_work_fn_t fn);
> >  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work 
*work);
> > 
> >  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> > -                    unsigned long mask, struct vhost_dev *dev);
> > +                    unsigned long mask, struct vhost_virtqueue  *vq);
> >  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >  void vhost_poll_stop(struct vhost_poll *poll);
> >  void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -54,8 +55,6 @@ struct vhost_log {
> >         u64 len;
> >  };
> > 
> > -struct vhost_virtqueue;
> > -
> >  /* The virtqueue structure describes a queue attached to a device. */
> >  struct vhost_virtqueue {
> >         struct vhost_dev *dev;
> > @@ -110,6 +109,35 @@ struct vhost_virtqueue {
> >         /* Log write descriptors */
> >         void __user *log_base;
> >         struct vhost_log *log;
> > +   struct {
> > +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> > +       * that instead of using guest notifications (kicks) to
> > +       * discover new work, we prefer to continuously poll this
> > +       * virtqueue in the worker thread.
> > +       * If !enabled, the rest of the fields below are undefined.
> > +       */
> > +      bool enabled;
> > +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> > +       * actually being polled: The backend (e.g., net.c) may
> > +       * temporarily disable it using vhost_disable/enable_notify().
> > +       * vqpoll.link is used to maintain the thread's round-robin
> > +       * list of virtqueues that actually need to be polled.
> > +       * Note list_empty(link) means this virtqueue isn't polled.
> > +       */
> > +      struct list_head link;
> > +      /* If this flag is true, the virtqueue is being shut down,
> > +       * so vqpoll should not be re-enabled.
> > +       */
> > +      bool shutdown;
> > +      /* Various counters used to decide when to enter polling mode
> > +       * or leave it and return to notification mode.
> > +       */
> > +      unsigned long jiffies_last_kick;
> > +      unsigned long jiffies_last_work;
> > +      int work_this_jiffy;
> > +      struct page *avail_page;
> > +      volatile struct vring_avail *avail_mapped;
> > +   } vqpoll;
> >  };
> > 
> >  struct vhost_dev {
> > @@ -123,6 +151,7 @@ struct vhost_dev {
> >         spinlock_t work_lock;
> >         struct list_head work_list;
> >         struct task_struct *worker;
> > +        struct list_head vqpoll_list;
> >  };
> > 
> >  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
> > -- 
> > 1.7.9.5
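
The comments in the quoted struct distinguish two states that are easy to
conflate: a virtqueue can be in polling mode (vqpoll.enabled) without being on
the worker's polling list. The following is a minimal userspace sketch of that
state machine, not the patch's code. struct vq_model and the model_* functions
are hypothetical names; the three booleans stand in for vqpoll.enabled,
!list_empty(&vqpoll.link), and the absence of VRING_USED_F_NO_NOTIFY. The
non-polling path is simplified and ignores the race check that the real
vhost_enable_notify() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one virtqueue; these are not vhost types. */
struct vq_model {
	bool poll_enabled;  /* stands in for vqpoll.enabled */
	bool on_poll_list;  /* stands in for !list_empty(&vqpoll.link) */
	bool guest_kicks;   /* stands in for !(used_flags & VRING_USED_F_NO_NOTIFY) */
};

/* The backend asks to be notified of new buffers again. */
static void model_enable_notify(struct vq_model *vq)
{
	assert(vq != NULL);
	if (vq->poll_enabled) {
		/* Polling mode: resume polling, guest kicks stay off. */
		vq->on_poll_list = true;
		return;
	}
	vq->guest_kicks = true;
}

/* The backend asks not to be notified for now. */
static void model_disable_notify(struct vq_model *vq)
{
	assert(vq != NULL);
	if (vq->poll_enabled) {
		/* Polling mode: stop polling, but poll_enabled stays set. */
		vq->on_poll_list = false;
		return;
	}
	vq->guest_kicks = false;
}
```

Starting from a queue in polling mode (poll_enabled and on_poll_list both
set), model_disable_notify() clears only on_poll_list, and a following
model_enable_notify() puts the queue back on the list without ever turning
guest kicks back on.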


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-21 13:23 Razya Ladelsky
  2014-07-23  5:26 ` Jason Wang
@ 2014-07-29  8:06 ` Michael S. Tsirkin
  2014-07-29 10:30   ` Razya Ladelsky
  1 sibling, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2014-07-29  8:06 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: kvm, abel.gordon, Joel Nider, Yossi Kuperman1, Eran Raichstein,
	Alex Glikson

On Mon, Jul 21, 2014 at 04:23:44PM +0300, Razya Ladelsky wrote:
> Hello All,
> 
> When vhost is waiting for buffers from the guest driver (e.g., more 
> packets
> to send in vhost-net's transmit queue), it normally goes to sleep and 
> waits
> for the guest to "kick" it. This kick involves a PIO in the guest, and
> therefore an exit (and possibly userspace involvement in translating this 
> PIO
> exit into a file descriptor event), all of which hurts performance.
> 
> If the system is under-utilized (has cpu time to spare), vhost can 
> continuously poll the virtqueues for new buffers, and avoid asking 
> the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled 
> via a kernel module parameter, "poll_start_rate".
> 
> When polling is active for a virtqueue, the guest is asked to
> disable notification (kicks), and the worker thread continuously checks 
> for
> new buffers. When it does discover new buffers, it simulates a "kick" by
> invoking the underlying backend driver (such as vhost-net), which thinks 
> it
> got a real kick from the guest, and acts accordingly. If the underlying
> driver asks not to be kicked, we disable polling on this virtqueue.
> 
> We start polling on a virtqueue when we notice it has
> work to do. Polling on this virtqueue is later disabled after 3 seconds of
> polling turning up no new work, as in this case we are better off 
> returning
> to the exit-based notification mechanism. The default timeout of 3 seconds
> can be changed with the "poll_stop_idle" kernel module parameter.
> 
> This polling approach makes a lot of sense for new HW with posted-interrupts
> for which we have exitless host-to-guest notifications. But even with 
> support 
> for posted interrupts, guest-to-host communication still causes exits. 
> Polling adds the missing part.
> 
> When systems are overloaded, there won't be enough cpu time for the 
> various 
> vhost threads to poll their guests' devices. For these scenarios, we plan 
> to add support for vhost threads that can be shared by multiple devices, 
> even of multiple vms. 
> Our ultimate goal is to implement the I/O acceleration features described 
> in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) 
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> 
>  
> Comments are welcome, 
> Thank you,
> Razya
> 
> From: Razya Ladelsky <razya@il.ibm.com>
> 
> Add an optional polling mode to continuously poll the virtqueues
> for new buffers, and avoid asking the guest to kick us.
> 
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>

This is an optimization patch, isn't it?
Could you please include some numbers showing its
effect?


> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    5 +-
>  drivers/vhost/vhost.c |  247 
> +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   37 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)


Whitespace seems mangled to the point of making the patch
unreadable. Can you please repost?

> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct 
> file *f)
>         }
>         vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, 
> dev);
> -       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, 
> dev);
> +       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +                       vqs[VHOST_NET_VQ_TX]);
> +       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +                       vqs[VHOST_NET_VQ_RX]);
>  
>         f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..56f0233 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, 
> struct file *f)
>         if (!vqs)
>                 goto err_vqs;
>  
> -       vhost_work_init(&vs->vs_completion_work, 
> vhost_scsi_complete_cmd_work);
> -       vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +       vhost_work_init(&vs->vs_completion_work, NULL, 
> vhost_scsi_complete_cmd_work);
> +       vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>         vs->vs_events_nr = 0;
>         vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..678d766 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue 
> when rate of events is at least this number per jiffy. If 0, never start 
> polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue 
> after this many jiffies of no work.");
>  
>  enum {
>         VHOST_MEMORY_MAX_NREGIONS = 64,
> @@ -58,27 +66,27 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
> unsigned mode, int sync,
>         return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, 
> vhost_work_fn_t fn)
>  {
>         INIT_LIST_HEAD(&work->node);
>         work->fn = fn;
>         init_waitqueue_head(&work->done);
>         work->flushing = 0;
>         work->queue_seq = work->done_seq = 0;
> +       work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -                    unsigned long mask, struct vhost_dev *dev)
> +                    unsigned long mask, struct vhost_virtqueue *vq)
>  {
>         init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>         init_poll_funcptr(&poll->table, vhost_poll_func);
>         poll->mask = mask;
> -       poll->dev = dev;
> +       poll->dev = vq->dev;
>         poll->wqh = NULL;
> -
> -       vhost_work_init(&poll->work, fn);
> +       vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +182,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode it tells the guest not to notify ("kick") us when 
> its
> + * has made more work available on this virtqueue; Rather, we will 
> continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are 
> polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily 
> disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is 
> unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when 
> vhost
> + * notices that this virtqueue's usage pattern significantly changed in a 
> way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +       if (vq->vqpoll.enabled)
> +               return; /* already enabled, nothing to do */
> +       if (!vq->handle_kick)
> +               return; /* polling will be a waste of time if no callback! 
> */
> +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +               /* vq has guest notifications enabled. Disable them,
> +                  and instead add vq to the polling list */
> +               vhost_disable_notify(vq->dev, vq);
> +               list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> +       }
> +       vq->vqpoll.jiffies_last_kick = jiffies;
> +       __get_user(vq->avail_idx, &vq->avail->idx); 
> +       vq->vqpoll.enabled = true;
> +
> +       /* Map userspace's vq->avail to the kernel's memory space. */
> +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +               &vq->vqpoll.avail_page) != 1) {
> +               /* TODO: can this happen, as we check access
> +               to vq->avail in advance? */
> +               BUG();
> +       }
> +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +               (unsigned long)kmap(vq->vqpoll.avail_page) |
> +               ((unsigned long)vq->avail & ~PAGE_MASK));
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +       if (!vq->vqpoll.enabled) {
> +               return; /* already disabled, nothing to do */
> +       }
> +       vq->vqpoll.enabled = false;
> +
> +       if (!list_empty(&vq->vqpoll.link)) {
> +               /* vq is on the polling list, remove it from this list and
> +                * instead enable guest notifications. */
> +               list_del_init(&vq->vqpoll.link);
> +               if (unlikely(vhost_enable_notify(vq->dev,vq))
> +                       && !vq->vqpoll.shutdown) {
> +                       /* Race condition: guest wrote before we enabled
> +                        * notification, so we'll never get a notification 
> for
> +                        * this work - so continue polling mode for a 
> while. */
> +                       vhost_disable_notify(vq->dev, vq);
> +                       vq->vqpoll.enabled = true;
> +                       vhost_enable_notify(vq->dev, vq);
> +                       return;
> +               }
> +       }
> +
> +       if (vq->vqpoll.avail_mapped) {
> +               kunmap(vq->vqpoll.avail_page);
> +               put_page(vq->vqpoll.avail_page);
> +               vq->vqpoll.avail_mapped = 0;
> +       }
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>                            struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +287,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>         vq->call = NULL;
>         vq->log_ctx = NULL;
>         vq->memory = NULL;
> +       INIT_LIST_HEAD(&vq->vqpoll.link);
> +       vq->vqpoll.enabled = false;
> +       vq->vqpoll.shutdown = false;
> +       vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should 
> be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which 
> has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, 
> and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list) {
> +       struct vhost_virtqueue *vq;
> +       u16 avail_idx;
> +
> +
> +       if (list_empty(list))
> +               return NULL;
> +
> +       vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +       WARN_ON(!vq->vqpoll.enabled);
> +       list_move_tail(&vq->vqpoll.link, list);
> +
> +       /* See if there is any new work available from the guest. */
> +       /* TODO: can check the optional idx feature, and if we haven't
> +        * reached that idx yet, don't kick... */
> +       avail_idx = vq->vqpoll.avail_mapped->idx;
> +       if (avail_idx != vq->last_avail_idx) {
> +               return vq;
> +       }
> +       if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +               /* We've been polling this virtqueue for a long time with 
> no
> +                * results, so switch back to guest notification
> +                */
> +               vhost_vq_disable_vqpoll(vq);
> +       }
> +       return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +367,66 @@ static int vhost_worker(void *data)
>                 spin_unlock_irq(&dev->work_lock);
>  
>                 if (work) {
> +                       struct vhost_virtqueue *vq = work->vq;
>                         __set_current_state(TASK_RUNNING);
>                         work->fn(work);
> +                       /* Keep track of the work rate, for deciding when 
> to
> +                        * enable polling */
> +                       if (vq) {
> +                               if (vq->vqpoll.jiffies_last_work != 
> jiffies) {
> +                                       vq->vqpoll.jiffies_last_work = 
> jiffies;
> +                                       vq->vqpoll.work_this_jiffy = 0;
> +                               }
> +                               vq->vqpoll.work_this_jiffy++;
> +                       }
> +                       /* If vq is in the round-robin list of virtqueues 
> being
> +                        * constantly checked by this thread, move vq the 
> end
> +                        * of the queue, because it had its fair chance 
> now.
> +                        */
> +                       if (vq && !list_empty(&vq->vqpoll.link)) {
> +                               list_move_tail(&vq->vqpoll.link,
> +                                       &dev->vqpoll_list);
> +                       }
> +                       /* Otherwise, if this vq is looking for 
> notifications
> +                        * but vq polling is not enabled for it, do it 
> now.
> +                        */
> +                       else if (poll_start_rate && vq && vq->handle_kick 
> &&
> +                               !vq->vqpoll.enabled &&
> +                               !vq->vqpoll.shutdown &&
> +                               !(vq->used_flags & VRING_USED_F_NO_NOTIFY) 
> &&
> +                               vq->vqpoll.work_this_jiffy >=
> +                                       poll_start_rate) {
> +                               vhost_vq_enable_vqpoll(vq);
> +                       }
> +               }
> +               /* Check one virtqueue from the round-robin list */
> +               if (!list_empty(&dev->vqpoll_list)) {
> +                       struct vhost_virtqueue *vq;
> +
> +                       vq = roundrobin_poll(&dev->vqpoll_list);
> +
> +                       if (vq) {
> +                               vq->handle_kick(&vq->poll.work);
> +                               vq->vqpoll.jiffies_last_kick=jiffies;
> +                       }
> +
> +                       /* If our polling list isn't empty, ask to 
> continue
> +                        * running this thread, don't yield.
> +                        */
> +                       __set_current_state(TASK_RUNNING);
>                         if (need_resched())
> +                        schedule();
> +
> +               }
> +               else {
> +                       if (work)
> +                       {
> +                               if (need_resched())
> +                                       schedule();
> +                       }
> +                       else
>                                 schedule();
> -               } else
> -                       schedule();
> +               }
>  
>         }
>         unuse_mm(dev->mm);
> @@ -306,6 +490,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>         dev->mm = NULL;
>         spin_lock_init(&dev->work_lock);
>         INIT_LIST_HEAD(&dev->work_list);
> +   INIT_LIST_HEAD(&dev->vqpoll_list);
>         dev->worker = NULL;
>  
>         for (i = 0; i < dev->nvqs; ++i) {
> @@ -318,7 +503,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>                 vhost_vq_reset(dev, vq);
>                 if (vq->handle_kick)
>                         vhost_poll_init(&vq->poll, vq->handle_kick,
> -                                       POLLIN, dev);
> +                                       POLLIN, vq);
>         }
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_init);
> @@ -350,7 +535,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
>         struct vhost_attach_cgroups_struct attach;
>  
>         attach.owner = current;
> -       vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> +       vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
>         vhost_work_queue(dev, &attach.work);
>         vhost_work_flush(dev, &attach.work);
>         return attach.ret;
> @@ -444,6 +629,25 @@ void vhost_dev_stop(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_stop);
>  
> +/* shutdown_vqpoll() asks the worker thread to shut down virtqueue 
> polling
> + * mode for a given virtqueue which is itself being shut down. We ask the
> + * worker thread to do this rather than doing it directly, so that we 
> don't
> + * race with the worker thread's use of the queue.
> + */
> +static void shutdown_vqpoll_work(struct vhost_work *work)
> +{
> +       work->vq->vqpoll.shutdown = true;
> +       vhost_vq_disable_vqpoll(work->vq);
> +       WARN_ON(work->vq->vqpoll.avail_mapped);
> +}
> +
> +static void shutdown_vqpoll(struct vhost_virtqueue *vq)
> +{
> +       struct vhost_work work;
> +       vhost_work_init(&work, vq, shutdown_vqpoll_work);
> +       vhost_work_queue(vq->dev, &work);
> +       vhost_work_flush(vq->dev, &work);
> +}
>  /* Caller should have device mutex if and only if locked is set */
>  void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
>  {
> @@ -460,6 +664,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool 
> locked)
>                         eventfd_ctx_put(dev->vqs[i]->call_ctx);
>                 if (dev->vqs[i]->call)
>                         fput(dev->vqs[i]->call);
> +               shutdown_vqpoll(dev->vqs[i]);
>                 vhost_vq_reset(dev, dev->vqs[i]);
>         }
>         vhost_dev_free_iovecs(dev);
> @@ -1491,6 +1696,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
>         u16 avail_idx;
>         int r;
>  
> +       /* In polling mode, when the backend (e.g., net.c) asks to enable
> +        * notifications, we don't enable guest notifications. Instead, 
> start
> +        * polling on this vq by adding it to the round-robin list.
> +        */
> +       if (vq->vqpoll.enabled) {
> +               if (list_empty(&vq->vqpoll.link)) {
> +                       list_add_tail(&vq->vqpoll.link,
> +                               &vq->dev->vqpoll_list);
> +                       vq->vqpoll.jiffies_last_kick = jiffies;
> +               }
> +               return false;
> +       }
> +
>         if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
>                 return false;
>         vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
> @@ -1528,6 +1746,17 @@ void vhost_disable_notify(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
>  {
>         int r;
>  
> +       /* If this virtqueue is vqpoll.enabled, and on the polling list, 
> it
> +        * will generate notifications even if the guest is asked not to 
> send
> +        * them. So we must remove it from the round-robin polling list.
> +        * Note that vqpoll.enabled remains set.
> +        */
> +       if (vq->vqpoll.enabled) {
> +               if(!list_empty(&vq->vqpoll.link))
> +                       list_del_init(&vq->vqpoll.link);
> +               return;
> +       }
> +
>         if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
>                 return;
>         vq->used_flags |= VRING_USED_F_NO_NOTIFY;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3eda654..feb16d6 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -24,6 +24,7 @@ struct vhost_work {
>         int                       flushing;
>         unsigned                  queue_seq;
>         unsigned                  done_seq;
> +       struct vhost_virtqueue    *vq;
>  };
>  
>  /* Poll a file (eventfd or socket) */
> @@ -37,11 +38,11 @@ struct vhost_poll {
>         struct vhost_dev         *dev;
>  };
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, 
> vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -                    unsigned long mask, struct vhost_dev *dev);
> +                    unsigned long mask, struct vhost_virtqueue  *vq);
>  int vhost_poll_start(struct vhost_poll *poll, struct file *file);
>  void vhost_poll_stop(struct vhost_poll *poll);
>  void vhost_poll_flush(struct vhost_poll *poll);
> @@ -54,8 +55,6 @@ struct vhost_log {
>         u64 len;
>  };
>  
> -struct vhost_virtqueue;
> -
>  /* The virtqueue structure describes a queue attached to a device. */
>  struct vhost_virtqueue {
>         struct vhost_dev *dev;
> @@ -110,6 +109,35 @@ struct vhost_virtqueue {
>         /* Log write descriptors */
>         void __user *log_base;
>         struct vhost_log *log;
> +   struct {
> +      /* When a virtqueue is in vqpoll.enabled mode, it declares
> +       * that instead of using guest notifications (kicks) to
> +       * discover new work, we prefer to continuously poll this
> +       * virtqueue in the worker thread.
> +       * If !enabled, the rest of the fields below are undefined.
> +       */
> +      bool enabled;
> +      /* vqpoll.enabled doesn't always mean that this virtqueue is
> +       * actually being polled: The backend (e.g., net.c) may
> +       * temporarily disable it using vhost_disable/enable_notify().
> +       * vqpoll.link is used to maintain the thread's round-robin
> +       * list of virtqueues that actually need to be polled.
> +       * Note list_empty(link) means this virtqueue isn't polled.
> +       */
> +      struct list_head link;
> +      /* If this flag is true, the virtqueue is being shut down,
> +       * so vqpoll should not be re-enabled.
> +       */
> +      bool shutdown;
> +      /* Various counters used to decide when to enter polling mode
> +       * or leave it and return to notification mode.
> +       */
> +      unsigned long jiffies_last_kick;
> +      unsigned long jiffies_last_work;
> +      int work_this_jiffy;
> +      struct page *avail_page;
> +      volatile struct vring_avail *avail_mapped;
> +   } vqpoll;
>  };
>  
>  struct vhost_dev {
> @@ -123,6 +151,7 @@ struct vhost_dev {
>         spinlock_t work_lock;
>         struct list_head work_list;
>         struct task_struct *worker;
> +        struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int 
> nvqs);
> -- 
> 1.7.9.5
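
The round-robin step described in the roundrobin_poll() comment above (check
the head of the list, rotate it to the tail, fall back to notifications after
an idle period) can be sketched in userspace C. This is an illustration under
stated assumptions, not the patch's implementation: struct polled_vq,
struct poll_list, tick_after() and roundrobin_poll_sketch() are hypothetical
names, a fixed array replaces the kernel list, and the kick timestamp is
updated inside the function rather than by the caller. One deliberate
difference: the quoted patch compares jiffies with a plain ">", whereas the
sketch uses a wraparound-safe comparison in the style of the kernel's
time_after() macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POLLED 8

/* Hypothetical stand-in for a polled virtqueue. */
struct polled_vq {
	uint16_t avail_idx;       /* index published by the guest */
	uint16_t last_avail_idx;  /* index the host has consumed up to */
	unsigned long last_kick;  /* tick of the last simulated kick */
	bool polling;             /* cleared when we return to notifications */
};

/* Queues to poll; slot[0] is the next one to check. */
struct poll_list {
	struct polled_vq *slot[MAX_POLLED];
	size_t len;
};

/* Wraparound-safe "a is later than b", same idea as time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

static void poll_list_add(struct poll_list *pl, struct polled_vq *vq)
{
	assert(pl->len < MAX_POLLED);
	vq->polling = true;
	pl->slot[pl->len++] = vq;
}

/* Check the head queue; return it if it has new work, else NULL. */
static struct polled_vq *roundrobin_poll_sketch(struct poll_list *pl,
						unsigned long now,
						unsigned long stop_idle)
{
	struct polled_vq *vq;
	size_t i;

	if (pl->len == 0)
		return NULL;

	/* Rotate the head to the tail so the next call checks another queue. */
	vq = pl->slot[0];
	for (i = 1; i < pl->len; i++)
		pl->slot[i - 1] = pl->slot[i];
	pl->slot[pl->len - 1] = vq;

	if (vq->avail_idx != vq->last_avail_idx) {
		vq->last_kick = now;  /* the caller would run the kick handler */
		return vq;
	}

	if (tick_after(now, vq->last_kick + stop_idle)) {
		/* Idle for too long: drop it (it now sits at the tail). */
		vq->polling = false;
		pl->len--;
	}
	return NULL;
}
```

With two queues on the list, successive calls alternate between them; a queue
whose avail_idx never moves is dropped once now passes last_kick + stop_idle,
which corresponds to the poll_stop_idle timeout in the patch.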

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-29  1:30         ` Zhang Haoyu
@ 2014-07-29  7:15           ` Razya Ladelsky
  0 siblings, 0 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-29  7:15 UTC (permalink / raw)
  To: Zhang Haoyu
  Cc: Abel Gordon, Alex Glikson, Eran Raichstein, Jason Wang,
	Joel Nider, kvm, kvm-owner, Michael S. Tsirkin, Yossi Kuperman1

kvm-owner@vger.kernel.org wrote on 29/07/2014 04:30:34 AM:

> From: "Zhang Haoyu" <zhanghy@sangfor.com>
> To: "Jason Wang" <jasowang@redhat.com>, "Abel Gordon" 
> <abel.gordon@gmail.com>, 
> Cc: Razya Ladelsky/Haifa/IBM@IBMIL, Alex Glikson/Haifa/IBM@IBMIL, 
> Eran Raichstein/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, "kvm" 
> <kvm@vger.kernel.org>, "Michael S. Tsirkin" <mst@redhat.com>, Yossi 
> Kuperman1/Haifa/IBM@IBMIL
> Date: 29/07/2014 04:35 AM
> Subject: Re: [PATCH] vhost: Add polling mode
> Sent by: kvm-owner@vger.kernel.org
> 
> Maybe combining "vhost-net scalability tuning: threading for many VMs"
> with "vhost: Add polling mode" would be a good match, because a shared
> thread is more likely to find work to do with less polling time, so
> fewer cpu cycles are wasted.
> 

Hi Zhang,
Indeed, having one vhost thread shared by multiple VMs, polling for their
requests, is the ultimate goal of this plan.
The current challenge is that the cgroup mechanism needs to be
supported/incorporated somehow by this shared vhost thread, as it now
serves multiple VMs (processes).
By the way, if someone wants to help with this effort (mainly the cgroup
issue), it would be greatly appreciated!
 
Thank you,
Razya 

> Thanks,
> Zhang Haoyu
> 
> >>>>>> > >>> Hello All,
> >>>>>> > >>>
> >>>>>> > >>> When vhost is waiting for buffers from the guest driver (e.g., more
> >>>>>> > >>> packets to send in vhost-net's transmit queue), it normally goes to
> >>>>>> > >>> sleep and waits for the guest to "kick" it. This kick involves a PIO
> >>>>>> > >>> in the guest, and therefore an exit (and possibly userspace
> >>>>>> > >>> involvement in translating this PIO exit into a file descriptor
> >>>>>> > >>> event), all of which hurts performance.
> >>>>>> > >>>
> >>>>>> > >>> If the system is under-utilized (has cpu time to spare), vhost can
> >>>>>> > >>> continuously poll the virtqueues for new buffers, and avoid asking
> >>>>>> > >>> the guest to kick us.
> >>>>>> > >>> This patch adds an optional polling mode to vhost, that can be
> >>>>>> > >>> enabled via a kernel module parameter, "poll_start_rate".
> >>>>>> > >>>
> >>>>>> > >>> When polling is active for a virtqueue, the guest is asked to
> >>>>>> > >>> disable notification (kicks), and the worker thread continuously
> >>>>>> > >>> checks for new buffers. When it does discover new buffers, it
> >>>>>> > >>> simulates a "kick" by invoking the underlying backend driver (such
> >>>>>> > >>> as vhost-net), which thinks it got a real kick from the guest, and
> >>>>>> > >>> acts accordingly. If the underlying driver asks not to be kicked,
> >>>>>> > >>> we disable polling on this virtqueue.
> >>>>>> > >>>
> >>>>>> > >>> We start polling on a virtqueue when we notice it has work to do.
> >>>>>> > >>> Polling on this virtqueue is later disabled after 3 seconds of
> >>>>>> > >>> polling turning up no new work, as in this case we are better off
> >>>>>> > >>> returning to the exit-based notification mechanism. The default
> >>>>>> > >>> timeout of 3 seconds can be changed with the "poll_stop_idle"
> >>>>>> > >>> kernel module parameter.
> >>>>>> > >>>
> >>>>>> > >>> This polling approach makes a lot of sense for new HW with
> >>>>>> > >>> posted-interrupts for which we have exitless host-to-guest
> >>>>>> > >>> notifications. But even with support for posted interrupts,
> >>>>>> > >>> guest-to-host communication still causes exits. Polling adds the
> >>>>>> > >>> missing part.
> >>>>>> > >>>
> >>>>>> > >>> When systems are overloaded, there won't be enough cpu time for the
> >>>>>> > >>> various vhost threads to poll their guests' devices. For these
> >>>>>> > >>> scenarios, we plan to add support for vhost threads that can be
> >>>>>> > >>> shared by multiple devices, even of multiple vms.
> >>>>>> > >>> Our ultimate goal is to implement the I/O acceleration features
> >>>>>> > >>> described in:
> >>>>>> > >>> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> >>>>>> > >>> https://www.youtube.com/watch?v=9EyweibHfEs
> >>>>>> > >>> and
> >>>>>> > >>> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> >>>>>> > >>>
> >>>>>> > >>>
> >>>>>> > >>> Comments are welcome,
> >>>>>> > >>> Thank you,
> >>>>>> > >>> Razya
> >>>>> > >> Thanks for the work. Do you have perf numbers for this?
> >>>>> > >>
> >>>> > > Hi Jason,
> >>>> > > Thanks for reviewing. I ran some experiments with TCP stream netperf and
> >>>> > > filebench (having 2 threads performing random reads) benchmarks on an IBM
> >>>> > > System x3650 M4.
> >>>> > > All runs loaded the guests in a way that they were (cpu) saturated.
> >>>> > > The system had two cores per guest, as to allow for both the vcpu and the
> >>>> > > vhost thread to
> >>>> > > run concurrently for maximum throughput (but I didn't pin the threads to
> >>>> > > specific cores)
> >>>> > > I get:
> >>>> > >
> >>>> > > Netperf, 1 vm:
> >>>> > > The polling patch improved throughput by ~33%. Number of exits/sec
> >>>> > > decreased 6x.
> >>>> > > The same improvement was shown when I tested with 3 vms running netperf.
> >>>> > >
> >>>> > > filebench, 1 vm:
> >>>> > > ops/sec improved by 13% with the polling patch. Number of exits was
> >>>> > > reduced by 31%.
> >>>> > > The same experiment with 3 vms running filebench showed similar numbers.
> >>> >
> >>> > Looks good, may worth to add the result in the commit log.
> >>>> > >
> >>>>> > >> And looks like the patch only poll for virtqueue. In the future, may
> >>>>> > >> worth to add callbacks for vhost_net to poll socket. Then it could be
> >>>>> > >> used with rx busy polling in host which may speedup the rx also.
> >>>> > > Did you mean polling the network device to avoid interrupts?
> >>> >
> >>> > Yes, recent linux host support rx busy polling which can reduce the
> >>> > interrupts. If vhost can utilize this, it can also reduce the latency
> >>> > caused by vhost thread wakeups.
> >>> >
> >>> > And I'm also working on virtio-net busy polling in guest, if vhost can
> >>> > poll socket, it can also help in guest rx polling.
> >> Nice :)  Note that you may want to check if if the processor support
> >> posted interrupts. I guess that if CPU supports posted interrupts then
> >> benefits of polling in the front-end (from performance perspective)
> >> may not worth the cpu cycles wasted in the guest.
> >>
> >
> >Yes it's worth to check. But I think busy polling in guest may still
> >help since it may reduce the overhead of irq and NAPI in guest, also can
> >reduce the latency by eliminating wakeups of both vcpu thread in host
> >and userspace process in guest.
> 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-23  8:48       ` Abel Gordon
  2014-07-24  5:57         ` Jason Wang
@ 2014-07-29  1:30         ` Zhang Haoyu
  2014-07-29  7:15           ` Razya Ladelsky
  1 sibling, 1 reply; 55+ messages in thread
From: Zhang Haoyu @ 2014-07-29  1:30 UTC (permalink / raw)
  To: Jason Wang, Abel Gordon
  Cc: Razya Ladelsky, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Michael S. Tsirkin, Yossi Kuperman1

Maybe combining "vhost-net scalability tuning: threading for many VMs" with "vhost: Add polling mode" would be a good match:
a thread shared by many VMs is more likely to find work on each poll, so less polling time and fewer cpu cycles are wasted.

Thanks,
Zhang Haoyu

>>>>>> > >>> Hello All,
>>>>>> > >>>
>>>>>> > >>> When vhost is waiting for buffers from the guest driver (e.g., more
>>>>>> > >>> packets
>>>>>> > >>> to send in vhost-net's transmit queue), it normally goes to sleep and
>>>>>> > >>> waits
>>>>>> > >>> for the guest to "kick" it. This kick involves a PIO in the guest, and
>>>>>> > >>> therefore an exit (and possibly userspace involvement in translating
>>>> > > this
>>>>>> > >>> PIO
>>>>>> > >>> exit into a file descriptor event), all of which hurts performance.
>>>>>> > >>>
>>>>>> > >>> If the system is under-utilized (has cpu time to spare), vhost can
>>>>>> > >>> continuously poll the virtqueues for new buffers, and avoid asking
>>>>>> > >>> the guest to kick us.
>>>>>> > >>> This patch adds an optional polling mode to vhost, that can be enabled
>>>>>> > >>> via a kernel module parameter, "poll_start_rate".
>>>>>> > >>>
>>>>>> > >>> When polling is active for a virtqueue, the guest is asked to
>>>>>> > >>> disable notification (kicks), and the worker thread continuously
>>>> > > checks
>>>>>> > >>> for
>>>>>> > >>> new buffers. When it does discover new buffers, it simulates a "kick"
>>>> > > by
>>>>>> > >>> invoking the underlying backend driver (such as vhost-net), which
>>>> > > thinks
>>>>>> > >>> it
>>>>>> > >>> got a real kick from the guest, and acts accordingly. If the
>>>> > > underlying
>>>>>> > >>> driver asks not to be kicked, we disable polling on this virtqueue.
>>>>>> > >>>
>>>>>> > >>> We start polling on a virtqueue when we notice it has
>>>>>> > >>> work to do. Polling on this virtqueue is later disabled after 3
>>>> > > seconds of
>>>>>> > >>> polling turning up no new work, as in this case we are better off
>>>>>> > >>> returning
>>>>>> > >>> to the exit-based notification mechanism. The default timeout of 3
>>>> > > seconds
>>>>>> > >>> can be changed with the "poll_stop_idle" kernel module parameter.
>>>>>> > >>>
>>>>>> > >>> This polling approach makes lot of sense for new HW with
>>>> > > posted-interrupts
>>>>>> > >>> for which we have exitless host-to-guest notifications. But even with
>>>>>> > >>> support
>>>>>> > >>> for posted interrupts, guest-to-host communication still causes exits.
>>>>>> > >>> Polling adds the missing part.
>>>>>> > >>>
>>>>>> > >>> When systems are overloaded, there won't be enough cpu time for the
>>>>>> > >>> various
>>>>>> > >>> vhost threads to poll their guests' devices. For these scenarios, we
>>>> > > plan
>>>>>> > >>> to add support for vhost threads that can be shared by multiple
>>>> > > devices,
>>>>>> > >>> even of multiple vms.
>>>>>> > >>> Our ultimate goal is to implement the I/O acceleration features
>>>> > > described
>>>>>> > >>> in:
>>>>>> > >>> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
>>>>>> > >>> https://www.youtube.com/watch?v=9EyweibHfEs
>>>>>> > >>> and
>>>>>> > >>> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
>>>>>> > >>>
>>>>>> > >>>
>>>>>> > >>> Comments are welcome,
>>>>>> > >>> Thank you,
>>>>>> > >>> Razya
>>>>> > >> Thanks for the work. Do you have perf numbers for this?
>>>>> > >>
>>>> > > Hi Jason,
>>>> > > Thanks for reviewing. I ran some experiments with TCP stream netperf and
>>>> > > filebench (having 2 threads performing random reads) benchmarks on an IBM
>>>> > > System x3650 M4.
>>>> > > All runs loaded the guests in a way that they were (cpu) saturated.
>>>> > > The system had two cores per guest, as to allow for both the vcpu and the
>>>> > > vhost thread to
>>>> > > run concurrently for maximum throughput (but I didn't pin the threads to
>>>> > > specific cores)
>>>> > > I get:
>>>> > >
>>>> > > Netperf, 1 vm:
>>>> > > The polling patch improved throughput by ~33%. Number of exits/sec
>>>> > > decreased 6x.
>>>> > > The same improvement was shown when I tested with 3 vms running netperf.
>>>> > >
>>>> > > filebench, 1 vm:
>>>> > > ops/sec improved by 13% with the polling patch. Number of exits was
>>>> > > reduced by 31%.
>>>> > > The same experiment with 3 vms running filebench showed similar numbers.
>>> >
>>> > Looks good, may worth to add the result in the commit log.
>>>> > >
>>>>> > >> And looks like the patch only poll for virtqueue. In the future, may
>>>>> > >> worth to add callbacks for vhost_net to poll socket. Then it could be
>>>>> > >> used with rx busy polling in host which may speedup the rx also.
>>>> > > Did you mean polling the network device to avoid interrupts?
>>> >
>>> > Yes, recent linux host support rx busy polling which can reduce the
>>> > interrupts. If vhost can utilize this, it can also reduce the latency
>>> > caused by vhost thread wakeups.
>>> >
>>> > And I'm also working on virtio-net busy polling in guest, if vhost can
>>> > poll socket, it can also help in guest rx polling.
>> Nice :)  Note that you may want to check if if the processor support
>> posted interrupts. I guess that if CPU supports posted interrupts then
>> benefits of polling in the front-end (from performance perspective)
>> may not worth the cpu cycles wasted in the guest.
>>
>
>Yes it's worth to check. But I think busy polling in guest may still
>help since it may reduce the overhead of irq and NAPI in guest, also can
>reduce the latency by eliminating wakeups of both vcpu thread in host
>and userspace process in guest.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-23  8:48       ` Abel Gordon
@ 2014-07-24  5:57         ` Jason Wang
  2014-07-29  1:30         ` Zhang Haoyu
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Wang @ 2014-07-24  5:57 UTC (permalink / raw)
  To: Abel Gordon
  Cc: Razya Ladelsky, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Michael S. Tsirkin, Yossi Kuperman1

On 07/23/2014 04:48 PM, Abel Gordon wrote:
> On Wed, Jul 23, 2014 at 11:42 AM, Jason Wang <jasowang@redhat.com> wrote:
>> >
>> > On 07/23/2014 04:12 PM, Razya Ladelsky wrote:
>>> > > Jason Wang <jasowang@redhat.com> wrote on 23/07/2014 08:26:36 AM:
>>> > >
>>>> > >> From: Jason Wang <jasowang@redhat.com>
>>>> > >> To: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, "Michael S.
>>>> > >> Tsirkin" <mst@redhat.com>,
>>>> > >> Cc: abel.gordon@gmail.com, Joel Nider/Haifa/IBM@IBMIL, Yossi
>>>> > >> Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Alex
>>>> > >> Glikson/Haifa/IBM@IBMIL
>>>> > >> Date: 23/07/2014 08:26 AM
>>>> > >> Subject: Re: [PATCH] vhost: Add polling mode
>>>> > >>
>>>> > >> On 07/21/2014 09:23 PM, Razya Ladelsky wrote:
>>>>> > >>> Hello All,
>>>>> > >>>
>>>>> > >>> When vhost is waiting for buffers from the guest driver (e.g., more
>>>>> > >>> packets
>>>>> > >>> to send in vhost-net's transmit queue), it normally goes to sleep and
>>>>> > >>> waits
>>>>> > >>> for the guest to "kick" it. This kick involves a PIO in the guest, and
>>>>> > >>> therefore an exit (and possibly userspace involvement in translating
>>> > > this
>>>>> > >>> PIO
>>>>> > >>> exit into a file descriptor event), all of which hurts performance.
>>>>> > >>>
>>>>> > >>> If the system is under-utilized (has cpu time to spare), vhost can
>>>>> > >>> continuously poll the virtqueues for new buffers, and avoid asking
>>>>> > >>> the guest to kick us.
>>>>> > >>> This patch adds an optional polling mode to vhost, that can be enabled
>>>>> > >>> via a kernel module parameter, "poll_start_rate".
>>>>> > >>>
>>>>> > >>> When polling is active for a virtqueue, the guest is asked to
>>>>> > >>> disable notification (kicks), and the worker thread continuously
>>> > > checks
>>>>> > >>> for
>>>>> > >>> new buffers. When it does discover new buffers, it simulates a "kick"
>>> > > by
>>>>> > >>> invoking the underlying backend driver (such as vhost-net), which
>>> > > thinks
>>>>> > >>> it
>>>>> > >>> got a real kick from the guest, and acts accordingly. If the
>>> > > underlying
>>>>> > >>> driver asks not to be kicked, we disable polling on this virtqueue.
>>>>> > >>>
>>>>> > >>> We start polling on a virtqueue when we notice it has
>>>>> > >>> work to do. Polling on this virtqueue is later disabled after 3
>>> > > seconds of
>>>>> > >>> polling turning up no new work, as in this case we are better off
>>>>> > >>> returning
>>>>> > >>> to the exit-based notification mechanism. The default timeout of 3
>>> > > seconds
>>>>> > >>> can be changed with the "poll_stop_idle" kernel module parameter.
>>>>> > >>>
>>>>> > >>> This polling approach makes lot of sense for new HW with
>>> > > posted-interrupts
>>>>> > >>> for which we have exitless host-to-guest notifications. But even with
>>>>> > >>> support
>>>>> > >>> for posted interrupts, guest-to-host communication still causes exits.
>>>>> > >>> Polling adds the missing part.
>>>>> > >>>
>>>>> > >>> When systems are overloaded, there won't be enough cpu time for the
>>>>> > >>> various
>>>>> > >>> vhost threads to poll their guests' devices. For these scenarios, we
>>> > > plan
>>>>> > >>> to add support for vhost threads that can be shared by multiple
>>> > > devices,
>>>>> > >>> even of multiple vms.
>>>>> > >>> Our ultimate goal is to implement the I/O acceleration features
>>> > > described
>>>>> > >>> in:
>>>>> > >>> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
>>>>> > >>> https://www.youtube.com/watch?v=9EyweibHfEs
>>>>> > >>> and
>>>>> > >>> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> Comments are welcome,
>>>>> > >>> Thank you,
>>>>> > >>> Razya
>>>> > >> Thanks for the work. Do you have perf numbers for this?
>>>> > >>
>>> > > Hi Jason,
>>> > > Thanks for reviewing. I ran some experiments with TCP stream netperf and
>>> > > filebench (having 2 threads performing random reads) benchmarks on an IBM
>>> > > System x3650 M4.
>>> > > All runs loaded the guests in a way that they were (cpu) saturated.
>>> > > The system had two cores per guest, as to allow for both the vcpu and the
>>> > > vhost thread to
>>> > > run concurrently for maximum throughput (but I didn't pin the threads to
>>> > > specific cores)
>>> > > I get:
>>> > >
>>> > > Netperf, 1 vm:
>>> > > The polling patch improved throughput by ~33%. Number of exits/sec
>>> > > decreased 6x.
>>> > > The same improvement was shown when I tested with 3 vms running netperf.
>>> > >
>>> > > filebench, 1 vm:
>>> > > ops/sec improved by 13% with the polling patch. Number of exits was
>>> > > reduced by 31%.
>>> > > The same experiment with 3 vms running filebench showed similar numbers.
>> >
>> > Looks good, may worth to add the result in the commit log.
>>> > >
>>>> > >> And looks like the patch only poll for virtqueue. In the future, may
>>>> > >> worth to add callbacks for vhost_net to poll socket. Then it could be
>>>> > >> used with rx busy polling in host which may speedup the rx also.
>>> > > Did you mean polling the network device to avoid interrupts?
>> >
>> > Yes, recent linux host support rx busy polling which can reduce the
>> > interrupts. If vhost can utilize this, it can also reduce the latency
>> > caused by vhost thread wakeups.
>> >
>> > And I'm also working on virtio-net busy polling in guest, if vhost can
>> > poll socket, it can also help in guest rx polling.
> Nice :)  Note that you may want to check if if the processor support
> posted interrupts. I guess that if CPU supports posted interrupts then
> benefits of polling in the front-end (from performance perspective)
> may not worth the cpu cycles wasted in the guest.
>

Yes, it's worth checking. But I think busy polling in the guest may still
help, since it may reduce the overhead of irq and NAPI in the guest, and can
also reduce latency by eliminating wakeups of both the vcpu thread in the host
and the userspace process in the guest.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-23  8:42     ` Jason Wang
@ 2014-07-23  8:48       ` Abel Gordon
  2014-07-24  5:57         ` Jason Wang
  2014-07-29  1:30         ` Zhang Haoyu
  0 siblings, 2 replies; 55+ messages in thread
From: Abel Gordon @ 2014-07-23  8:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Razya Ladelsky, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Michael S. Tsirkin, Yossi Kuperman1

On Wed, Jul 23, 2014 at 11:42 AM, Jason Wang <jasowang@redhat.com> wrote:
>
> On 07/23/2014 04:12 PM, Razya Ladelsky wrote:
> > Jason Wang <jasowang@redhat.com> wrote on 23/07/2014 08:26:36 AM:
> >
> >> From: Jason Wang <jasowang@redhat.com>
> >> To: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, "Michael S.
> >> Tsirkin" <mst@redhat.com>,
> >> Cc: abel.gordon@gmail.com, Joel Nider/Haifa/IBM@IBMIL, Yossi
> >> Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Alex
> >> Glikson/Haifa/IBM@IBMIL
> >> Date: 23/07/2014 08:26 AM
> >> Subject: Re: [PATCH] vhost: Add polling mode
> >>
> >> On 07/21/2014 09:23 PM, Razya Ladelsky wrote:
> >>> Hello All,
> >>>
> >>> When vhost is waiting for buffers from the guest driver (e.g., more
> >>> packets
> >>> to send in vhost-net's transmit queue), it normally goes to sleep and
> >>> waits
> >>> for the guest to "kick" it. This kick involves a PIO in the guest, and
> >>> therefore an exit (and possibly userspace involvement in translating
> > this
> >>> PIO
> >>> exit into a file descriptor event), all of which hurts performance.
> >>>
> >>> If the system is under-utilized (has cpu time to spare), vhost can
> >>> continuously poll the virtqueues for new buffers, and avoid asking
> >>> the guest to kick us.
> >>> This patch adds an optional polling mode to vhost, that can be enabled
> >>> via a kernel module parameter, "poll_start_rate".
> >>>
> >>> When polling is active for a virtqueue, the guest is asked to
> >>> disable notification (kicks), and the worker thread continuously
> > checks
> >>> for
> >>> new buffers. When it does discover new buffers, it simulates a "kick"
> > by
> >>> invoking the underlying backend driver (such as vhost-net), which
> > thinks
> >>> it
> >>> got a real kick from the guest, and acts accordingly. If the
> > underlying
> >>> driver asks not to be kicked, we disable polling on this virtqueue.
> >>>
> >>> We start polling on a virtqueue when we notice it has
> >>> work to do. Polling on this virtqueue is later disabled after 3
> > seconds of
> >>> polling turning up no new work, as in this case we are better off
> >>> returning
> >>> to the exit-based notification mechanism. The default timeout of 3
> > seconds
> >>> can be changed with the "poll_stop_idle" kernel module parameter.
> >>>
> >>> This polling approach makes lot of sense for new HW with
> > posted-interrupts
> >>> for which we have exitless host-to-guest notifications. But even with
> >>> support
> >>> for posted interrupts, guest-to-host communication still causes exits.
> >>> Polling adds the missing part.
> >>>
> >>> When systems are overloaded, there won't be enough cpu time for the
> >>> various
> >>> vhost threads to poll their guests' devices. For these scenarios, we
> > plan
> >>> to add support for vhost threads that can be shared by multiple
> > devices,
> >>> even of multiple vms.
> >>> Our ultimate goal is to implement the I/O acceleration features
> > described
> >>> in:
> >>> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
> >>> https://www.youtube.com/watch?v=9EyweibHfEs
> >>> and
> >>> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> >>>
> >>>
> >>> Comments are welcome,
> >>> Thank you,
> >>> Razya
> >> Thanks for the work. Do you have perf numbers for this?
> >>
> > Hi Jason,
> > Thanks for reviewing. I ran some experiments with TCP stream netperf and
> > filebench (having 2 threads performing random reads) benchmarks on an IBM
> > System x3650 M4.
> > All runs loaded the guests in a way that they were (cpu) saturated.
> > The system had two cores per guest, as to allow for both the vcpu and the
> > vhost thread to
> > run concurrently for maximum throughput (but I didn't pin the threads to
> > specific cores)
> > I get:
> >
> > Netperf, 1 vm:
> > The polling patch improved throughput by ~33%. Number of exits/sec
> > decreased 6x.
> > The same improvement was shown when I tested with 3 vms running netperf.
> >
> > filebench, 1 vm:
> > ops/sec improved by 13% with the polling patch. Number of exits was
> > reduced by 31%.
> > The same experiment with 3 vms running filebench showed similar numbers.
>
> Looks good, may worth to add the result in the commit log.
> >
> >> And looks like the patch only poll for virtqueue. In the future, may
> >> worth to add callbacks for vhost_net to poll socket. Then it could be
> >> used with rx busy polling in host which may speedup the rx also.
> > Did you mean polling the network device to avoid interrupts?
>
> Yes, recent linux host support rx busy polling which can reduce the
> interrupts. If vhost can utilize this, it can also reduce the latency
> caused by vhost thread wakeups.
>
> And I'm also working on virtio-net busy polling in guest, if vhost can
> poll socket, it can also help in guest rx polling.

Nice :)  Note that you may want to check whether the processor supports
posted interrupts. I guess that if the CPU supports posted interrupts, then
the benefits of polling in the front-end (from a performance perspective)
may not be worth the cpu cycles wasted in the guest.


> >>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>> index c90f437..678d766 100644
> >>> --- a/drivers/vhost/vhost.c
> >>> +++ b/drivers/vhost/vhost.c
> >>> @@ -24,9 +24,17 @@
> >>>  #include <linux/slab.h>
> >>>  #include <linux/kthread.h>
> >>>  #include <linux/cgroup.h>
> >>> +#include <linux/jiffies.h>
> >>>  #include <linux/module.h>
> >>>
> >>>  #include "vhost.h"
> >>> +static int poll_start_rate = 0;
> >>> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> >>> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
> >>> +
> >>> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> >>> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> >>> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
> >>>
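
A minimal userspace sketch of the start/stop heuristic that the two module
parameters quoted above describe: start polling once the number of events
seen in one jiffy reaches poll_start_rate (0 means never), and stop after
more than poll_stop_idle jiffies without work. This is a model for
illustration only, not the patch's code. The struct, the function names and
the jiffies_last_work field are assumptions; only work_this_jiffy, enabled
and the two parameter names are taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the per-virtqueue polling state (hypothetical). */
struct vqpoll_model {
	bool enabled;
	unsigned long jiffies_last_work; /* last jiffy in which work was seen */
	unsigned long cur_jiffy;         /* jiffy that work_this_jiffy counts */
	int work_this_jiffy;
};

/* Record one unit of work (a kick, or buffers found) at time `now`. */
static void vqpoll_note_work(struct vqpoll_model *p, unsigned long now,
			     int poll_start_rate)
{
	assert(p);
	if (now != p->cur_jiffy) {
		/* A new jiffy started: restart the per-jiffy counter. */
		p->cur_jiffy = now;
		p->work_this_jiffy = 0;
	}
	p->work_this_jiffy++;
	p->jiffies_last_work = now;
	/* poll_start_rate == 0 means "never start polling". */
	if (!p->enabled && poll_start_rate > 0 &&
	    p->work_this_jiffy >= poll_start_rate)
		p->enabled = true;
}

/* Called when a poll at time `now` found nothing to do. */
static void vqpoll_note_idle(struct vqpoll_model *p, unsigned long now,
			     unsigned long poll_stop_idle)
{
	assert(p);
	/* Unsigned subtraction stays correct if the counter wraps. */
	if (p->enabled && now - p->jiffies_last_work > poll_stop_idle)
		p->enabled = false;
}
```

In this model a queue that sees two events within the same jiffy starts
polling when poll_start_rate is 2, and falls back to notifications once
more than poll_stop_idle jiffies pass with no work.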
> >> I'm not sure using jiffy is good enough since user need know HZ value.
> >> May worth to look at sk_busy_loop() which use sched_clock() and us.
> > Ok, Will look into it, thanks.
> >
> >>> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> >>> + *
> >>> + * Enabling this mode it tells the guest not to notify ("kick") us when its
> >>> + * has made more work available on this virtqueue; Rather, we will continuously
> >>> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
> >>> + * the worker thread polls them all, e.g., in a round-robin fashion.
> >>> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> >>> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
> >>> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
> >>> + *
> >>> + * It is assumed that these functions are called relatively rarely, when vhost
> >>> + * notices that this virtqueue's usage pattern significantly changed in a way
> >>> + * that makes polling more efficient than notification, or vice versa.
> >>> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> >>> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> >>> + * reclaimed.
> >>> + */
> >>> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> >>> +{
> >>> +       if (vq->vqpoll.enabled)
> >>> +               return; /* already enabled, nothing to do */
> >>> +       if (!vq->handle_kick)
> >>> +               return; /* polling will be a waste of time if no callback! */
> >>> +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> >>> +               /* vq has guest notifications enabled. Disable them,
> >>> +                  and instead add vq to the polling list */
> >>>
> >>> +               list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
> >> This may work when there're at most two vqs in the list. But consider
> >> you may want to poll a lot of vqs in the future, it may take a long time
> >> for this vq to get polled. So probably we can just keep the used_flags
> >> untouched, if the vq get kicked, it can be served soon.
> > Indeed there is a patch ready for polling multiple virtqueues, and it has
> > a better scheduling algorithm that avoids a virtqueue starvation.
>
> Ok.
> >
> >
> >>> +       }
> >>> +       vq->vqpoll.jiffies_last_kick = jiffies;
> >>> +       __get_user(vq->avail_idx, &vq->avail->idx);
> >>> +       vq->vqpoll.enabled = true;
> >>> +
> >>> +       /* Map userspace's vq->avail to the kernel's memory space. */
> >>> +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> >>> +               &vq->vqpoll.avail_page) != 1) {
> >>> +               /* TODO: can this happen, as we check access
> >>> +               to vq->avail in advance? */
> >>> +               BUG();
> >>> +       }
> >>> +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
> >>> +               (unsigned long)kmap(vq->vqpoll.avail_page) |
> >>> +               ((unsigned long)vq->avail & ~PAGE_MASK));
> >> Is it a must to map avail page here?
> >>
> > No. This is indeed in preparation for the next patch handling multiple
> > queues by a single vhost thread, where we'd like to map these pages for
> > performance.
>
> Make sense. Then for the gup failure, we may just safely disable the
> polling in this case.
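
One way the suggestion above (disable polling on gup failure instead of
calling BUG()) could look, sketched as a userspace model. Everything here is
hypothetical: pin_fn stands in for get_user_pages_fast(), and vq_model is a
reduced stand-in for the virtqueue state. The only point illustrated is the
ordering: try to pin the page first, and mark polling enabled only if that
succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for get_user_pages_fast(): returns the number of pages pinned. */
typedef int (*pin_fn)(unsigned long addr, void **page);

struct vq_model {
	bool poll_enabled;
	void *avail_page;
};

/* Returns true if polling is enabled, false if we stay on guest kicks. */
static bool try_enable_vqpoll(struct vq_model *vq, unsigned long avail_addr,
			      pin_fn pin)
{
	assert(vq && pin);
	if (vq->poll_enabled)
		return true; /* already enabled, nothing to do */
	if (pin(avail_addr, &vq->avail_page) != 1) {
		/* Pinning failed: keep notification mode, no BUG(). */
		vq->avail_page = NULL;
		return false;
	}
	vq->poll_enabled = true;
	return true;
}

/* Two stand-in pinning functions that exercise both paths. */
static int pin_ok(unsigned long addr, void **page)
{
	*page = (void *)(addr & ~0xfffUL); /* page-aligned base, 4K pages assumed */
	return 1;
}

static int pin_fail(unsigned long addr, void **page)
{
	(void)addr;
	(void)page;
	return 0;
}
```

With pin_fail the queue simply stays in the exit-based notification mode, so
a failed mapping costs performance rather than crashing the host.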
> >
> >
> >
> >>> +                               vq->vqpoll.work_this_jiffy >=
> >>> +                                       poll_start_rate) {
> >>> +                               vhost_vq_enable_vqpoll(vq);
> >>> +                       }
> >>> +               }
> >>> +               /* Check one virtqueue from the round-robin list */
> >>> +               if (!list_empty(&dev->vqpoll_list)) {
> >> If we still have another work in work_list, we may want to serve it
> > first.
> >
> > You maybe right. We've done a lot of experiments with this method,
> > which seems to work well. I prefer leaving it this way for now, but your
> > approach
> > is worthwhile to investigate as well.
>
> Ok.
> >
> >> [...]
> >>>  struct vhost_dev {
> >>> @@ -123,6 +151,7 @@ struct vhost_dev {
> >>>         spinlock_t work_lock;
> >>>         struct list_head work_list;
> >>>         struct task_struct *worker;
> >>> +        struct list_head vqpoll_list;
> >>>  };
> >>>
> >>>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-23  8:12   ` Razya Ladelsky
@ 2014-07-23  8:42     ` Jason Wang
  2014-07-23  8:48       ` Abel Gordon
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2014-07-23  8:42 UTC (permalink / raw)
  To: Razya Ladelsky
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Michael S. Tsirkin, Yossi Kuperman1

On 07/23/2014 04:12 PM, Razya Ladelsky wrote:
> Jason Wang <jasowang@redhat.com> wrote on 23/07/2014 08:26:36 AM:
>
>> From: Jason Wang <jasowang@redhat.com>
>> To: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, "Michael S.
>> Tsirkin" <mst@redhat.com>, 
>> Cc: abel.gordon@gmail.com, Joel Nider/Haifa/IBM@IBMIL, Yossi 
>> Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Alex 
>> Glikson/Haifa/IBM@IBMIL
>> Date: 23/07/2014 08:26 AM
>> Subject: Re: [PATCH] vhost: Add polling mode
>>
>> On 07/21/2014 09:23 PM, Razya Ladelsky wrote:
>>> Hello All,
>>>
>>> When vhost is waiting for buffers from the guest driver (e.g., more
>>> packets to send in vhost-net's transmit queue), it normally goes to
>>> sleep and waits for the guest to "kick" it. This kick involves a PIO in
>>> the guest, and therefore an exit (and possibly userspace involvement in
>>> translating this PIO exit into a file descriptor event), all of which
>>> hurts performance.
>>>
>>> If the system is under-utilized (has cpu time to spare), vhost can
>>> continuously poll the virtqueues for new buffers, and avoid asking
>>> the guest to kick us.
>>> This patch adds an optional polling mode to vhost, that can be enabled
>>> via a kernel module parameter, "poll_start_rate".
>>>
>>> When polling is active for a virtqueue, the guest is asked to
>>> disable notification (kicks), and the worker thread continuously checks
>>> for new buffers. When it does discover new buffers, it simulates a
>>> "kick" by invoking the underlying backend driver (such as vhost-net),
>>> which thinks it got a real kick from the guest, and acts accordingly.
>>> If the underlying driver asks not to be kicked, we disable polling on
>>> this virtqueue.
>>>
>>> We start polling on a virtqueue when we notice it has
>>> work to do. Polling on this virtqueue is later disabled after 3 seconds
>>> of polling turning up no new work, as in this case we are better off
>>> returning to the exit-based notification mechanism. The default timeout
>>> of 3 seconds can be changed with the "poll_stop_idle" kernel module
>>> parameter.
>>>
>>> This polling approach makes lot of sense for new HW with
>>> posted-interrupts for which we have exitless host-to-guest
>>> notifications. But even with support for posted interrupts,
>>> guest-to-host communication still causes exits.
>>> Polling adds the missing part.
>>>
>>> When systems are overloaded, there won't be enough cpu time for the
>>> various vhost threads to poll their guests' devices. For these
>>> scenarios, we plan to add support for vhost threads that can be shared
>>> by multiple devices, even of multiple vms.
>>> Our ultimate goal is to implement the I/O acceleration features
>>> described in:
>>> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon)
>>> https://www.youtube.com/watch?v=9EyweibHfEs
>>> and
>>> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
>>>
>>>
>>> Comments are welcome, 
>>> Thank you,
>>> Razya
>> Thanks for the work. Do you have perf numbers for this?
>>
> Hi Jason,
> Thanks for reviewing. I ran some experiments with TCP stream netperf and 
> filebench (having 2 threads performing random reads) benchmarks on an IBM 
> System x3650 M4.
> All runs loaded the guests in a way that they were (cpu) saturated.
> The system had two cores per guest, as to allow for both the vcpu and the 
> vhost thread to
> run concurrently for maximum throughput (but I didn't pin the threads to 
> specific cores)
> I get:
>
> Netperf, 1 vm:
> The polling patch improved throughput by ~33%. Number of exits/sec 
> decreased 6x.
> The same improvement was shown when I tested with 3 vms running netperf.
>
> filebench, 1 vm:
> ops/sec improved by 13% with the polling patch. Number of exits was 
> reduced by 31%.
> The same experiment with 3 vms running filebench showed similar numbers.

Looks good; it may be worth adding the results to the commit log.
>
>> And looks like the patch only poll for virtqueue. In the future, may
>> worth to add callbacks for vhost_net to poll socket. Then it could be
>> used with rx busy polling in host which may speedup the rx also.
> Did you mean polling the network device to avoid interrupts?

Yes, recent Linux hosts support rx busy polling, which can reduce
interrupts. If vhost can make use of this, it can also reduce the latency
caused by vhost thread wakeups.

I'm also working on virtio-net busy polling in the guest; if vhost can
poll the socket, that can also help guest rx polling.
>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>> index c90f437..678d766 100644
>>> --- a/drivers/vhost/vhost.c
>>> +++ b/drivers/vhost/vhost.c
>>> @@ -24,9 +24,17 @@
>>>  #include <linux/slab.h>
>>>  #include <linux/kthread.h>
>>>  #include <linux/cgroup.h>
>>> +#include <linux/jiffies.h>
>>>  #include <linux/module.h>
>>>
>>>  #include "vhost.h"
>>> +static int poll_start_rate = 0;
>>> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
>>> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
>>> +
>>> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
>>> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
>>> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
>>>
>> I'm not sure using jiffy is good enough since user need know HZ value.
>> May worth to look at sk_busy_loop() which use sched_clock() and us. 
> Ok, Will look into it, thanks.
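
To make the jiffies concern concrete, here is a small sketch in plain
userspace C (not kernel code) of how the same timeout value, given in
jiffies, maps to different wall-clock durations depending on the HZ the
kernel was built with. The helper name is hypothetical and the HZ values
are only examples.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): convert a timeout given in
 * jiffies to microseconds for a kernel built with the given HZ. */
static uint64_t timeout_jiffies_to_usecs(uint64_t timeout_jiffies,
					 unsigned int hz)
{
	assert(hz > 0);
	return timeout_jiffies * 1000000ULL / hz;
}
```

For example, a user who sets poll_stop_idle=300 would get a 3 second idle
timeout with HZ=100 but only 0.3 seconds with HZ=1000. A parameter
expressed in microseconds would not depend on the build configuration.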
>
>>> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
>>> + *
>>> + * Enabling this mode it tells the guest not to notify ("kick") us when its
>>> + * has made more work available on this virtqueue; Rather, we will continuously
>>> + * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
>>> + * the worker thread polls them all, e.g., in a round-robin fashion.
>>> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
>>> + * actually being polled: The backend (e.g., net.c) may temporarily disable it
>>> + * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
>>> + *
>>> + * It is assumed that these functions are called relatively rarely, when vhost
>>> + * notices that this virtqueue's usage pattern significantly changed in a way
>>> + * that makes polling more efficient than notification, or vice versa.
>>> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
>>> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
>>> + * reclaimed.
>>> + */
>>> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
>>> +{
>>> +       if (vq->vqpoll.enabled)
>>> +               return; /* already enabled, nothing to do */
>>> +       if (!vq->handle_kick)
>>> +               return; /* polling will be a waste of time if no 
> callback! 
>>> */
>>> +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
>>> +               /* vq has guest notifications enabled. Disable them,
>>> +                  and instead add vq to the polling list */
>>>
>>> +               list_add_tail(&vq->vqpoll.link, 
> &vq->dev->vqpoll_list);
>> This may work when there're at most two vqs in the list. But consider
>> you may want to poll a lot of vqs in the future, it may take a long time
>> for this vq to get polled. So probably we can just keep the used_flags
>> untouched, if the vq get kicked, it can be served soon.
> Indeed there is a patch ready for polling multiple virtqueues, and it has 
> a better scheduling algorithm that avoids a virtqueue starvation. 

Ok.
>
>
>>> +       }
>>> +       vq->vqpoll.jiffies_last_kick = jiffies;
>>> +       __get_user(vq->avail_idx, &vq->avail->idx); 
>>> +       vq->vqpoll.enabled = true;
>>> +
>>> +       /* Map userspace's vq->avail to the kernel's memory space. */
>>> +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
>>> +               &vq->vqpoll.avail_page) != 1) {
>>> +               /* TODO: can this happen, as we check access
>>> +               to vq->avail in advance? */
>>> +               BUG();
>>> +       }
>>> +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
>>> +               (unsigned long)kmap(vq->vqpoll.avail_page) |
>>> +               ((unsigned long)vq->avail & ~PAGE_MASK));
>> Is it a must to map avail page here?
>>
> No. This is indeed in preparation for the next patch handling multiple 
> queues by a single vhost thread, where we'd like to map these pages for
> performance. 

Makes sense. Then, if gup fails, we can just safely disable polling
for this vq instead.
>
>
>
>>> +                               vq->vqpoll.work_this_jiffy >=
>>> +                                       poll_start_rate) {
>>> +                               vhost_vq_enable_vqpoll(vq);
>>> +                       }
>>> +               }
>>> +               /* Check one virtqueue from the round-robin list */
>>> +               if (!list_empty(&dev->vqpoll_list)) {
>> If we still have another work in work_list, we may want to serve it 
> first.
>
> You maybe right. We've done a lot of experiments with this method, 
> which seems to work well. I prefer leaving it this way for now, but your 
> approach 
> is worthwhile to investigate as well. 

Ok.
>
>> [...]
>>>  struct vhost_dev {
>>> @@ -123,6 +151,7 @@ struct vhost_dev {
>>>         spinlock_t work_lock;
>>>         struct list_head work_list;
>>>         struct task_struct *worker;
>>> +        struct list_head vqpoll_list;
>>>  };
>>>
>>>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, 
> int 
>>> nvqs);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-23  5:26 ` Jason Wang
@ 2014-07-23  8:12   ` Razya Ladelsky
  2014-07-23  8:42     ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-23  8:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: abel.gordon, Alex Glikson, Eran Raichstein, Joel Nider, kvm,
	Michael S. Tsirkin, Yossi Kuperman1

Jason Wang <jasowang@redhat.com> wrote on 23/07/2014 08:26:36 AM:

> From: Jason Wang <jasowang@redhat.com>
> To: Razya Ladelsky/Haifa/IBM@IBMIL, kvm@vger.kernel.org, "Michael S.
> Tsirkin" <mst@redhat.com>, 
> Cc: abel.gordon@gmail.com, Joel Nider/Haifa/IBM@IBMIL, Yossi 
> Kuperman1/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, Alex 
> Glikson/Haifa/IBM@IBMIL
> Date: 23/07/2014 08:26 AM
> Subject: Re: [PATCH] vhost: Add polling mode
> 
> On 07/21/2014 09:23 PM, Razya Ladelsky wrote:
> > Hello All,
> >
> > When vhost is waiting for buffers from the guest driver (e.g., more 
> > packets
> > to send in vhost-net's transmit queue), it normally goes to sleep and 
> > waits
> > for the guest to "kick" it. This kick involves a PIO in the guest, and
> > therefore an exit (and possibly userspace involvement in translating 
this 
> > PIO
> > exit into a file descriptor event), all of which hurts performance.
> >
> > If the system is under-utilized (has cpu time to spare), vhost can 
> > continuously poll the virtqueues for new buffers, and avoid asking 
> > the guest to kick us.
> > This patch adds an optional polling mode to vhost, that can be enabled 

> > via a kernel module parameter, "poll_start_rate".
> >
> > When polling is active for a virtqueue, the guest is asked to
> > disable notification (kicks), and the worker thread continuously 
checks 
> > for
> > new buffers. When it does discover new buffers, it simulates a "kick" 
by
> > invoking the underlying backend driver (such as vhost-net), which 
thinks 
> > it
> > got a real kick from the guest, and acts accordingly. If the 
underlying
> > driver asks not to be kicked, we disable polling on this virtqueue.
> >
> > We start polling on a virtqueue when we notice it has
> > work to do. Polling on this virtqueue is later disabled after 3 
seconds of
> > polling turning up no new work, as in this case we are better off 
> > returning
> > to the exit-based notification mechanism. The default timeout of 3 
seconds
> > can be changed with the "poll_stop_idle" kernel module parameter.
> >
> > This polling approach makes lot of sense for new HW with 
posted-interrupts
> > for which we have exitless host-to-guest notifications. But even with 
> > support 
> > for posted interrupts, guest-to-host communication still causes exits. 

> > Polling adds the missing part.
> >
> > When systems are overloaded, there won't be enough cpu time for the 
> > various 
> > vhost threads to poll their guests' devices. For these scenarios, we 
plan 
> > to add support for vhost threads that can be shared by multiple 
devices, 
> > even of multiple vms. 
> > Our ultimate goal is to implement the I/O acceleration features 
described 
> > in:
> > KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) 
> > https://www.youtube.com/watch?v=9EyweibHfEs
> > and
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
> >
> > 
> > Comments are welcome, 
> > Thank you,
> > Razya
> 
> Thanks for the work. Do you have perf numbers for this?
> 

Hi Jason,
Thanks for reviewing. I ran some experiments with the netperf TCP stream
and filebench (2 threads performing random reads) benchmarks on an IBM
System x3650 M4.
All runs loaded the guests so that they were cpu-saturated.
The system had two cores per guest, to allow both the vcpu and the vhost
thread to run concurrently for maximum throughput (but I didn't pin the
threads to specific cores).
I get:

Netperf, 1 vm:
The polling patch improved throughput by ~33%. Number of exits/sec 
decreased 6x.
The same improvement was shown when I tested with 3 vms running netperf.

filebench, 1 vm:
ops/sec improved by 13% with the polling patch. Number of exits was 
reduced by 31%.
The same experiment with 3 vms running filebench showed similar numbers.


> And looks like the patch only poll for virtqueue. In the future, may
> worth to add callbacks for vhost_net to poll socket. Then it could be
> used with rx busy polling in host which may speedup the rx also.

Did you mean polling the network device to avoid interrupts?

> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c90f437..678d766 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -24,9 +24,17 @@
> >  #include <linux/slab.h>
> >  #include <linux/kthread.h>
> >  #include <linux/cgroup.h>
> > +#include <linux/jiffies.h>
> >  #include <linux/module.h>
> > 
> >  #include "vhost.h"
> > +static int poll_start_rate = 0;
> > +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of 
virtqueue 
> > when rate of events is at least this number per jiffy. If 0, never 
start 
> > polling.");
> > +
> > +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> > +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> > +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of 
virtqueue 
> > after this many jiffies of no work.");
> > 
> 
> I'm not sure using jiffy is good enough since user need know HZ value.
> May worth to look at sk_busy_loop() which use sched_clock() and us. 

Ok, Will look into it, thanks.

> > 
> > +/* Enable or disable virtqueue polling (vqpoll.enabled) for a 
virtqueue.
> > + *
> > + * Enabling this mode it tells the guest not to notify ("kick") us 
when 
> > its
> > + * has made more work available on this virtqueue; Rather, we will 
> > continuously
> > + * poll this virtqueue in the worker thread. If multiple virtqueues 
are 
> > polled,
> > + * the worker thread polls them all, e.g., in a round-robin fashion.
> > + * Note that vqpoll.enabled doesn't always mean that this virtqueue 
is
> > + * actually being polled: The backend (e.g., net.c) may temporarily 
> > disable it
> > + * using vhost_disable/enable_notify(), while vqpoll.enabled is 
> > unchanged.
> > + *
> > + * It is assumed that these functions are called relatively rarely, 
when 
> > vhost
> > + * notices that this virtqueue's usage pattern significantly changed 
in a 
> > way
> > + * that makes polling more efficient than notification, or vice 
versa.
> > + * Also, we assume that vhost_vq_disable_vqpoll() is always called on 
vq
> > + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can 
be
> > + * reclaimed.
> > + */
> > +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> > +{
> > +       if (vq->vqpoll.enabled)
> > +               return; /* already enabled, nothing to do */
> > +       if (!vq->handle_kick)
> > +               return; /* polling will be a waste of time if no 
callback! 
> > */
> > +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> > +               /* vq has guest notifications enabled. Disable them,
> > +                  and instead add vq to the polling list */
> >
> > +               list_add_tail(&vq->vqpoll.link, 
&vq->dev->vqpoll_list);
> 
> This may work when there're at most two vqs in the list. But consider
> you may want to poll a lot of vqs in the future, it may take a long time
> for this vq to get polled. So probably we can just keep the used_flags
> untouched, if the vq get kicked, it can be served soon.

Indeed, there is a patch ready for polling multiple virtqueues, and it has
a better scheduling algorithm that avoids virtqueue starvation.


> > +       }
> > +       vq->vqpoll.jiffies_last_kick = jiffies;
> > +       __get_user(vq->avail_idx, &vq->avail->idx); 
> > +       vq->vqpoll.enabled = true;
> > +
> > +       /* Map userspace's vq->avail to the kernel's memory space. */
> > +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> > +               &vq->vqpoll.avail_page) != 1) {
> > +               /* TODO: can this happen, as we check access
> > +               to vq->avail in advance? */
> > +               BUG();
> > +       }
> > +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
> > +               (unsigned long)kmap(vq->vqpoll.avail_page) |
> > +               ((unsigned long)vq->avail & ~PAGE_MASK));
> 
> Is it a must to map avail page here?
> 

No. This is indeed in preparation for the next patch handling multiple 
queues by a single vhost thread, where we'd like to map these pages for
performance. 



> > +                               vq->vqpoll.work_this_jiffy >=
> > +                                       poll_start_rate) {
> > +                               vhost_vq_enable_vqpoll(vq);
> > +                       }
> > +               }
> > +               /* Check one virtqueue from the round-robin list */
> > +               if (!list_empty(&dev->vqpoll_list)) {
> 
> If we still have another work in work_list, we may want to serve it 
first.

You may be right. We've done a lot of experiments with this method,
and it seems to work well. I prefer leaving it this way for now, but your
approach is worth investigating as well.


> [...]
> >  struct vhost_dev {
> > @@ -123,6 +151,7 @@ struct vhost_dev {
> >         spinlock_t work_lock;
> >         struct list_head work_list;
> >         struct task_struct *worker;
> > +        struct list_head vqpoll_list;
> >  };
> > 
> >  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, 
int 
> > nvqs);
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH] vhost: Add polling mode
  2014-07-21 13:23 Razya Ladelsky
@ 2014-07-23  5:26 ` Jason Wang
  2014-07-23  8:12   ` Razya Ladelsky
  2014-07-29  8:06 ` Michael S. Tsirkin
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2014-07-23  5:26 UTC (permalink / raw)
  To: Razya Ladelsky, kvm, Michael S. Tsirkin
  Cc: abel.gordon, Joel Nider, Yossi Kuperman1, Eran Raichstein, Alex Glikson

On 07/21/2014 09:23 PM, Razya Ladelsky wrote:
> Hello All,
>
> When vhost is waiting for buffers from the guest driver (e.g., more 
> packets
> to send in vhost-net's transmit queue), it normally goes to sleep and 
> waits
> for the guest to "kick" it. This kick involves a PIO in the guest, and
> therefore an exit (and possibly userspace involvement in translating this 
> PIO
> exit into a file descriptor event), all of which hurts performance.
>
> If the system is under-utilized (has cpu time to spare), vhost can 
> continuously poll the virtqueues for new buffers, and avoid asking 
> the guest to kick us.
> This patch adds an optional polling mode to vhost, that can be enabled 
> via a kernel module parameter, "poll_start_rate".
>
> When polling is active for a virtqueue, the guest is asked to
> disable notification (kicks), and the worker thread continuously checks 
> for
> new buffers. When it does discover new buffers, it simulates a "kick" by
> invoking the underlying backend driver (such as vhost-net), which thinks 
> it
> got a real kick from the guest, and acts accordingly. If the underlying
> driver asks not to be kicked, we disable polling on this virtqueue.
>
> We start polling on a virtqueue when we notice it has
> work to do. Polling on this virtqueue is later disabled after 3 seconds of
> polling turning up no new work, as in this case we are better off 
> returning
> to the exit-based notification mechanism. The default timeout of 3 seconds
> can be changed with the "poll_stop_idle" kernel module parameter.
>
> This polling approach makes lot of sense for new HW with posted-interrupts
> for which we have exitless host-to-guest notifications. But even with 
> support 
> for posted interrupts, guest-to-host communication still causes exits. 
> Polling adds the missing part.
>
> When systems are overloaded, there won't be enough cpu time for the 
> various 
> vhost threads to poll their guests' devices. For these scenarios, we plan 
> to add support for vhost threads that can be shared by multiple devices, 
> even of multiple vms. 
> Our ultimate goal is to implement the I/O acceleration features described 
> in:
> KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) 
> https://www.youtube.com/watch?v=9EyweibHfEs
> and
> https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html
>
>  
> Comments are welcome, 
> Thank you,
> Razya

Thanks for the work. Do you have perf numbers for this?

Also, it looks like the patch only polls the virtqueue. In the future, it
may be worth adding callbacks so vhost_net can poll the socket. Then it
could be used with rx busy polling in the host, which may speed up rx too.
>
> From: Razya Ladelsky <razya@il.ibm.com>
>
> Add an optional polling mode to continuously poll the virtqueues
> for new buffers, and avoid asking the guest to kick us.
>
> Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
> ---
>  drivers/vhost/net.c   |    6 +-
>  drivers/vhost/scsi.c  |    5 +-
>  drivers/vhost/vhost.c |  247 
> +++++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/vhost/vhost.h |   37 +++++++-
>  4 files changed, 277 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 971a760..558aecb 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct 
> file *f)
>         }
>         vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
>  
> -       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, 
> dev);
> -       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, 
> dev);
> +       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
> +                       vqs[VHOST_NET_VQ_TX]);
> +       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
> +                       vqs[VHOST_NET_VQ_RX]);
>  
>         f->private_data = n;
>  
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 4f4ffa4..56f0233 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, 
> struct file *f)
>         if (!vqs)
>                 goto err_vqs;
>  
> -       vhost_work_init(&vs->vs_completion_work, 
> vhost_scsi_complete_cmd_work);
> -       vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
> -
> +       vhost_work_init(&vs->vs_completion_work, NULL, 
> vhost_scsi_complete_cmd_work);
> +       vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
>         vs->vs_events_nr = 0;
>         vs->vs_events_missed = false;
>  
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c90f437..678d766 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -24,9 +24,17 @@
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
>  #include <linux/cgroup.h>
> +#include <linux/jiffies.h>
>  #include <linux/module.h>
>  
>  #include "vhost.h"
> +static int poll_start_rate = 0;
> +module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue 
> when rate of events is at least this number per jiffy. If 0, never start 
> polling.");
> +
> +static int poll_stop_idle = 3*HZ; /* 3 seconds */
> +module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
> +MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue 
> after this many jiffies of no work.");
>  

I'm not sure jiffies are good enough here, since the user needs to know the
HZ value. It may be worth looking at sk_busy_loop(), which uses sched_clock()
and microseconds.
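
To illustrate the point about units, here is a minimal userspace sketch of an
idle timeout expressed in microseconds rather than jiffies, so the tunable
means the same thing regardless of HZ. It is not taken from the patch or from
sk_busy_loop(): the names poll_stop_idle_us, now_us() and poll_idle_expired()
are hypothetical, and clock_gettime(CLOCK_MONOTONIC) only stands in for
whatever clock the kernel code would use.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical tunable: stop polling after this many microseconds idle. */
static uint64_t poll_stop_idle_us = 3000000; /* 3 seconds */

/* Monotonic time in microseconds (userspace stand-in for a kernel clock). */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* True once the queue has been idle for at least poll_stop_idle_us. */
static bool poll_idle_expired(uint64_t last_work_us, uint64_t now)
{
	/* Unsigned subtraction of a monotonic clock: no HZ dependency. */
	return now - last_work_us >= poll_stop_idle_us;
}
```

A caller would record now_us() whenever polling finds work and test
poll_idle_expired() on each empty poll.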
>  enum {
>         VHOST_MEMORY_MAX_NREGIONS = 64,
> @@ -58,27 +66,27 @@ static int vhost_poll_wakeup(wait_queue_t *wait, 
> unsigned mode, int sync,
>         return 0;
>  }
>  
> -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
> +void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, 
> vhost_work_fn_t fn)
>  {
>         INIT_LIST_HEAD(&work->node);
>         work->fn = fn;
>         init_waitqueue_head(&work->done);
>         work->flushing = 0;
>         work->queue_seq = work->done_seq = 0;
> +       work->vq = vq;
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_init);
>  
>  /* Init poll structure */
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
> -                    unsigned long mask, struct vhost_dev *dev)
> +                    unsigned long mask, struct vhost_virtqueue *vq)
>  {
>         init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
>         init_poll_funcptr(&poll->table, vhost_poll_func);
>         poll->mask = mask;
> -       poll->dev = dev;
> +       poll->dev = vq->dev;
>         poll->wqh = NULL;
> -
> -       vhost_work_init(&poll->work, fn);
> +       vhost_work_init(&poll->work, vq, fn);
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_init);
>  
> @@ -174,6 +182,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
>  }
>  EXPORT_SYMBOL_GPL(vhost_poll_queue);
>  
> +/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
> + *
> + * Enabling this mode it tells the guest not to notify ("kick") us when 
> its
> + * has made more work available on this virtqueue; Rather, we will 
> continuously
> + * poll this virtqueue in the worker thread. If multiple virtqueues are 
> polled,
> + * the worker thread polls them all, e.g., in a round-robin fashion.
> + * Note that vqpoll.enabled doesn't always mean that this virtqueue is
> + * actually being polled: The backend (e.g., net.c) may temporarily 
> disable it
> + * using vhost_disable/enable_notify(), while vqpoll.enabled is 
> unchanged.
> + *
> + * It is assumed that these functions are called relatively rarely, when 
> vhost
> + * notices that this virtqueue's usage pattern significantly changed in a 
> way
> + * that makes polling more efficient than notification, or vice versa.
> + * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
> + * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
> + * reclaimed.
> + */
> +static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +       if (vq->vqpoll.enabled)
> +               return; /* already enabled, nothing to do */
> +       if (!vq->handle_kick)
> +               return; /* polling will be a waste of time if no callback! 
> */
> +       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
> +               /* vq has guest notifications enabled. Disable them,
> +                  and instead add vq to the polling list */
>
> +               list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);

This may work when there are at most two vqs in the list. But if you want
to poll many vqs in the future, it may take a long time for this vq to get
polled. So we could probably just leave used_flags untouched; then, if the
vq gets kicked, it can be served soon.
> +       }
> +       vq->vqpoll.jiffies_last_kick = jiffies;
> +       __get_user(vq->avail_idx, &vq->avail->idx); 
> +       vq->vqpoll.enabled = true;
> +
> +       /* Map userspace's vq->avail to the kernel's memory space. */
> +       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
> +               &vq->vqpoll.avail_page) != 1) {
> +               /* TODO: can this happen, as we check access
> +               to vq->avail in advance? */
> +               BUG();
> +       }
> +       vq->vqpoll.avail_mapped = (struct vring_avail *) (
> +               (unsigned long)kmap(vq->vqpoll.avail_page) |
> +               ((unsigned long)vq->avail & ~PAGE_MASK));

Is it necessary to map the avail page here?

Isn't it enough to do __get_user(vq->avail_idx, &vq->avail->idx) and
then compare it with vq->last_avail_idx in roundrobin_poll()?
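
The comparison suggested here can be sketched in plain userspace C. This is
only an illustration under assumptions: struct avail_ring and struct
poll_queue are simplified, hypothetical stand-ins for the vring and vhost
structures, and the guest-visible index is read directly instead of through
__get_user(). The point it shows is that the 16-bit index wraps, so new work
is detected by inequality (or by a modulo-65536 difference), not by "greater
than".

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the guest-written avail ring header. */
struct avail_ring {
	uint16_t flags;
	uint16_t idx;   /* incremented by the guest, wraps at 65536 */
};

/* Hypothetical per-queue host state. */
struct poll_queue {
	const volatile struct avail_ring *avail; /* guest-visible ring */
	uint16_t last_avail_idx;                 /* next entry the host consumes */
};

/* Entries published by the guest but not yet consumed (wrap-safe). */
static uint16_t pending_entries(const struct poll_queue *q)
{
	return (uint16_t)(q->avail->idx - q->last_avail_idx);
}

/* True if the guest has published anything the host has not seen yet. */
static bool has_new_work(const struct poll_queue *q)
{
	return q->avail->idx != q->last_avail_idx;
}
```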
> +}
> +
> +/*
> + * This function doesn't always succeed in changing the mode. Sometimes
> + * a temporary race condition prevents turning on guest notifications, so
> + * vq should be polled next time again.
> + */
> +static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
> +{
> +       if (!vq->vqpoll.enabled) {
> +               return; /* already disabled, nothing to do */
> +       }
> +       vq->vqpoll.enabled = false;
> +
> +       if (!list_empty(&vq->vqpoll.link)) {
> +               /* vq is on the polling list, remove it from this list and
> +                * instead enable guest notifications. */
> +               list_del_init(&vq->vqpoll.link);
> +               if (unlikely(vhost_enable_notify(vq->dev,vq))
> +                       && !vq->vqpoll.shutdown) {
> +                       /* Race condition: guest wrote before we enabled
> +                        * notification, so we'll never get a notification 
> for
> +                        * this work - so continue polling mode for a 
> while. */
> +                       vhost_disable_notify(vq->dev, vq);
> +                       vq->vqpoll.enabled = true;
> +                       vhost_enable_notify(vq->dev, vq);
> +                       return;
> +               }
> +       }
> +
> +       if (vq->vqpoll.avail_mapped) {
> +               kunmap(vq->vqpoll.avail_page);
> +               put_page(vq->vqpoll.avail_page);
> +               vq->vqpoll.avail_mapped = 0;
> +       }
> +}
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>                            struct vhost_virtqueue *vq)
>  {
> @@ -199,6 +287,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>         vq->call = NULL;
>         vq->log_ctx = NULL;
>         vq->memory = NULL;
> +       INIT_LIST_HEAD(&vq->vqpoll.link);
> +       vq->vqpoll.enabled = false;
> +       vq->vqpoll.shutdown = false;
> +       vq->vqpoll.avail_mapped = NULL;
> +}
> +
> +/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
> + * virtqueues which the caller should kick, or NULL in case none should 
> be
> + * kicked. roundrobin_poll() also disables polling on a virtqueue which 
> has
> + * been polled for too long without success.
> + *
> + * This current implementation (the "round-robin" implementation) only
> + * polls the first vq in the list, returning it or NULL as appropriate, 
> and
> + * moves this vq to the end of the list, so next time a different one is
> + * polled.
> + */
> +static struct vhost_virtqueue *roundrobin_poll(struct list_head *list) {
> +       struct vhost_virtqueue *vq;
> +       u16 avail_idx;
> +
> +
> +       if (list_empty(list))
> +               return NULL;
> +
> +       vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
> +       WARN_ON(!vq->vqpoll.enabled);
> +       list_move_tail(&vq->vqpoll.link, list);
> +
> +       /* See if there is any new work available from the guest. */
> +       /* TODO: can check the optional idx feature, and if we haven't
> +        * reached that idx yet, don't kick... */
> +       avail_idx = vq->vqpoll.avail_mapped->idx;
> +       if (avail_idx != vq->last_avail_idx) {
> +               return vq;
> +       }
> +       if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
> +               /* We've been polling this virtqueue for a long time with 
> no
> +                * results, so switch back to guest notification
> +                */
> +               vhost_vq_disable_vqpoll(vq);
> +       }
> +       return NULL;
>  }
>  
>  static int vhost_worker(void *data)
> @@ -237,12 +367,66 @@ static int vhost_worker(void *data)
>                 spin_unlock_irq(&dev->work_lock);
>  
>                 if (work) {
> +                       struct vhost_virtqueue *vq = work->vq;
>                         __set_current_state(TASK_RUNNING);
>                         work->fn(work);
> +                       /* Keep track of the work rate, for deciding when 
> to
> +                        * enable polling */
> +                       if (vq) {
> +                               if (vq->vqpoll.jiffies_last_work != 
> jiffies) {
> +                                       vq->vqpoll.jiffies_last_work = 
> jiffies;
> +                                       vq->vqpoll.work_this_jiffy = 0;
> +                               }
> +                               vq->vqpoll.work_this_jiffy++;
> +                       }
> +                       /* If vq is in the round-robin list of virtqueues 
> being
> +                        * constantly checked by this thread, move vq the 
> end
> +                        * of the queue, because it had its fair chance 
> now.
> +                        */
> +                       if (vq && !list_empty(&vq->vqpoll.link)) {
> +                               list_move_tail(&vq->vqpoll.link,
> +                                       &dev->vqpoll_list);
> +                       }
> +                       /* Otherwise, if this vq is looking for 
> notifications
> +                        * but vq polling is not enabled for it, do it 
> now.
> +                        */
> +                       else if (poll_start_rate && vq && vq->handle_kick 
> &&
> +                               !vq->vqpoll.enabled &&
> +                               !vq->vqpoll.shutdown &&
> +                               !(vq->used_flags & VRING_USED_F_NO_NOTIFY) 
> &&
> +                               vq->vqpoll.work_this_jiffy >=
> +                                       poll_start_rate) {
> +                               vhost_vq_enable_vqpoll(vq);
> +                       }
> +               }
> +               /* Check one virtqueue from the round-robin list */
> +               if (!list_empty(&dev->vqpoll_list)) {

If there is still other work in work_list, we may want to serve it first.
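
A rough userspace sketch of that ordering, under stated assumptions: struct
work, struct worker, drain_work() and worker_iteration() are hypothetical
names, not the vhost ones, and locking is omitted. Each loop iteration runs
all explicitly queued work before it polls a single queue. The demo_*
helpers exist only to make the sketch self-contained.

```c
#include <stddef.h>

struct work {
	struct work *next;
	void (*fn)(struct work *w);
};

struct worker {
	struct work *work_head;            /* pending explicit work items */
	int (*poll_one)(struct worker *w); /* polls one queue; 1 if it found work */
};

/* Run every queued work item in order; return how many ran. */
static int drain_work(struct worker *w)
{
	int n = 0;

	while (w->work_head) {
		struct work *item = w->work_head;

		w->work_head = item->next;
		item->next = NULL;
		item->fn(item);
		n++;
	}
	return n;
}

/* One loop iteration: explicit work first, then a single poll. */
static int worker_iteration(struct worker *w)
{
	int ran = drain_work(w);

	if (w->poll_one)
		ran += w->poll_one(w);
	return ran;
}

/* Demo helpers (illustration only). */
static int demo_ran;
static void demo_fn(struct work *w) { (void)w; demo_ran++; }
static int demo_poll(struct worker *w) { (void)w; return 1; }
```

One trade-off of this ordering is that a steady stream of queued work delays
polling, so a real implementation might bound how much work it drains per
iteration.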
[...]
>  struct vhost_dev {
> @@ -123,6 +151,7 @@ struct vhost_dev {
>         spinlock_t work_lock;
>         struct list_head work_list;
>         struct task_struct *worker;
> +        struct list_head vqpoll_list;
>  };
>  
>  void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int 
> nvqs);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] vhost: Add polling mode
@ 2014-07-21 13:23 Razya Ladelsky
  2014-07-23  5:26 ` Jason Wang
  2014-07-29  8:06 ` Michael S. Tsirkin
  0 siblings, 2 replies; 55+ messages in thread
From: Razya Ladelsky @ 2014-07-21 13:23 UTC (permalink / raw)
  To: kvm, Michael S. Tsirkin
  Cc: abel.gordon, Joel Nider, Yossi Kuperman1, Eran Raichstein, Alex Glikson

Hello All,

When vhost is waiting for buffers from the guest driver (e.g., more 
packets
to send in vhost-net's transmit queue), it normally goes to sleep and 
waits
for the guest to "kick" it. This kick involves a PIO in the guest, and
therefore an exit (and possibly userspace involvement in translating this 
PIO
exit into a file descriptor event), all of which hurts performance.

If the system is under-utilized (has cpu time to spare), vhost can 
continuously poll the virtqueues for new buffers, and avoid asking 
the guest to kick us.
This patch adds an optional polling mode to vhost, that can be enabled 
via a kernel module parameter, "poll_start_rate".

When polling is active for a virtqueue, the guest is asked to
disable notification (kicks), and the worker thread continuously checks 
for
new buffers. When it does discover new buffers, it simulates a "kick" by
invoking the underlying backend driver (such as vhost-net), which thinks 
it
got a real kick from the guest, and acts accordingly. If the underlying
driver asks not to be kicked, we disable polling on this virtqueue.

We start polling on a virtqueue when we notice it has
work to do. Polling on this virtqueue is later disabled after 3 seconds of
polling turning up no new work, as in this case we are better off 
returning
to the exit-based notification mechanism. The default timeout of 3 seconds
can be changed with the "poll_stop_idle" kernel module parameter.
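
The start/stop policy described above can be sketched as a small userspace
model. This is an illustration only, not the patch code: struct
vq_poll_state, note_work() and note_idle_poll() are hypothetical names, time
is an abstract tick counter passed in by the caller, and the sketch assumes
the polling path refreshes last_kick_tick whenever it finds work.

```c
#include <stdbool.h>
#include <stdint.h>

struct vq_poll_state {
	bool enabled;              /* is this queue currently being polled? */
	uint64_t last_work_tick;   /* tick of the most recent work item */
	unsigned work_this_tick;   /* work items seen during last_work_tick */
	uint64_t last_kick_tick;   /* tick when polling last found work */
};

/* Record one handled work item; start polling once the per-tick rate
 * reaches start_rate. A start_rate of 0 means "never start polling". */
static void note_work(struct vq_poll_state *s, uint64_t now,
		      unsigned start_rate)
{
	if (s->last_work_tick != now) {
		s->last_work_tick = now;
		s->work_this_tick = 0;
	}
	s->work_this_tick++;
	if (start_rate && !s->enabled && s->work_this_tick >= start_rate) {
		s->enabled = true;
		s->last_kick_tick = now;
	}
}

/* Called when a poll finds nothing: fall back to notifications after
 * more than stop_idle ticks without work. */
static void note_idle_poll(struct vq_poll_state *s, uint64_t now,
			   uint64_t stop_idle)
{
	if (s->enabled && now - s->last_kick_tick > stop_idle)
		s->enabled = false;
}
```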

This polling approach makes a lot of sense for new HW with posted-interrupts
for which we have exitless host-to-guest notifications. But even with 
support 
for posted interrupts, guest-to-host communication still causes exits. 
Polling adds the missing part.

When systems are overloaded, there won't be enough cpu time for the 
various 
vhost threads to poll their guests' devices. For these scenarios, we plan 
to add support for vhost threads that can be shared by multiple devices, 
even of multiple vms. 
Our ultimate goal is to implement the I/O acceleration features described 
in:
KVM Forum 2013: Efficient and Scalable Virtio (by Abel Gordon) 
https://www.youtube.com/watch?v=9EyweibHfEs
and
https://www.mail-archive.com/kvm@vger.kernel.org/msg98179.html

 
Comments are welcome, 
Thank you,
Razya

From: Razya Ladelsky <razya@il.ibm.com>

Add an optional polling mode to continuously poll the virtqueues
for new buffers, and avoid asking the guest to kick us.

Signed-off-by: Razya Ladelsky <razya@il.ibm.com>
---
 drivers/vhost/net.c   |    6 +-
 drivers/vhost/scsi.c  |    5 +-
 drivers/vhost/vhost.c |  247 +++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/vhost/vhost.h |   37 +++++++-
 4 files changed, 277 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 971a760..558aecb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -742,8 +742,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
        }
        vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);
 
-       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
+       vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+                       vqs[VHOST_NET_VQ_TX]);
+       vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+                       vqs[VHOST_NET_VQ_RX]);
 
        f->private_data = n;
 
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4f4ffa4..56f0233 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1528,9 +1528,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
        if (!vqs)
                goto err_vqs;
 
-       vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
-       vhost_work_init(&vs->vs_event_work, tcm_vhost_evt_work);
-
+       vhost_work_init(&vs->vs_completion_work, NULL, vhost_scsi_complete_cmd_work);
+       vhost_work_init(&vs->vs_event_work, NULL, tcm_vhost_evt_work);
        vs->vs_events_nr = 0;
        vs->vs_events_missed = false;
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c90f437..678d766 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -24,9 +24,17 @@
 #include <linux/slab.h>
 #include <linux/kthread.h>
 #include <linux/cgroup.h>
+#include <linux/jiffies.h>
 #include <linux/module.h>
 
 #include "vhost.h"
+static int poll_start_rate = 0;
+module_param(poll_start_rate, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_start_rate, "Start continuous polling of virtqueue when rate of events is at least this number per jiffy. If 0, never start polling.");
+
+static int poll_stop_idle = 3*HZ; /* 3 seconds */
+module_param(poll_stop_idle, int, S_IRUGO|S_IWUSR);
+MODULE_PARM_DESC(poll_stop_idle, "Stop continuous polling of virtqueue after this many jiffies of no work.");
 
 enum {
        VHOST_MEMORY_MAX_NREGIONS = 64,
@@ -58,27 +66,27 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
        return 0;
 }
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn)
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, vhost_work_fn_t fn)
 {
        INIT_LIST_HEAD(&work->node);
        work->fn = fn;
        init_waitqueue_head(&work->done);
        work->flushing = 0;
        work->queue_seq = work->done_seq = 0;
+       work->vq = vq;
 }
 EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-                    unsigned long mask, struct vhost_dev *dev)
+                    unsigned long mask, struct vhost_virtqueue *vq)
 {
        init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
        init_poll_funcptr(&poll->table, vhost_poll_func);
        poll->mask = mask;
-       poll->dev = dev;
+       poll->dev = vq->dev;
        poll->wqh = NULL;
-
-       vhost_work_init(&poll->work, fn);
+       vhost_work_init(&poll->work, vq, fn);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
@@ -174,6 +182,86 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+/* Enable or disable virtqueue polling (vqpoll.enabled) for a virtqueue.
+ *
+ * Enabling this mode tells the guest not to notify ("kick") us when it
+ * has made more work available on this virtqueue; rather, we will continuously
+ * poll this virtqueue in the worker thread. If multiple virtqueues are polled,
+ * the worker thread polls them all, e.g., in a round-robin fashion.
+ * Note that vqpoll.enabled doesn't always mean that this virtqueue is
+ * actually being polled: The backend (e.g., net.c) may temporarily disable it
+ * using vhost_disable/enable_notify(), while vqpoll.enabled is unchanged.
+ *
+ * It is assumed that these functions are called relatively rarely, when vhost
+ * notices that this virtqueue's usage pattern significantly changed in a way
+ * that makes polling more efficient than notification, or vice versa.
+ * Also, we assume that vhost_vq_disable_vqpoll() is always called on vq
+ * cleanup, so any allocations done by vhost_vq_enable_vqpoll() can be
+ * reclaimed.
+ */
+static void vhost_vq_enable_vqpoll(struct vhost_virtqueue *vq)
+{
+       if (vq->vqpoll.enabled)
+               return; /* already enabled, nothing to do */
+       if (!vq->handle_kick)
+               return; /* polling will be a waste of time if no callback! */
+       if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY)) {
+               /* vq has guest notifications enabled. Disable them,
+                  and instead add vq to the polling list */
+               vhost_disable_notify(vq->dev, vq);
+               list_add_tail(&vq->vqpoll.link, &vq->dev->vqpoll_list);
+       }
+       vq->vqpoll.jiffies_last_kick = jiffies;
+       __get_user(vq->avail_idx, &vq->avail->idx); 
+       vq->vqpoll.enabled = true;
+
+       /* Map userspace's vq->avail to the kernel's memory space. */
+       if (get_user_pages_fast((unsigned long)vq->avail, 1, 0,
+               &vq->vqpoll.avail_page) != 1) {
+               /* TODO: can this happen, as we check access
+               to vq->avail in advance? */
+               BUG();
+       }
+       vq->vqpoll.avail_mapped = (struct vring_avail *) (
+               (unsigned long)kmap(vq->vqpoll.avail_page) |
+               ((unsigned long)vq->avail & ~PAGE_MASK));
+}
+
+/*
+ * This function doesn't always succeed in changing the mode. Sometimes
+ * a temporary race condition prevents turning on guest notifications, so
+ * vq should be polled next time again.
+ */
+static void vhost_vq_disable_vqpoll(struct vhost_virtqueue *vq)
+{
+       if (!vq->vqpoll.enabled) {
+               return; /* already disabled, nothing to do */
+       }
+       vq->vqpoll.enabled = false;
+
+       if (!list_empty(&vq->vqpoll.link)) {
+               /* vq is on the polling list, remove it from this list and
+                * instead enable guest notifications. */
+               list_del_init(&vq->vqpoll.link);
+               if (unlikely(vhost_enable_notify(vq->dev,vq))
+                       && !vq->vqpoll.shutdown) {
+                       /* Race condition: guest wrote before we enabled
+                        * notification, so we'll never get a notification for
+                        * this work - so continue polling mode for a while. */
+                       vhost_disable_notify(vq->dev, vq);
+                       vq->vqpoll.enabled = true;
+                       vhost_enable_notify(vq->dev, vq);
+                       return;
+               }
+       }
+
+       if (vq->vqpoll.avail_mapped) {
+               kunmap(vq->vqpoll.avail_page);
+               put_page(vq->vqpoll.avail_page);
+               vq->vqpoll.avail_mapped = 0;
+       }
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
                           struct vhost_virtqueue *vq)
 {
@@ -199,6 +287,48 @@ static void vhost_vq_reset(struct vhost_dev *dev,
        vq->call = NULL;
        vq->log_ctx = NULL;
        vq->memory = NULL;
+       INIT_LIST_HEAD(&vq->vqpoll.link);
+       vq->vqpoll.enabled = false;
+       vq->vqpoll.shutdown = false;
+       vq->vqpoll.avail_mapped = NULL;
+}
+
+/* roundrobin_poll() takes worker->vqpoll_list, and returns one of the
+ * virtqueues which the caller should kick, or NULL in case none should be
+ * kicked. roundrobin_poll() also disables polling on a virtqueue which has
+ * been polled for too long without success.
+ *
+ * This current implementation (the "round-robin" implementation) only
+ * polls the first vq in the list, returning it or NULL as appropriate, and
+ * moves this vq to the end of the list, so next time a different one is
+ * polled.
+ */
+static struct vhost_virtqueue *roundrobin_poll(struct list_head *list) {
+       struct vhost_virtqueue *vq;
+       u16 avail_idx;
+
+
+       if (list_empty(list))
+               return NULL;
+
+       vq = list_first_entry(list, struct vhost_virtqueue, vqpoll.link);
+       WARN_ON(!vq->vqpoll.enabled);
+       list_move_tail(&vq->vqpoll.link, list);
+
+       /* See if there is any new work available from the guest. */
+       /* TODO: can check the optional idx feature, and if we haven't
+        * reached that idx yet, don't kick... */
+       avail_idx = vq->vqpoll.avail_mapped->idx;
+       if (avail_idx != vq->last_avail_idx) {
+               return vq;
+       }
+       if (jiffies > vq->vqpoll.jiffies_last_kick + poll_stop_idle) {
+               /* We've been polling this virtqueue for a long time with no
+                * results, so switch back to guest notification
+                */
+               vhost_vq_disable_vqpoll(vq);
+       }
+       return NULL;
 }
 
 static int vhost_worker(void *data)
@@ -237,12 +367,66 @@ static int vhost_worker(void *data)
                spin_unlock_irq(&dev->work_lock);
 
                if (work) {
+                       struct vhost_virtqueue *vq = work->vq;
                        __set_current_state(TASK_RUNNING);
                        work->fn(work);
+                       /* Keep track of the work rate, for deciding when to
+                        * enable polling */
+                       if (vq) {
+                               if (vq->vqpoll.jiffies_last_work != jiffies) {
+                                       vq->vqpoll.jiffies_last_work = jiffies;
+                                       vq->vqpoll.work_this_jiffy = 0;
+                               }
+                               vq->vqpoll.work_this_jiffy++;
+                       }
+                       /* If vq is in the round-robin list of virtqueues being
+                        * constantly checked by this thread, move vq to the end
+                        * of the queue, because it had its fair chance now.
+                        */
+                       if (vq && !list_empty(&vq->vqpoll.link)) {
+                               list_move_tail(&vq->vqpoll.link,
+                                       &dev->vqpoll_list);
+                       }
+                       /* Otherwise, if this vq is looking for notifications
+                        * but vq polling is not enabled for it, do it now.
+                        */
+                       else if (poll_start_rate && vq && vq->handle_kick &&
+                               !vq->vqpoll.enabled &&
+                               !vq->vqpoll.shutdown &&
+                               !(vq->used_flags & VRING_USED_F_NO_NOTIFY) &&
+                               vq->vqpoll.work_this_jiffy >=
+                                       poll_start_rate) {
+                               vhost_vq_enable_vqpoll(vq);
+                       }
+               }
+               /* Check one virtqueue from the round-robin list */
+               if (!list_empty(&dev->vqpoll_list)) {
+                       struct vhost_virtqueue *vq;
+
+                       vq = roundrobin_poll(&dev->vqpoll_list);
+
+                       if (vq) {
+                               vq->handle_kick(&vq->poll.work);
+                               vq->vqpoll.jiffies_last_kick=jiffies;
+                       }
+
+                       /* If our polling list isn't empty, ask to continue
+                        * running this thread, don't yield.
+                        */
+                       __set_current_state(TASK_RUNNING);
                        if (need_resched())
+                        schedule();
+
+               }
+               else {
+                       if (work)
+                       {
+                               if (need_resched())
+                                       schedule();
+                       }
+                       else
                                schedule();
-               } else
-                       schedule();
+               }
 
        }
        unuse_mm(dev->mm);
@@ -306,6 +490,7 @@ void vhost_dev_init(struct vhost_dev *dev,
        dev->mm = NULL;
        spin_lock_init(&dev->work_lock);
        INIT_LIST_HEAD(&dev->work_list);
+   INIT_LIST_HEAD(&dev->vqpoll_list);
        dev->worker = NULL;
 
        for (i = 0; i < dev->nvqs; ++i) {
@@ -318,7 +503,7 @@ void vhost_dev_init(struct vhost_dev *dev,
                vhost_vq_reset(dev, vq);
                if (vq->handle_kick)
                        vhost_poll_init(&vq->poll, vq->handle_kick,
-                                       POLLIN, dev);
+                                       POLLIN, vq);
        }
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
@@ -350,7 +535,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
        struct vhost_attach_cgroups_struct attach;
 
        attach.owner = current;
-       vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+       vhost_work_init(&attach.work, NULL, vhost_attach_cgroups_work);
        vhost_work_queue(dev, &attach.work);
        vhost_work_flush(dev, &attach.work);
        return attach.ret;
@@ -444,6 +629,25 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
+/* shutdown_vqpoll() asks the worker thread to shut down virtqueue polling
+ * mode for a given virtqueue which is itself being shut down. We ask the
+ * worker thread to do this rather than doing it directly, so that we don't
+ * race with the worker thread's use of the queue.
+ */
+static void shutdown_vqpoll_work(struct vhost_work *work)
+{
+       work->vq->vqpoll.shutdown = true;
+       vhost_vq_disable_vqpoll(work->vq);
+       WARN_ON(work->vq->vqpoll.avail_mapped);
+}
+
+static void shutdown_vqpoll(struct vhost_virtqueue *vq)
+{
+       struct vhost_work work;
+       vhost_work_init(&work, vq, shutdown_vqpoll_work);
+       vhost_work_queue(vq->dev, &work);
+       vhost_work_flush(vq->dev, &work);
+}
 /* Caller should have device mutex if and only if locked is set */
 void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 {
@@ -460,6 +664,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
                        eventfd_ctx_put(dev->vqs[i]->call_ctx);
                if (dev->vqs[i]->call)
                        fput(dev->vqs[i]->call);
+               shutdown_vqpoll(dev->vqs[i]);
                vhost_vq_reset(dev, dev->vqs[i]);
        }
        vhost_dev_free_iovecs(dev);
@@ -1491,6 +1696,19 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
        u16 avail_idx;
        int r;
 
+       /* In polling mode, when the backend (e.g., net.c) asks to enable
+        * notifications, we don't enable guest notifications. Instead, start
+        * polling on this vq by adding it to the round-robin list.
+        */
+       if (vq->vqpoll.enabled) {
+               if (list_empty(&vq->vqpoll.link)) {
+                       list_add_tail(&vq->vqpoll.link,
+                               &vq->dev->vqpoll_list);
+                       vq->vqpoll.jiffies_last_kick = jiffies;
+               }
+               return false;
+       }
+
        if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
                return false;
        vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
@@ -1528,6 +1746,17 @@ void vhost_disable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
        int r;
 
+       /* If this virtqueue is vqpoll.enabled, and on the polling list, it
+        * will generate notifications even if the guest is asked not to send
+        * them. So we must remove it from the round-robin polling list.
+        * Note that vqpoll.enabled remains set.
+        */
+       if (vq->vqpoll.enabled) {
+               if(!list_empty(&vq->vqpoll.link))
+                       list_del_init(&vq->vqpoll.link);
+               return;
+       }
+
        if (vq->used_flags & VRING_USED_F_NO_NOTIFY)
                return;
        vq->used_flags |= VRING_USED_F_NO_NOTIFY;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3eda654..feb16d6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -24,6 +24,7 @@ struct vhost_work {
        int                       flushing;
        unsigned                  queue_seq;
        unsigned                  done_seq;
+       struct vhost_virtqueue    *vq;
 };
 
 /* Poll a file (eventfd or socket) */
@@ -37,11 +38,11 @@ struct vhost_poll {
        struct vhost_dev         *dev;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
+void vhost_work_init(struct vhost_work *work, struct vhost_virtqueue *vq, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-                    unsigned long mask, struct vhost_dev *dev);
+                    unsigned long mask, struct vhost_virtqueue  *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -54,8 +55,6 @@ struct vhost_log {
        u64 len;
 };
 
-struct vhost_virtqueue;
-
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
        struct vhost_dev *dev;
@@ -110,6 +109,35 @@ struct vhost_virtqueue {
        /* Log write descriptors */
        void __user *log_base;
        struct vhost_log *log;
+   struct {
+      /* When a virtqueue is in vqpoll.enabled mode, it declares
+       * that instead of using guest notifications (kicks) to
+       * discover new work, we prefer to continuously poll this
+       * virtqueue in the worker thread.
+       * If !enabled, the rest of the fields below are undefined.
+       */
+      bool enabled;
+      /* vqpoll.enabled doesn't always mean that this virtqueue is
+       * actually being polled: The backend (e.g., net.c) may
+       * temporarily disable it using vhost_disable/enable_notify().
+       * vqpoll.link is used to maintain the thread's round-robin
+       * list of virtqueues that actually need to be polled.
+       * Note list_empty(link) means this virtqueue isn't polled.
+       */
+      struct list_head link;
+      /* If this flag is true, the virtqueue is being shut down,
+       * so vqpoll should not be re-enabled.
+       */
+      bool shutdown;
+      /* Various counters used to decide when to enter polling mode
+       * or leave it and return to notification mode.
+       */
+      unsigned long jiffies_last_kick;
+      unsigned long jiffies_last_work;
+      int work_this_jiffy;
+      struct page *avail_page;
+      volatile struct vring_avail *avail_mapped;
+   } vqpoll;
 };
 
 struct vhost_dev {
@@ -123,6 +151,7 @@ struct vhost_dev {
        spinlock_t work_lock;
        struct list_head work_list;
        struct task_struct *worker;
+        struct list_head vqpoll_list;
 };
 
 void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs, int nvqs);
-- 
1.7.9.5



end of thread, other threads:[~2016-09-04  8:45 UTC | newest]

Thread overview: 55+ messages
     [not found] <1407659404-razya@il.ibm.com>
2014-08-10  8:30 ` [PATCH] vhost: Add polling mode Razya Ladelsky
2014-08-10  8:30   ` Razya Ladelsky
2014-08-10  8:30 ` Razya Ladelsky
2014-08-10 19:45   ` Michael S. Tsirkin
2014-08-10 19:45     ` Michael S. Tsirkin
2014-08-11 19:46     ` David Miller
2014-08-11 19:46       ` David Miller
2014-08-12  9:18       ` Michael S. Tsirkin
2014-08-12  9:18         ` Michael S. Tsirkin
2014-08-12 10:57         ` Razya Ladelsky
2014-08-12 10:57           ` Razya Ladelsky
2014-08-13 12:15           ` Michael S. Tsirkin
2014-08-13 12:15             ` Michael S. Tsirkin
2014-08-17 12:35             ` Razya Ladelsky
2014-08-17 12:35               ` Razya Ladelsky
2014-08-17 12:58               ` Michael S. Tsirkin
2014-08-17 12:58                 ` Michael S. Tsirkin
2014-08-19  8:36                 ` Razya Ladelsky
2014-08-19  8:36                   ` Razya Ladelsky
2014-08-20 11:05                   ` Michael S. Tsirkin
2014-08-20 11:05                     ` Michael S. Tsirkin
2016-09-04  8:45     ` Razya Ladelsky
2016-09-04  8:45     ` Razya Ladelsky
2014-08-20  8:41   ` Christian Borntraeger
2014-08-20  8:41     ` Christian Borntraeger
2014-08-20 10:32     ` Michael S. Tsirkin
2014-08-20 10:32       ` Michael S. Tsirkin
2014-08-21 13:53     ` Razya Ladelsky
2014-08-21 13:53       ` Razya Ladelsky
2014-08-22  9:30       ` Zhang Haoyu
2014-08-22 10:01       ` Zhang Haoyu
2014-08-20 10:57   ` Michael S. Tsirkin
2014-08-20 10:57     ` Michael S. Tsirkin
2014-08-21 14:23     ` Razya Ladelsky
2014-08-21 14:23       ` Razya Ladelsky
2014-08-21 14:29       ` David Laight
2014-08-21 14:29         ` David Laight
2014-08-24 12:26         ` Razya Ladelsky
2014-08-24 12:26           ` Razya Ladelsky
2014-08-10  8:30 ` Razya Ladelsky
2014-08-10  8:30 ` Razya Ladelsky
2014-07-21 13:23 Razya Ladelsky
2014-07-23  5:26 ` Jason Wang
2014-07-23  8:12   ` Razya Ladelsky
2014-07-23  8:42     ` Jason Wang
2014-07-23  8:48       ` Abel Gordon
2014-07-24  5:57         ` Jason Wang
2014-07-29  1:30         ` Zhang Haoyu
2014-07-29  7:15           ` Razya Ladelsky
2014-07-29  8:06 ` Michael S. Tsirkin
2014-07-29 10:30   ` Razya Ladelsky
2014-07-29 10:44     ` Michael S. Tsirkin
2014-07-29 12:23       ` Razya Ladelsky
2014-07-29 12:40         ` Michael S. Tsirkin
2014-07-30  6:32           ` Razya Ladelsky
