* [PATCH 0/2][RFC] vhost: improve transmit rate with virtqueue polling
@ 2012-02-17 23:02 Anthony Liguori
  2012-02-17 23:02 ` [PATCH 1/2] vhost: allow multiple worker threads Anthony Liguori
  2012-02-17 23:02 ` [PATCH 2/2] vhost-net: add a spin_threshold parameter Anthony Liguori
  0 siblings, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2012-02-17 23:02 UTC (permalink / raw)
  To: netdev; +Cc: Michael Tsirkin

Hi,

We've been studying small packet performance under KVM.  Currently, this type of
workload performs fairly poorly in KVM compared to other hypervisors.

After a lot of study, we concluded that two factors are currently bottlenecking
performance.

1) vhost uses a single thread to read both the transmit and receive rings.  The
   code is clearly structured to support multiple threads but uses only a single
   thread today.  As these patches show, this has a significant effect on
   performance when running a two-VCPU Linux guest under KVM.

2) When dealing with a workload like multiple TCP_RR instances, we process
   packets off the transmit queue too quickly.  It seems to be rare to actually
   get significant batching.  vhost does not use a timer for TX mitigation,
   relying instead on scheduling latency to encourage batching.  But on an
   unloaded system, the vhost thread simply gets scheduled too quickly and the
   exit cost dominates the workload.

   In the second patch, we introduce a mechanism to poll the transmit ring for
   a short period of time.  While this series doesn't show it, in our testing
   with similar code, this can have a dramatic effect on throughput.

The second patch does manual tuning, but we have a third patch that uses a simple
adaptive algorithm (sketched below).  I'll follow up soon with that patch and
results for it.
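
To give a rough idea of the direction, an adaptive scheme could grow the spin
budget while spinning keeps paying off and shrink it when a spin window expires
empty.  This is an illustrative sketch only, not the actual third patch:

static void adjust_spin_threshold(struct vhost_net *net, bool found_work)
{
	/* Illustrative sketch: SPIN_THRESHOLD_MAX and the growth/decay
	 * factors are placeholders, not values from the real patch. */
	if (found_work)
		net->spin_threshold = min(net->spin_threshold * 2 + 1,
					  (size_t)SPIN_THRESHOLD_MAX);
	else
		net->spin_threshold /= 2;
}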

We're looking to get some feedback on the approach here.  I think the first
patch should be pretty non-controversial (other than the broadcast wake-up).


* [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-17 23:02 [PATCH 0/2][RFC] vhost: improve transmit rate with virtqueue polling Anthony Liguori
@ 2012-02-17 23:02 ` Anthony Liguori
  2012-02-19 14:41   ` Michael S. Tsirkin
  2012-02-17 23:02 ` [PATCH 2/2] vhost-net: add a spin_threshold parameter Anthony Liguori
  1 sibling, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2012-02-17 23:02 UTC (permalink / raw)
  To: netdev; +Cc: Michael Tsirkin, Anthony Liguori, Tom Lendacky, Cristian Viana

This patch allows vhost to have multiple worker threads for devices such as
virtio-net, which may have multiple virtqueues.

Since virtqueues are lockless ring queues, in an ideal world data is being
produced by the producer as fast as data is being consumed by the consumer.
These loops will continue to consume data until none is left.

vhost currently multiplexes the consumer side of the queue on a single thread
by attempting to read from the queue until everything is read or it cannot
process any more.  This means that activity on one queue may stall another queue.

This is exacerbated when using any form of polling to read from the queues (as
we'll introduce in the next patch).  Spawning a thread per virtqueue addresses
this.

The only problem with this patch right now is how the threads are woken up.
It's essentially a broadcast, and we have seen lock contention as a result.
We've tried some approaches to signal a single thread, but I'm not confident
that code is correct yet, so I'm only sending the broadcast version.
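
For reference, one shape a single-thread wakeup could take is to rotate
through the workers instead of waking them all.  Untested sketch, not part of
this patch; next_worker would be a new vhost_dev field protected by work_lock:

	spin_lock_irqsave(&dev->work_lock, flags);
	if (list_empty(&work->node)) {
		list_add_tail(&work->node, &dev->work_list);
		work->queue_seq++;
		/* wake a single worker, round-robin, instead of broadcasting */
		wake_up_process(dev->workers[dev->next_worker]);
		dev->next_worker = (dev->next_worker + 1) % dev->nworkers;
	}
	spin_unlock_irqrestore(&dev->work_lock, flags);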

Here are some performance results from this change.  There's a modest
improvement with TCP_STREAM, although with a fair bit of variability.

With RR, there are pretty significant improvements as the instance count goes up.

Test, Size, Instance, Baseline, Patch, Relative

TCP_RR
	Tx:256 Rx:256
		1	9,639.66	10,164.06	105.44%
		10	62,819.55	54,059.78	86.06%
		30	84,715.60	131,241.86	154.92%
		60	124,614.71	148,720.66	119.34%

UDP_RR
	Tx:256 Rx:256
		1	9,652.50	10,343.72	107.16%
		10	53,830.26	58,235.90	108.18%
		30	89,471.01	97,634.53	109.12%
		60	103,640.59	164,035.01	158.27%

TCP_STREAM
	Tx: 256
		1	2,622.63	2,610.71	99.55%
		4	4,928.02	4,812.05	97.65%

	Tx: 1024
		1	5,639.89	5,751.28	101.97%
		4	5,874.72	6,575.55	111.93%

	Tx: 4096
		1	6,257.42	7,655.22	122.34%
		4	5,370.78	6,044.83	112.55%

	Tx: 16384
		1	6,346.63	7,267.44	114.51%
		4	5,198.02	5,657.12	108.83%

TCP_MAERTS
	Rx: 256
		1	2,091.38	1,765.62	84.42%
		4	5,319.52	5,619.49	105.64%

	Rx: 1024
		1	7,030.66	7,593.61	108.01%
		4	9,040.53	7,275.84	80.48%

	Rx: 4096
		1	9,160.93	9,318.15	101.72%
		4	9,372.49	8,875.63	94.70%

	Rx: 16384
		1	9,183.28	9,134.02	99.46%
		4	9,377.17	8,877.52	94.67%

						Overall:	106.46%

Cc: Tom Lendacky <toml@us.ibm.com>
Cc: Cristian Viana <vianac@br.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
 drivers/vhost/net.c   |    6 ++++-
 drivers/vhost/vhost.c |   51 ++++++++++++++++++++++++++++++------------------
 drivers/vhost/vhost.h |    8 +++++-
 3 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9dab1f5..47175cd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -33,6 +33,10 @@ static int experimental_zcopytx;
 module_param(experimental_zcopytx, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Experimental Zero Copy TX");
 
+static int workers = 2;
+module_param(workers, int, 0444);
+MODULE_PARM_DESC(workers, "Set the number of worker threads");
+
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
@@ -504,7 +508,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
+	r = vhost_dev_init(dev, n->vqs, workers, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c14c42b..676cead 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -144,9 +144,12 @@ static inline void vhost_work_queue(struct vhost_dev *dev,
 
 	spin_lock_irqsave(&dev->work_lock, flags);
 	if (list_empty(&work->node)) {
+		int i;
+
 		list_add_tail(&work->node, &dev->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		for (i = 0; i < dev->nworkers; i++)
+			wake_up_process(dev->workers[i]);
 	}
 	spin_unlock_irqrestore(&dev->work_lock, flags);
 }
@@ -287,7 +290,7 @@ static void vhost_dev_free_iovecs(struct vhost_dev *dev)
 }
 
 long vhost_dev_init(struct vhost_dev *dev,
-		    struct vhost_virtqueue *vqs, int nvqs)
+		    struct vhost_virtqueue *vqs, int nworkers, int nvqs)
 {
 	int i;
 
@@ -300,7 +303,9 @@ long vhost_dev_init(struct vhost_dev *dev,
 	dev->mm = NULL;
 	spin_lock_init(&dev->work_lock);
 	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
+	dev->nworkers = min(nworkers, VHOST_MAX_WORKERS);
+	for (i = 0; i < dev->nworkers; i++)
+		dev->workers[i] = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		dev->vqs[i].log = NULL;
@@ -354,7 +359,7 @@ static int vhost_attach_cgroups(struct vhost_dev *dev)
 static long vhost_dev_set_owner(struct vhost_dev *dev)
 {
 	struct task_struct *worker;
-	int err;
+	int err, i;
 
 	/* Is there an owner already? */
 	if (dev->mm) {
@@ -364,28 +369,34 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
 
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
-	}
+	for (i = 0; i < dev->nworkers; i++) {
+		worker = kthread_create(vhost_worker, dev, "vhost-%d.%d", current->pid, i);
+		if (IS_ERR(worker)) {
+			err = PTR_ERR(worker);
+			goto err_worker;
+		}
 
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
+		dev->workers[i] = worker;
+		wake_up_process(worker);	/* avoid contributing to loadavg */
+	}
 
 	err = vhost_attach_cgroups(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_worker;
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_worker;
 
 	return 0;
-err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
+
 err_worker:
+	for (i = 0; i < dev->nworkers; i++) {
+		if (dev->workers[i]) {
+			kthread_stop(dev->workers[i]);
+			dev->workers[i] = NULL;
+		}
+	}
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -475,9 +486,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 					lockdep_is_held(&dev->mutex)));
 	RCU_INIT_POINTER(dev->memory, NULL);
 	WARN_ON(!list_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
+	for (i = 0; i < dev->nworkers; i++) {
+		if (dev->workers[i]) {
+			kthread_stop(dev->workers[i]);
+			dev->workers[i] = NULL;
+		}
 	}
 	if (dev->mm)
 		mmput(dev->mm);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a801e28..fa5e75e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -18,6 +18,9 @@
 #define VHOST_DMA_DONE_LEN	1
 #define VHOST_DMA_CLEAR_LEN	0
 
+/* Maximum number of worker threads per-vhost device */
+#define VHOST_MAX_WORKERS	8
+
 struct vhost_device;
 
 struct vhost_work;
@@ -157,10 +160,11 @@ struct vhost_dev {
 	struct eventfd_ctx *log_ctx;
 	spinlock_t work_lock;
 	struct list_head work_list;
-	struct task_struct *worker;
+	int nworkers;
+	struct task_struct *workers[VHOST_MAX_WORKERS];
 };
 
-long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nworkers, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *);
-- 
1.7.4.1


* [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-17 23:02 [PATCH 0/2][RFC] vhost: improve transmit rate with virtqueue polling Anthony Liguori
  2012-02-17 23:02 ` [PATCH 1/2] vhost: allow multiple worker threads Anthony Liguori
@ 2012-02-17 23:02 ` Anthony Liguori
  2012-02-19 14:51   ` Michael S. Tsirkin
  2012-03-12  8:12   ` Dor Laor
  1 sibling, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2012-02-17 23:02 UTC (permalink / raw)
  To: netdev; +Cc: Michael Tsirkin, Anthony Liguori, Tom Lendacky, Cristian Viana

With workloads that are dominated by very high rates of small packets, we see
considerable overhead in virtio notifications.

The best strategy we've been able to come up with to deal with this is adaptive
polling.  This patch simply adds the infrastructure needed to experiment with
polling strategies.  It is not meant for inclusion.
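
For anyone who wants to experiment, both knobs added by this series are
vhost_net module parameters, so a test configuration can be loaded with
something like:

	modprobe vhost_net workers=2 spin_threshold=1000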

Here are the results with various polling values.  The spinning is not currently
a net win due to the high mutex contention caused by the broadcast wakeup.  With
a patch attempting to signal a single thread, we see up to 170,000+ transactions
per second with 60-instance TCP_RR.

N  Baseline	Spin 0		Spin 1000	Spin 5000	Relative (Spin 5000)

TCP_RR

1  9,639.66	10,164.06	9,825.43	9,827.45	101.95%
10 62,819.55	54,059.78	63,114.30	60,767.23	96.73%
30 84,715.60	131,241.86	120,922.38	89,776.39	105.97%
60 124,614.71	148,720.66	158,678.08	141,400.05	113.47%

UDP_RR

1  9,652.50	10,343.72	9,493.95	9,569.54	99.14%
10 53,830.26	58,235.90	50,145.29	48,820.53	90.69%
30 89,471.01	97,634.53	95,108.34	91,263.65	102.00%
60 103,640.59	164,035.01	157,002.22	128,646.73	124.13%

TCP_STREAM
	Tx: 256
1  2,622.63	2,610.71	2,688.49	2,678.61	102.13%
4  4,928.02	4,812.05	4,971.00	5,104.57	103.58%

	Tx: 1024
1  5,639.89	5,751.28	5,819.81	5,593.62	99.18%
4  5,874.72	6,575.55	6,324.87	6,502.33	110.68%

	Tx: 4096
1  6,257.42	7,655.22	7,610.52	7,424.74	118.65%
4  5,370.78	6,044.83	5,784.23	6,209.93	115.62%

	Tx: 16384
1  6,346.63	7,267.44	7,567.39	7,677.93	120.98%
4  5,198.02	5,657.12	5,528.94	5,792.42	111.44%

TCP_MAERTS
	Rx: 256
1  2,091.38	1,765.62	2,142.56	2,312.94	110.59%
4  5,319.52	5,619.49	5,544.50	5,645.81	106.13%

	Rx: 1024
1  7,030.66	7,593.61	7,575.67	7,622.07	108.41%
4  9,040.53	7,275.84	7,322.07	6,681.34	73.90%

	Rx: 4096
1  9,160.93	9,318.15	9,065.82	8,586.82	93.73%
4  9,372.49	8,875.63	8,959.03	9,056.07	96.62%

	Rx: 16384
1  9,183.28	9,134.02	8,945.12	8,657.72	94.28%
4  9,377.17	8,877.52	8,959.54	9,071.53	96.74%

Cc: Tom Lendacky <toml@us.ibm.com>
Cc: Cristian Viana <vianac@br.ibm.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
 drivers/vhost/net.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 47175cd..e9e5866 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -37,6 +37,10 @@ static int workers = 2;
 module_param(workers, int, 0444);
 MODULE_PARM_DESC(workers, "Set the number of worker threads");
 
+static ulong spin_threshold = 0;
+module_param(spin_threshold, ulong, 0444);
+MODULE_PARM_DESC(spin_threshold, "The polling threshold for the tx queue");
+
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
@@ -65,6 +69,7 @@ struct vhost_net {
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
 	enum vhost_net_poll_state tx_poll_state;
+	size_t spin_threshold;
 };
 
 static bool vhost_sock_zcopy(struct socket *sock)
@@ -149,6 +154,7 @@ static void handle_tx(struct vhost_net *net)
 	size_t hdr_size;
 	struct socket *sock;
 	struct vhost_ubuf_ref *uninitialized_var(ubufs);
+	size_t spin_count;
 	bool zcopy;
 
 	/* TODO: check that we are running from vhost_worker? */
@@ -172,6 +178,7 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = vq->vhost_hlen;
 	zcopy = vhost_sock_zcopy(sock);
 
+	spin_count = 0;
 	for (;;) {
 		/* Release DMAs done buffers first */
 		if (zcopy)
@@ -205,9 +212,15 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
+			if (spin_count < net->spin_threshold) {
+				spin_count++;
+				continue;
+			}
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
+			} else {
+				spin_count = 0;
 			}
 			break;
 		}
@@ -506,6 +519,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		return -ENOMEM;
 
 	dev = &n->dev;
+	n->spin_threshold = spin_threshold;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
 	r = vhost_dev_init(dev, n->vqs, workers, VHOST_NET_VQ_MAX);
-- 
1.7.4.1


* Re: [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-17 23:02 ` [PATCH 1/2] vhost: allow multiple worker threads Anthony Liguori
@ 2012-02-19 14:41   ` Michael S. Tsirkin
  2012-02-20 15:50     ` Tom Lendacky
  2012-02-21  4:51     ` Jason Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-02-19 14:41 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: netdev, Tom Lendacky, Cristian Viana

On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
> This patch allows vhost to have multiple worker threads for devices such as
> virtio-net, which may have multiple virtqueues.
> 
> Since virtqueues are lockless ring queues, in an ideal world data is being
> produced by the producer as fast as data is being consumed by the consumer.
> These loops will continue to consume data until none is left.
> 
> vhost currently multiplexes the consumer side of the queue on a single thread
> by attempting to read from the queue until everything is read or it cannot
> process any more.  This means that activity on one queue may stall another queue.

There's actually an attempt to address this: look up
VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
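
(For reference, the check in question looks roughly like this in the current
handle_tx():

		total_len += len;
		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
			vhost_poll_queue(&vq->poll);
			break;
		}

i.e. the handler requeues itself once a batch exceeds the weight.)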

> This is exacerbated when using any form of polling to read from the queues (as
> we'll introduce in the next patch).  Spawning a thread per virtqueue addresses
> this.
> 
> The only problem with this patch right now is how the threads are woken up.
> It's essentially a broadcast, and we have seen lock contention as a result.

On which lock?

> We've tried some approaches to signal a single thread, but I'm not
> confident that code is correct yet, so I'm only sending the broadcast
> version.

Yes, that looks like an obvious question.

> Here are some performance results from this change.  There's a modest
> improvement with TCP_STREAM, although with a fair bit of variability.
> 
> With RR, there are pretty significant improvements as the instance count goes up.

Interesting.  This was actually tested at one time and we saw a
significant performance improvement from using a single thread,
especially with a single stream in the guest.  Profiling indicated
that with separate threads we get too many context switches between
TX and RX, since guest networking tends to run TX and RX processing
on the same guest VCPU.

Maybe we were wrong or maybe this went away for some reason.  I'll
see if this can be reproduced.

> [...]

One point missing here is the ratio of
throughput to host CPU utilization.

The concern being to avoid hurting
heavily loaded systems.

These numbers also tend to be less variable as
they depend less on host scheduler decisions.
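
(netperf can report this directly: the -c/-C options add local and remote
CPU utilization and service demand to the output, e.g.

	netperf -H $GUEST_IP -t TCP_RR -c -C -- -r 256,256

where $GUEST_IP is a placeholder for the guest's address.)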


> [...]


* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-17 23:02 ` [PATCH 2/2] vhost-net: add a spin_threshold parameter Anthony Liguori
@ 2012-02-19 14:51   ` Michael S. Tsirkin
  2012-02-21  1:35     ` Shirley Ma
  2012-03-12  8:12   ` Dor Laor
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-02-19 14:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: netdev, Tom Lendacky, Cristian Viana

On Fri, Feb 17, 2012 at 05:02:06PM -0600, Anthony Liguori wrote:
> With workloads that are dominated by very high rates of small packets, we see
> considerable overhead in virtio notifications.
> 
> The best strategy we've been able to come up with to deal with this is adaptive
> polling.  This patch simply adds the infrastructure needed to experiment with
> polling strategies.  It is not meant for inclusion.
> 
> Here are the results with various polling values.  The spinning is not currently
> a net win due to the high mutex contention caused by the broadcast wakeup.  With
> a patch attempting to signal a single thread, we see up to 170,000+ transactions
> per second with 60-instance TCP_RR.
> 
> [...]

An obvious question would be how the BW divided by CPU
numbers are affected.

> [...]


* Re: [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-19 14:41   ` Michael S. Tsirkin
@ 2012-02-20 15:50     ` Tom Lendacky
  2012-02-20 19:27       ` Michael S. Tsirkin
  2012-02-21  4:51     ` Jason Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Tom Lendacky @ 2012-02-20 15:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Anthony Liguori, netdev, Cristian Viana

"Michael S. Tsirkin" <mst@redhat.com> wrote on 02/19/2012 08:41:45 AM:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: Anthony Liguori/Austin/IBM@IBMUS
> Cc: netdev@vger.kernel.org, Tom Lendacky/Austin/IBM@IBMUS, Cristian
> Viana <vianac@br.ibm.com>
> Date: 02/19/2012 08:42 AM
> Subject: Re: [PATCH 1/2] vhost: allow multiple worker threads
>
> On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
> > This patch allows vhost to have multiple worker threads for devices
> > such as virtio-net, which may have multiple virtqueues.
> >
> > Since virtqueues are lockless ring queues, in an ideal world data is
> > being produced by the producer as fast as data is being consumed by
> > the consumer.  These loops will continue to consume data until none
> > is left.
> >
> > vhost currently multiplexes the consumer side of the queue on a
> > single thread by attempting to read from the queue until everything
> > is read or it cannot process any more.  This means that activity on
> > one queue may stall another queue.
>
> There's actually an attempt to address this: look up
> VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
>
> > This is exacerbated when using any form of polling to read from
> > the queues (as we'll introduce in the next patch).  Spawning a
> > thread per virtqueue addresses this.
> >
> > The only problem with this patch right now is how the threads are
> > woken up.  It's essentially a broadcast, and we have seen lock
> > contention as a result.
>
> On which lock?

The mutex lock in the vhost_virtqueue struct.  This really shows up when
running with patch 2/2 and increasing the spin_threshold.  Both threads wake
up and try to acquire the mutex.  As the spin_threshold increases, you end up
with one of the threads getting blocked for a longer and longer time and
unable to do any RX processing that might be needed.
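
(The lock in question is taken at the top of each handler in
drivers/vhost/net.c; handle_tx(), for example, does roughly

	mutex_lock(&vq->mutex);
	/* ... drain the ring, optionally spinning ... */
	mutex_unlock(&vq->mutex);

so every worker woken by the broadcast serializes on the same per-vq mutex
before it can even look for work.)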

Tom

> [...]


* Re: [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-20 15:50     ` Tom Lendacky
@ 2012-02-20 19:27       ` Michael S. Tsirkin
  2012-02-20 19:46         ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-02-20 19:27 UTC (permalink / raw)
  To: Tom Lendacky; +Cc: Anthony Liguori, netdev, Cristian Viana

On Mon, Feb 20, 2012 at 09:50:37AM -0600, Tom Lendacky wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 02/19/2012 08:41:45 AM:
> 
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > To: Anthony Liguori/Austin/IBM@IBMUS
> > Cc: netdev@vger.kernel.org, Tom Lendacky/Austin/IBM@IBMUS, Cristian
> > Viana <vianac@br.ibm.com>
> > Date: 02/19/2012 08:42 AM
> > Subject: Re: [PATCH 1/2] vhost: allow multiple worker threads
> >
> > On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
> > > This patch allows vhost to have multiple worker threads for devices
> > > such as virtio-net, which may have multiple virtqueues.
> > >
> > > Since virtqueues are lockless ring queues, in an ideal world data is
> > > being produced by the producer as fast as data is being consumed by
> > > the consumer.  These loops will continue to consume data until none
> > > is left.
> > >
> > > vhost currently multiplexes the consumer side of the queue on a
> > > single thread by attempting to read from the queue until everything
> > > is read or it cannot process any more.  This means that activity on
> > > one queue may stall another queue.
> >
> > There's actually an attempt to address this: look up
> > VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
> >
> > > This is exacerbated when using any form of polling to read from
> > > the queues (as we'll introduce in the next patch).  Spawning a
> > > thread per virtqueue addresses this.
> > >
> > > The only problem with this patch right now is how the threads are
> > > woken up.  It's essentially a broadcast, and we have seen lock
> > > contention as a result.
> >
> > On which lock?
> 
> The mutex lock in the vhost_virtqueue struct.  This really shows up when
> running with patch 2/2 and increasing the spin_threshold.  Both threads wake
> up and try to acquire the mutex.  As the spin_threshold increases, you end up
> with one of the threads getting blocked for a longer and longer time and
> unable to do any RX processing that might be needed.
> 
> Tom

Weird, I had the impression each thread handles one vq.
Isn't this the design?

> [...]


* Re: [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-20 19:27       ` Michael S. Tsirkin
@ 2012-02-20 19:46         ` Anthony Liguori
  2012-02-20 21:00           ` Michael S. Tsirkin
  2012-02-21  4:32           ` Jason Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2012-02-20 19:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Tom Lendacky, netdev, Cristian Viana

On 02/20/2012 01:27 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 20, 2012 at 09:50:37AM -0600, Tom Lendacky wrote:
>> "Michael S. Tsirkin"<mst@redhat.com>  wrote on 02/19/2012 08:41:45 AM:
>>
>>> From: "Michael S. Tsirkin"<mst@redhat.com>
>>> To: Anthony Liguori/Austin/IBM@IBMUS
>>> Cc: netdev@vger.kernel.org, Tom Lendacky/Austin/IBM@IBMUS, Cristian
>>> Viana<vianac@br.ibm.com>
>>> Date: 02/19/2012 08:42 AM
>>> Subject: Re: [PATCH 1/2] vhost: allow multiple worker threads
>>>
>>> On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
>>>> This patch allows vhost to have multiple worker threads for devices
>>>> such as virtio-net, which may have multiple virtqueues.
>>>>
>>>> Since virtqueues are lockless ring queues, in an ideal world data is
>>>> being produced by the producer as fast as data is being consumed by
>>>> the consumer.  These loops will continue to consume data until none
>>>> is left.
>>>>
>>>> vhost currently multiplexes the consumer side of the queue on a
>>>> single thread by attempting to read from the queue until everything
>>>> is read or it cannot process any more.  This means that activity on
>>>> one queue may stall another queue.
>>>
>>> There's actually an attempt to address this: look up
>>> VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
>>>
>>>> This is exacerbated when using any form of polling to read from
>>>> the queues (as we'll introduce in the next patch).  Spawning a
>>>> thread per virtqueue addresses this.
>>>>
>>>> The only problem with this patch right now is how the threads are
>>>> woken up.  It's essentially a broadcast, and we have seen lock
>>>> contention as a result.
>>>
>>> On which lock?
>>
>> The mutex lock in the vhost_virtqueue struct.  This really shows up when
>> running with patch 2/2 and increasing the spin_threshold.  Both threads wake
>> up and try to acquire the mutex.  As the spin_threshold increases, you end up
>> with one of the threads getting blocked for a longer and longer time and
>> unable to do any RX processing that might be needed.
>>
>> Tom
>
> Weird, I had the impression each thread handles one vq.
> Isn't this the design?

Not the way the code is structured today.  There is a single consumer/producer 
work queue and either the vq notification or other actions may get placed on it.

It would be possible to do three threads, one for background tasks and then one 
for each queue with a more invasive refactoring.

But I assumed that the reason the code was structured this way originally was 
because you saw some value in having a single producer/consumer queue for 
everything...
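
(To illustrate the three-thread idea: dispatch could key off the virtqueue a
work item belongs to.  Purely a sketch; the work->vq_index back-pointer is
hypothetical, not a field the current code has:

	static struct task_struct *vhost_pick_worker(struct vhost_dev *dev,
						     struct vhost_work *work)
	{
		/* vq-bound work goes to that vq's dedicated thread,
		 * everything else to the background thread */
		if (work->vq_index >= 0)
			return dev->workers[1 + work->vq_index];
		return dev->workers[0];
	}
)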

Regards,

Anthony Liguori


* Re: [PATCH 1/2] vhost: allow multiple worker threads
  2012-02-20 19:46         ` Anthony Liguori
@ 2012-02-20 21:00           ` Michael S. Tsirkin
  2012-02-21  1:04             ` Shirley Ma
  2012-02-21  4:32           ` Jason Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-02-20 21:00 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Tom Lendacky, netdev, Cristian Viana

On Mon, Feb 20, 2012 at 01:46:03PM -0600, Anthony Liguori wrote:
> On 02/20/2012 01:27 PM, Michael S. Tsirkin wrote:
> >On Mon, Feb 20, 2012 at 09:50:37AM -0600, Tom Lendacky wrote:
> >>"Michael S. Tsirkin"<mst@redhat.com>  wrote on 02/19/2012 08:41:45 AM:
> >>
> >>>From: "Michael S. Tsirkin"<mst@redhat.com>
> >>>To: Anthony Liguori/Austin/IBM@IBMUS
> >>>Cc: netdev@vger.kernel.org, Tom Lendacky/Austin/IBM@IBMUS, Cristian
> >>>Viana<vianac@br.ibm.com>
> >>>Date: 02/19/2012 08:42 AM
> >>>Subject: Re: [PATCH 1/2] vhost: allow multiple worker threads
> >>>
> >>>On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
> >>>>This patch allows vhost to have multiple worker threads for devices
> >>such as
> >>>>virtio-net which may have multiple virtqueues.
> >>>>
> >>>>Since virtqueues are a lockless ring queue, in an ideal world data is
> >>being
> >>>>produced by the producer as fast as data is being consumed by the
> >>consumer.
> >>>>These loops will continue to consume data until none is left.
> >>>>
> >>>>vhost currently multiplexes the consumer side of the queue on a
> >>>single thread
> >>>>by attempting to read from the queue until everything is read or it
> >>cannot
> >>>>process anymore.  This means that activity on one queue may stall
> >>>another queue.
> >>>
> >>>There's actually an attempt to address this: look up
> >>>VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
> >>>
> >>>>This is exacerbated when using any form of polling to read from
> >>>the queues (as
> >>>>we'll introduce in the next patch).  By spawning a thread per-
> >>>virtqueue, this
> >>>>is addressed.
> >>>>
> >>>>The only problem with this patch right now is how the wake up of
> >>>the threads is
> >>>>done.  It's essentially a broadcast and we have seen lock contention as
> >>a
> >>>>result.
> >>>
> >>>On which lock?
> >>
> >>The mutex lock in the vhost_virtqueue struct.  This really shows up when
> >>running with patch 2/2 and increasing the spin_threshold. Both threads wake
> >>up and try to acquire the mutex.  As the spin_threshold increases you end up
> >>with one of the threads getting blocked for a longer and longer time and
> >>unable to do any RX processing that might be needed.
> >>
> >>Tom
> >
> >Weird, I had the impression each thread handles one vq.
> >Isn't this the design?
> 
> Not the way the code is structured today.  There is a single
> consumer/producer work queue and either the vq notification or other
> actions may get placed on it.

And then a random thread picks it up?
Won't this cause packet reordering?
I'll go reread.

> It would be possible to do three threads, one for background tasks
> and then one for each queue with a more invasive refactoring.
> 
> But I assumed that the reason the code was structured this way
> originally was because you saw some value in having a single
> producer/consumer queue for everything...
> 
> Regards,
> 
> Anthony Liguori

The point was really to avoid scheduler overhead
as with tcp, tx and rx tend to run on the same cpu.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-20 21:00           ` Michael S. Tsirkin
@ 2012-02-21  1:04             ` Shirley Ma
  2012-02-21  3:21               ` Michael S. Tsirkin
  0 siblings, 1 reply; 23+ messages in thread
From: Shirley Ma @ 2012-02-21  1:04 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Anthony Liguori, Tom Lendacky, netdev, Cristian Viana

On Mon, 2012-02-20 at 23:00 +0200, Michael S. Tsirkin wrote:
> 
> The point was really to avoid scheduler overhead
> as with tcp, tx and rx tend to run on the same cpu.

We have tried different approaches in the past, like splitting the vhost
thread into separate TX and RX threads, or creating per-cpu vhost threads
instead of creating one vhost thread per VM per virtio_net device...

We think the per-cpu vhost thread is a better approach based on the data we
have collected. It will reduce both vhost resource usage and scheduler
overhead. It will not depend on the host scheduler and has less variance. The
patch is under testing; we hope we can post it soon.
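
As a rough sketch of the per-cpu direction described above (hypothetical
names and structure; an assumption about the design, not the pending
patch): all VMs share one worker per host CPU, selected by the CPU the
kick arrives on.

struct vhost_worker {
	struct task_struct *task;
	spinlock_t lock;
	struct list_head work_list;
};

static DEFINE_PER_CPU(struct vhost_worker, vhost_percpu_worker);

static void vhost_queue_work_percpu(struct vhost_work *work)
{
	struct vhost_worker *w = &get_cpu_var(vhost_percpu_worker);

	spin_lock(&w->lock);
	list_add_tail(&work->node, &w->work_list);
	spin_unlock(&w->lock);
	wake_up_process(w->task);
	put_cpu_var(vhost_percpu_worker);
}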

Thanks
Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-19 14:51   ` Michael S. Tsirkin
@ 2012-02-21  1:35     ` Shirley Ma
  2012-02-21  5:34       ` Jason Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Shirley Ma @ 2012-02-21  1:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Anthony Liguori, netdev, Tom Lendacky, Cristian Viana

We tried a similar approach before, using a minimum timer for handle_tx
to stay in the loop and accumulate more packets before enabling the guest
notification. It did have better TCP_RR and UDP_RR results. However, we
think this is just a debug patch. We really need to understand why
handle_tx can't see more packets to process for a multiple-instance
request/response type of workload first. Spinning in this loop is not a
good solution.
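
For reference, a time-based variant of that idea might look roughly like
this inside handle_tx() (a sketch only; poll_window_ns is a hypothetical
knob, and head/out/in are the locals handle_tx already uses):

	/* Keep polling until a short deadline passes, so more packets
	 * can accumulate before guest notification is re-enabled. */
	u64 deadline = ktime_to_ns(ktime_get()) + poll_window_ns;

	for (;;) {
		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
					 ARRAY_SIZE(vq->iov),
					 &out, &in, NULL, NULL);
		if (head == vq->num) {
			if (ktime_to_ns(ktime_get()) < deadline)
				continue;	/* poll a little longer */
			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
				vhost_disable_notify(&net->dev, vq);
				continue;
			}
			break;
		}
		/* ... hand the descriptor to the socket as usual ... */
	}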

Thanks
Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-21  1:04             ` Shirley Ma
@ 2012-02-21  3:21               ` Michael S. Tsirkin
  2012-02-21  4:03                 ` Shirley Ma
  0 siblings, 1 reply; 23+ messages in thread
From: Michael S. Tsirkin @ 2012-02-21  3:21 UTC (permalink / raw)
  To: Shirley Ma; +Cc: Anthony Liguori, Tom Lendacky, netdev, Cristian Viana

On Mon, Feb 20, 2012 at 05:04:10PM -0800, Shirley Ma wrote:
> On Mon, 2012-02-20 at 23:00 +0200, Michael S. Tsirkin wrote:
> > 
> > The point was really to avoid scheduler overhead
> > as with tcp, tx and rx tend to run on the same cpu.
> 
> We have tried different approaches in the past, like splitting the vhost
> thread into separate TX and RX threads, or creating per-cpu vhost threads
> instead of creating one vhost thread per VM per virtio_net device...
> 
> We think the per-cpu vhost thread is a better approach based on the data we
> have collected. It will reduce both vhost resource usage and scheduler
> overhead. It will not depend on the host scheduler and has less variance. The
> patch is under testing; we hope we can post it soon.
> 
> Thanks
> Shirley

Yes, great, this is definitely interesting. I actually started with
a per-cpu one - it did not perform well, but I did not
figure out why; switching to a single thread fixed it,
and I did not dig into it.


-- 
MST

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-21  3:21               ` Michael S. Tsirkin
@ 2012-02-21  4:03                 ` Shirley Ma
  2012-03-05 13:21                   ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Shirley Ma @ 2012-02-21  4:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Anthony Liguori, Tom Lendacky, netdev, Cristian Viana

On Tue, 2012-02-21 at 05:21 +0200, Michael S. Tsirkin wrote:
> On Mon, Feb 20, 2012 at 05:04:10PM -0800, Shirley Ma wrote:
> > On Mon, 2012-02-20 at 23:00 +0200, Michael S. Tsirkin wrote:
> > > 
> > > The point was really to avoid scheduler overhead
> > > as with tcp, tx and rx tend to run on the same cpu.
> > 
> > We have tried different approaches in the past, like splitting the vhost
> > thread into separate TX and RX threads, or creating per-cpu vhost threads
> > instead of creating one vhost thread per VM per virtio_net device...
> > 
> > We think the per-cpu vhost thread is a better approach based on the data
> > we have collected. It will reduce both vhost resource usage and scheduler
> > overhead. It will not depend on the host scheduler and has less variance.
> > The patch is under testing; we hope we can post it soon.
> > 
> > Thanks
> > Shirley
> 
> Yes, great, this is definitely interesting. I actually started with
> a per-cpu one - it did not perform well, but I did not
> figure out why; switching to a single thread fixed it,
> and I did not dig into it.

The patch includes per-cpu vhost threads & vhost NUMA-aware scheduling.

It is very interesting. We are collecting performance data with
different workloads (streams, request/response) related to which VCPU
runs on which CPU, which vhost cpu thread is being scheduled, and which
NIC TX/RX queues are being used. The performance was different when
using different vhost scheduling approaches for the TX and RX workers.
The results seem pretty good: with 60 UDP_RR instances, the results even
more than doubled in our lab. However, the TCP_RR results couldn't catch
up with the UDP_RR results.
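
As a purely illustrative sketch of what NUMA-aware selection could mean
on top of a per-cpu worker pool (vhost_pick_worker, vhost_percpu_worker,
and the idle heuristic are all assumptions, not the patch under test):

/* Prefer an idle per-cpu worker on the kicking CPU's NUMA node,
 * falling back to the kicking CPU's own worker. */
static struct vhost_worker *vhost_pick_worker(int kick_cpu)
{
	int node = cpu_to_node(kick_cpu);
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(node)) {
		struct vhost_worker *w = &per_cpu(vhost_percpu_worker, cpu);

		if (list_empty(&w->work_list))	/* idle worker on-node */
			return w;
	}
	return &per_cpu(vhost_percpu_worker, kick_cpu);
}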

Thanks
Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-20 19:46         ` Anthony Liguori
  2012-02-20 21:00           ` Michael S. Tsirkin
@ 2012-02-21  4:32           ` Jason Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Jason Wang @ 2012-02-21  4:32 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, Tom Lendacky, netdev, Cristian Viana

On 02/21/2012 03:46 AM, Anthony Liguori wrote:
> On 02/20/2012 01:27 PM, Michael S. Tsirkin wrote:
>> On Mon, Feb 20, 2012 at 09:50:37AM -0600, Tom Lendacky wrote:
>>> "Michael S. Tsirkin"<mst@redhat.com>  wrote on 02/19/2012 08:41:45 AM:
>>>
>>>> From: "Michael S. Tsirkin"<mst@redhat.com>
>>>> To: Anthony Liguori/Austin/IBM@IBMUS
>>>> Cc: netdev@vger.kernel.org, Tom Lendacky/Austin/IBM@IBMUS, Cristian
>>>> Viana<vianac@br.ibm.com>
>>>> Date: 02/19/2012 08:42 AM
>>>> Subject: Re: [PATCH 1/2] vhost: allow multiple workers threads
>>>>
>>>> On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
>>>>> This patch allows vhost to have multiple worker threads for devices
>>> such as
>>>>> virtio-net which may have multiple virtqueues.
>>>>>
>>>>> Since virtqueues are a lockless ring queue, in an ideal world data is
>>> being
>>>>> produced by the producer as fast as data is being consumed by the
>>> consumer.
>>>>> These loops will continue to consume data until none is left.
>>>>>
>>>>> vhost currently multiplexes the consumer side of the queue on a
>>>> single thread
>>>>> by attempting to read from the queue until everything is read or it
>>> cannot
>>>>> process anymore.  This means that activity on one queue may stall
>>>> another queue.
>>>>
>>>> There's actually an attempt to address this: look up
>>>> VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
>>>>
>>>>> This is exacerbated when using any form of polling to read from
>>>> the queues (as
>>>>> we'll introduce in the next patch).  By spawning a thread per-
>>>> virtqueue, this
>>>>> is addressed.
>>>>>
>>>>> The only problem with this patch right now is how the wake up of
>>>> the threads is
>>>>> done.  It's essentially a broadcast and we have seen lock 
>>>>> contention as
>>> a
>>>>> result.
>>>>
>>>> On which lock?
>>>
>>> The mutex lock in the vhost_virtqueue struct.  This really shows up when
>>> running with patch 2/2 and increasing the spin_threshold. Both threads wake
>>> up and try to acquire the mutex.  As the spin_threshold increases you end up
>>> with one of the threads getting blocked for a longer and longer time and
>>> unable to do any RX processing that might be needed.
>>>
>>> Tom
>>
>> Weird, I had the impression each thread handles one vq.
>> Isn't this the design?
>
> Not the way the code is structured today.  There is a single 
> consumer/producer work queue and either the vq notification or other 
> actions may get placed on it.
>
> It would be possible to do three threads, one for background tasks and 
> then one for each queue with a more invasive refactoring.
>
> But I assumed that the reason the code was structured this way 
> originally was because you saw some value in having a single 
> producer/consumer queue for everything...
>
> Regards,
>
> Anthony Liguori

Not sure I'm reading the code correctly, but it looks like with this series, 
two worker threads can try to handle the work of the same virtqueue?  Looks 
strange, as the notification from the other side (guest/net) should be 
disabled even if the vhost thread is spinning.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-19 14:41   ` Michael S. Tsirkin
  2012-02-20 15:50     ` Tom Lendacky
@ 2012-02-21  4:51     ` Jason Wang
  1 sibling, 0 replies; 23+ messages in thread
From: Jason Wang @ 2012-02-21  4:51 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Anthony Liguori, netdev, Tom Lendacky, Cristian Viana

On 02/19/2012 10:41 PM, Michael S. Tsirkin wrote:
> On Fri, Feb 17, 2012 at 05:02:05PM -0600, Anthony Liguori wrote:
>> >  This patch allows vhost to have multiple worker threads for devices such as
>> >  virtio-net which may have multiple virtqueues.
>> >  
>> >  Since virtqueues are a lockless ring queue, in an ideal world data is being
>> >  produced by the producer as fast as data is being consumed by the consumer.
>> >  These loops will continue to consume data until none is left.
>> >  
>> >  vhost currently multiplexes the consumer side of the queue on a single thread
>> >  by attempting to read from the queue until everything is read or it cannot
>> >  process anymore.  This means that activity on one queue may stall another queue.
> There's actually an attempt to address this: look up
> VHOST_NET_WEIGHT in the code. I take it, this isn't effective?
>
>> >  This is exacerbated when using any form of polling to read from the queues (as
>> >  we'll introduce in the next patch).  By spawning a thread per-virtqueue, this
>> >  is addressed.
>> >  
>> >  The only problem with this patch right now is how the wake up of the threads is
>> >  done.  It's essentially a broadcast and we have seen lock contention as a
>> >  result.
> On which lock?
>
>> >  We've tried some approaches to signal a single thread but I'm not
>> >  confident that that code is correct yet so I'm only sending the broadcast
>> >  version.
> Yes, that looks like an obvious question.
>
>> >  Here are some performance results from this change.  There's a modest
>> >  improvement with stream although a fair bit of variability too.
>> >  
>> >  With RR, there's pretty significant improvements as the instance rate drives up.
> Interesting. This was actually tested at one time and we saw
> a significant performance improvement from using
> a single thread especially with a single
> stream in the guest. Profiling indicated that
> with a thread per queue we get too many context
> switches between TX and RX, since guest networking
> tends to run TX and RX processing on the same
> guest VCPU.
>
> Maybe we were wrong or maybe this went away
> for some reason. I'll see if this can be reproduced.
>

I tried a similar test in January. The test used one dedicated vhost 
thread to handle tx requests and another one for rx. The test results showed 
much degradation, as both the #exits and #irqs increased. There 
are some differences, as I tested between local host and guest, and the 
guest did not have the very recent virtio changes (unlocked kick and 
exposing index immediately). I will retry with a recent kernel.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-21  1:35     ` Shirley Ma
@ 2012-02-21  5:34       ` Jason Wang
  2012-02-21  6:28         ` Shirley Ma
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Wang @ 2012-02-21  5:34 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Anthony Liguori, netdev, Tom Lendacky,
	Cristian Viana

On 02/21/2012 09:35 AM, Shirley Ma wrote:
> We tried a similar approach before, using a minimum timer for handle_tx
> to stay in the loop and accumulate more packets before enabling the guest
> notification. It did have better TCP_RR and UDP_RR results. However, we
> think this is just a debug patch. We really need to understand why
> handle_tx can't see more packets to process for a multiple-instance
> request/response type of workload first. Spinning in this loop is not a
> good solution.

Spinning helps with latency, but it looks like we need some adaptive 
method to adjust the threshold dynamically, such as monitoring the minimum 
time gap between two packets. For throughput, if we can improve the 
batching of small packets we can improve it. I've tried using the event 
index to delay the tx kick until a specified number of packets were 
batched in the virtqueue. Tests show improved throughput for small 
packets, as the number of #exits was reduced greatly (the packets/#exit 
and cpu utilization were increased), but it damages the performance of 
other workloads. This is only for debugging, but it confirms that there's 
something we need to improve in the batching.
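
A guest-side sketch of that delayed-kick experiment might look roughly
as follows (assuming VIRTIO_RING_F_EVENT_IDX is negotiated so skipped
kicks are safe; xmit_skb_batched, batch, and pending are illustrative
names, not the actual debug patch):

/* Defer the tx kick until `batch` buffers have been queued, so the
 * intermediate notifications are never issued at all. */
static unsigned int pending;

static void xmit_skb_batched(struct virtqueue *vq, struct scatterlist *sg,
			     unsigned int out, void *data, unsigned int batch)
{
	virtqueue_add_buf(vq, sg, out, 0, data, GFP_ATOMIC);

	if (++pending >= batch) {
		virtqueue_kick(vq);
		pending = 0;
	}
}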

>
> Thanks
> Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-21  5:34       ` Jason Wang
@ 2012-02-21  6:28         ` Shirley Ma
  2012-02-21  6:38           ` Jason Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Shirley Ma @ 2012-02-21  6:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Anthony Liguori, netdev, Tom Lendacky,
	Cristian Viana

On Tue, 2012-02-21 at 13:34 +0800, Jason Wang wrote:
> On 02/21/2012 09:35 AM, Shirley Ma wrote:
> > We tried a similar approach before, using a minimum timer for handle_tx
> > to stay in the loop and accumulate more packets before enabling the guest
> > notification. It did have better TCP_RR and UDP_RR results. However, we
> > think this is just a debug patch. We really need to understand why
> > handle_tx can't see more packets to process for a multiple-instance
> > request/response type of workload first. Spinning in this loop is not a
> > good solution.
> 
> Spinning helps with latency, but it looks like we need some adaptive 
> method to adjust the threshold dynamically, such as monitoring the minimum 
> time gap between two packets. For throughput, if we can improve the 
> batching of small packets we can improve it. I've tried using the event 
> index to delay the tx kick until a specified number of packets were 
> batched in the virtqueue. Tests show improved throughput for small 
> packets, as the number of #exits was reduced greatly (the packets/#exit 
> and cpu utilization were increased), but it damages the performance of 
> other workloads. This is only for debugging, but it confirms that there's 
> something we need to improve in the batching.

Our test case was 60 instances of 256/256-byte TCP_RRs or UDP_RRs. In
theory there should be multiple packets in the queue by the time vhost
gets notified, but from the debugging output, there were only a few or even
one packet in the queue. So the question here is why the time gap between
two packets is that big.

Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-21  6:28         ` Shirley Ma
@ 2012-02-21  6:38           ` Jason Wang
  2012-02-21 11:09             ` Shirley Ma
  2012-02-21 16:08             ` Sridhar Samudrala
  0 siblings, 2 replies; 23+ messages in thread
From: Jason Wang @ 2012-02-21  6:38 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Anthony Liguori, netdev, Tom Lendacky,
	Cristian Viana

On 02/21/2012 02:28 PM, Shirley Ma wrote:
> On Tue, 2012-02-21 at 13:34 +0800, Jason Wang wrote:
>> On 02/21/2012 09:35 AM, Shirley Ma wrote:
>>> We tried a similar approach before, using a minimum timer for handle_tx
>>> to stay in the loop and accumulate more packets before enabling the guest
>>> notification. It did have better TCP_RR and UDP_RR results. However, we
>>> think this is just a debug patch. We really need to understand why
>>> handle_tx can't see more packets to process for a multiple-instance
>>> request/response type of workload first. Spinning in this loop is not a
>>> good solution.
>> Spinning helps with latency, but it looks like we need some adaptive
>> method to adjust the threshold dynamically, such as monitoring the minimum
>> time gap between two packets. For throughput, if we can improve the
>> batching of small packets we can improve it. I've tried using the event
>> index to delay the tx kick until a specified number of packets were
>> batched in the virtqueue. Tests show improved throughput for small
>> packets, as the number of #exits was reduced greatly (the packets/#exit
>> and cpu utilization were increased), but it damages the performance of
>> other workloads. This is only for debugging, but it confirms that there's
>> something we need to improve in the batching.
> Our test case was 60 instances of 256/256-byte TCP_RRs or UDP_RRs. In
> theory there should be multiple packets in the queue by the time vhost
> gets notified, but from the debugging output, there were only a few or even
> one packet in the queue. So the question here is why the time gap between
> two packets is that big.
>
> Shirley

Not sure whether it's related, but did you try disabling the Nagle 
algorithm during the test?
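
(For reference, in a C test harness Nagle is disabled per-socket with
TCP_NODELAY; a minimal userspace sketch, independent of the patches
here:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle on a connected TCP test socket; returns 0 on success. */
static int disable_nagle(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}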

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-21  6:38           ` Jason Wang
@ 2012-02-21 11:09             ` Shirley Ma
  2012-02-21 16:08             ` Sridhar Samudrala
  1 sibling, 0 replies; 23+ messages in thread
From: Shirley Ma @ 2012-02-21 11:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Anthony Liguori, netdev, Tom Lendacky,
	Cristian Viana

On Tue, 2012-02-21 at 14:38 +0800, Jason Wang wrote:
> Not sure whether it's related, but did you try disabling the Nagle 
> algorithm during the test?

UDP doesn't use Nagle, and the TCP_RR results showed no difference with
the Nagle algorithm disabled or enabled.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-21  6:38           ` Jason Wang
  2012-02-21 11:09             ` Shirley Ma
@ 2012-02-21 16:08             ` Sridhar Samudrala
  1 sibling, 0 replies; 23+ messages in thread
From: Sridhar Samudrala @ 2012-02-21 16:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Shirley Ma, Michael S. Tsirkin, Anthony Liguori, netdev,
	Tom Lendacky, Cristian Viana

On Tue, 2012-02-21 at 14:38 +0800, Jason Wang wrote:
> On 02/21/2012 02:28 PM, Shirley Ma wrote:
> > On Tue, 2012-02-21 at 13:34 +0800, Jason Wang wrote:
> >> On 02/21/2012 09:35 AM, Shirley Ma wrote:
> >>> We tried a similar approach before, using a minimum timer for handle_tx
> >>> to stay in the loop and accumulate more packets before enabling the guest
> >>> notification. It did have better TCP_RR and UDP_RR results. However, we
> >>> think this is just a debug patch. We really need to understand why
> >>> handle_tx can't see more packets to process for a multiple-instance
> >>> request/response type of workload first. Spinning in this loop is not a
> >>> good solution.
> >> Spinning helps with latency, but it looks like we need some adaptive
> >> method to adjust the threshold dynamically, such as monitoring the minimum
> >> time gap between two packets. For throughput, if we can improve the
> >> batching of small packets we can improve it. I've tried using the event
> >> index to delay the tx kick until a specified number of packets were
> >> batched in the virtqueue. Tests show improved throughput for small
> >> packets, as the number of #exits was reduced greatly (the packets/#exit
> >> and cpu utilization were increased), but it damages the performance of
> >> other workloads. This is only for debugging, but it confirms that there's
> >> something we need to improve in the batching.
> > Our test case was 60 instances of 256/256-byte TCP_RRs or UDP_RRs. In
> > theory there should be multiple packets in the queue by the time vhost
> > gets notified, but from the debugging output, there were only a few or even
> > one packet in the queue. So the question here is why the time gap between
> > two packets is that big.
> >
> > Shirley
> 
> Not sure whether it's related, but did you try disabling the Nagle 
> algorithm during the test?

This is a 60-instance, 256-byte request-response test. So each request
for each instance is independent and sub-MTU in size, and the next
request is not sent until the response is received. So Nagle doesn't
delay any packets in this workload.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-02-21  4:03                 ` Shirley Ma
@ 2012-03-05 13:21                   ` Anthony Liguori
  2012-03-05 20:43                     ` Shirley Ma
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2012-03-05 13:21 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Anthony Liguori, Tom Lendacky, netdev,
	Cristian Viana

On 02/20/2012 10:03 PM, Shirley Ma wrote:
> On Tue, 2012-02-21 at 05:21 +0200, Michael S. Tsirkin wrote:
>> On Mon, Feb 20, 2012 at 05:04:10PM -0800, Shirley Ma wrote:
>>> On Mon, 2012-02-20 at 23:00 +0200, Michael S. Tsirkin wrote:
>>>>
>>>> The point was really to avoid scheduler overhead
>>>> as with tcp, tx and rx tend to run on the same cpu.
>>>
>>> We have tried different approaches in the past, like splitting the vhost
>>> thread into separate TX and RX threads, or creating per-cpu vhost threads
>>> instead of creating one vhost thread per VM per virtio_net device...
>>>
>>> We think the per-cpu vhost thread is a better approach based on the data
>>> we have collected. It will reduce both vhost resource usage and scheduler
>>> overhead. It will not depend on the host scheduler and has less variance.
>>> The patch is under testing; we hope we can post it soon.
>>>
>>> Thanks
>>> Shirley
>>
>> Yes, great, this is definitely interesting. I actually started with
>> a per-cpu one - it did not perform well, but I did not
>> figure out why; switching to a single thread fixed it,
>> and I did not dig into it.
>
> The patch includes per-cpu vhost threads & vhost NUMA-aware scheduling.

Hi Shirley,

Are you planning on posting these patches soon?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/2] vhost: allow multiple workers threads
  2012-03-05 13:21                   ` Anthony Liguori
@ 2012-03-05 20:43                     ` Shirley Ma
  0 siblings, 0 replies; 23+ messages in thread
From: Shirley Ma @ 2012-03-05 20:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Michael S. Tsirkin, Anthony Liguori, Tom Lendacky, netdev,
	Cristian Viana

On Mon, 2012-03-05 at 07:21 -0600, Anthony Liguori wrote:
> Hi Shirley,
> 
> Are you planning on posting these patches soon?
> 
> Regards,
> 
> Anthony Liguori

I had thought to compare different scheduling approaches with different NICs
before submitting it. Maybe it's better to submit an RFC patch to get
review comments now. Let me clean up the patch and submit it here.

Thanks
Shirley

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/2] vhost-net: add a spin_threshold parameter
  2012-02-17 23:02 ` [PATCH 2/2] vhost-net: add a spin_threshold parameter Anthony Liguori
  2012-02-19 14:51   ` Michael S. Tsirkin
@ 2012-03-12  8:12   ` Dor Laor
  1 sibling, 0 replies; 23+ messages in thread
From: Dor Laor @ 2012-03-12  8:12 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: netdev, Michael Tsirkin, Tom Lendacky, Cristian Viana, Jason Wang

On 02/18/2012 01:02 AM, Anthony Liguori wrote:
> With workloads that are dominated by very high rates of small packets, we see
> considerable overhead in virtio notifications.
>
> The best strategy we've been able to come up with to deal with this is adaptive
> polling.  This patch simply adds the infrastructure needed to experiment with
> polling strategies.  It is not meant for inclusion.
>
> Here are the results with various polling values.  The spinning is not currently
> a net win due to the high mutex contention caused by the broadcast wakeup.  With
> a patch attempting to signal wakeup, we see up to 170+ transactions per second
> with TCP_RR 60 instance.

Do you really intend to add a busy loop to the code? There would 
probably be another way to solve it.

One of the options we talked about is to add an ethtool-controllable knob 
in the guest to set the min/max time for wakeup.

>
> N  Baseline	Spin 0		Spin 1000	Spin 5000	(Spin 5000 as % of Baseline)
>
> TCP_RR
>
> 1  9,639.66	10,164.06	9,825.43	9,827.45	101.95%
> 10 62,819.55	54,059.78	63,114.30	60,767.23	96.73%
> 30 84,715.60	131,241.86	120,922.38	89,776.39	105.97%
> 60 124,614.71	148,720.66	158,678.08	141,400.05	113.47%
>
> UDP_RR
>
> 1  9,652.50	10,343.72	9,493.95	9,569.54	99.14%
> 10 53,830.26	58,235.90	50,145.29	48,820.53	90.69%
> 30 89,471.01	97,634.53	95,108.34	91,263.65	102.00%
> 60 103,640.59	164,035.01	157,002.22	128,646.73	124.13%
>
> TCP_STREAM
> 1  2,622.63	2,610.71	2,688.49	2,678.61	102.13%
> 4  4,928.02	4,812.05	4,971.00	5,104.57	103.58%
>
> 1  5,639.89	5,751.28	5,819.81	5,593.62	99.18%
> 4  5,874.72	6,575.55	6,324.87	6,502.33	110.68%
>
> 1  6,257.42	7,655.22	7,610.52	7,424.74	118.65%
> 4  5,370.78	6,044.83	5,784.23	6,209.93	115.62%
>
> 1  6,346.63	7,267.44	7,567.39	7,677.93	120.98%
> 4  5,198.02	5,657.12	5,528.94	5,792.42	111.44%
>
> TCP_MAERTS
>
> 1  2,091.38	1,765.62	2,142.56	2,312.94	110.59%
> 4  5,319.52	5,619.49	5,544.50	5,645.81	106.13%
>
> 1  7,030.66	7,593.61	7,575.67	7,622.07	108.41%
> 4  9,040.53	7,275.84	7,322.07	6,681.34	73.90%
>
> 1  9,160.93	9,318.15	9,065.82	8,586.82	93.73%
> 4  9,372.49	8,875.63	8,959.03	9,056.07	96.62%
>
> 1  9,183.28	9,134.02	8,945.12	8,657.72	94.28%
> 4  9,377.17	8,877.52	8,959.54	9,071.53	96.74%
>
> Cc: Tom Lendacky <toml@us.ibm.com>
> Cc: Cristian Viana <vianac@br.ibm.com>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> ---
>   drivers/vhost/net.c |   14 ++++++++++++++
>   1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 47175cd..e9e5866 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -37,6 +37,10 @@ static int workers = 2;
>   module_param(workers, int, 0444);
>   MODULE_PARM_DESC(workers, "Set the number of worker threads");
>
> +static ulong spin_threshold = 0;
> +module_param(spin_threshold, ulong, 0444);
> +MODULE_PARM_DESC(spin_threshold, "The polling threshold for the tx queue");
> +
>   /* Max number of bytes transferred before requeueing the job.
>    * Using this limit prevents one virtqueue from starving others. */
>   #define VHOST_NET_WEIGHT 0x80000
> @@ -65,6 +69,7 @@ struct vhost_net {
>   	 * We only do this when socket buffer fills up.
>   	 * Protected by tx vq lock. */
>   	enum vhost_net_poll_state tx_poll_state;
> +	size_t spin_threshold;
>   };
>
>   static bool vhost_sock_zcopy(struct socket *sock)
> @@ -149,6 +154,7 @@ static void handle_tx(struct vhost_net *net)
>   	size_t hdr_size;
>   	struct socket *sock;
>   	struct vhost_ubuf_ref *uninitialized_var(ubufs);
> +	size_t spin_count;
>   	bool zcopy;
>
>   	/* TODO: check that we are running from vhost_worker? */
> @@ -172,6 +178,7 @@ static void handle_tx(struct vhost_net *net)
>   	hdr_size = vq->vhost_hlen;
>   	zcopy = vhost_sock_zcopy(sock);
>
> +	spin_count = 0;
>   	for (;;) {
>   		/* Release DMAs done buffers first */
>   		if (zcopy)
> @@ -205,9 +212,15 @@ static void handle_tx(struct vhost_net *net)
>   				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>   				break;
>   			}
> +			if (spin_count < net->spin_threshold) {
> +				spin_count++;
> +				continue;
> +			}
>   			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>   				vhost_disable_notify(&net->dev, vq);
>   				continue;
> +			} else {
> +				spin_count = 0;
>   			}
>   			break;
>   		}
> @@ -506,6 +519,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>   		return -ENOMEM;
>
>   	dev = &n->dev;
> +	n->spin_threshold = spin_threshold;
>   	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
>   	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
>   	r = vhost_dev_init(dev, n->vqs, workers, VHOST_NET_VQ_MAX);
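
For anyone experimenting with this debug patch: both knobs are read-only
module parameters (mode 0444), so they would be set when loading the
module, e.g. "modprobe vhost_net workers=2 spin_threshold=1000". The
values shown are arbitrary examples for illustration, not recommended
settings.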

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-03-12  8:12 UTC | newest]

Thread overview: 23+ messages
2012-02-17 23:02 [PATCH 0/2][RFC] vhost: improve transmit rate with virtqueue polling Anthony Liguori
2012-02-17 23:02 ` [PATCH 1/2] vhost: allow multiple workers threads Anthony Liguori
2012-02-19 14:41   ` Michael S. Tsirkin
2012-02-20 15:50     ` Tom Lendacky
2012-02-20 19:27       ` Michael S. Tsirkin
2012-02-20 19:46         ` Anthony Liguori
2012-02-20 21:00           ` Michael S. Tsirkin
2012-02-21  1:04             ` Shirley Ma
2012-02-21  3:21               ` Michael S. Tsirkin
2012-02-21  4:03                 ` Shirley Ma
2012-03-05 13:21                   ` Anthony Liguori
2012-03-05 20:43                     ` Shirley Ma
2012-02-21  4:32           ` Jason Wang
2012-02-21  4:51     ` Jason Wang
2012-02-17 23:02 ` [PATCH 2/2] vhost-net: add a spin_threshold parameter Anthony Liguori
2012-02-19 14:51   ` Michael S. Tsirkin
2012-02-21  1:35     ` Shirley Ma
2012-02-21  5:34       ` Jason Wang
2012-02-21  6:28         ` Shirley Ma
2012-02-21  6:38           ` Jason Wang
2012-02-21 11:09             ` Shirley Ma
2012-02-21 16:08             ` Sridhar Samudrala
2012-03-12  8:12   ` Dor Laor
