* [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar @ 2010-10-20  8:54 UTC
  To: rusty, davem, mst
  Cc: arnd, eric.dumazet, netdev, avi, anthony, kvm, Krishna Kumar

The following set of patches implements transmit MQ in virtio-net.  Also
included are the userspace qemu changes.  MQ is disabled by default
unless qemu enables it.

                  Changes from rev2:
                  ------------------
1. Define (in virtio_net.h) the maximum number of send txqs, and use it
   in virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, so that sq[0] to sq[n] are
   cache-line aligned.  Another option was to define 'send_queue'
   as:
       struct send_queue {
               struct virtqueue *svq;
               struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
       } ____cacheline_aligned_in_smp;
   and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
   the submitted method is preferable.
3. Changed the vhost model such that vhost[0] handles RX and
   vhost[1-MAX] handle TX[0-n].
4. Further changed TX handling such that vhost[0] handles both RX and
   TX in the single-stream case.

                  Enabling MQ on virtio:
                  -----------------------
When the following options are passed to qemu:
        - smp > 1
        - vhost=on
        - mq=on (new option, default:off)
then #txqueues = #cpus.  The number of txqueues can be changed using
the optional 'numtxqs' option, e.g. for an smp=4 guest:
        vhost=on                   ->   #txqueues = 1
        vhost=on,mq=on             ->   #txqueues = 4
        vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
        vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
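
For instance, a 4-vcpu guest with two transmit queues could be started
along these lines (illustrative only; the exact -netdev/-device syntax
is an assumption, since this patchset only defines the vhost/mq/numtxqs
options themselves):

        qemu-system-x86_64 -smp 4 ... \
                -netdev tap,id=hostnet0,vhost=on,mq=on,numtxqs=2 \
                -device virtio-net-pci,netdev=hostnet0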


                   Performance (guest -> local host):
                   -----------------------------------
System configuration:
        Host:  8 Intel Xeon, 8 GB memory
        Guest: 4 cpus, 2 GB memory
Test: Each test case runs for 60 secs, summed over three runs (except
when the number of netperf sessions is 1, which uses 10 runs of 12 secs
each).  No tuning (default netperf) other than pinning the vhost
threads to cpus 0-3 with taskset.  numtxqs=32 gave the best results
even though the guest had only 4 vcpus (I haven't tried beyond that).

______________ numtxqs=2, vhosts=3  ____________________
#sessions  BW%      CPU%    RCPU%    SD%      RSD%
________________________________________________________
1          4.46    -1.96     .19     -12.50   -6.06
2          4.93    -1.16    2.10      0       -2.38
4          46.17    64.77   33.72     19.51   -2.48
8          47.89    70.00   36.23     41.46    13.35
16         48.97    80.44   40.67     21.11   -5.46
24         49.03    78.78   41.22     20.51   -4.78
32         51.11    77.15   42.42     15.81   -6.87
40         51.60    71.65   42.43     9.75    -8.94
48         50.10    69.55   42.85     11.80   -5.81
64         46.24    68.42   42.67     14.18   -3.28
80         46.37    63.13   41.62     7.43    -6.73
96         46.40    63.31   42.20     9.36    -4.78
128        50.43    62.79   42.16     13.11   -1.23
________________________________________________________
BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%

______________ numtxqs=8, vhosts=5  ____________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
________________________________________________________
1           -.76    -1.56     2.33      0        3.03
2           17.41    11.11    11.41     0       -4.76
4           42.12    55.11    30.20     19.51    .62
8           54.69    80.00    39.22     24.39    -3.88
16          54.77    81.62    40.89     20.34    -6.58
24          54.66    79.68    41.57     15.49    -8.99
32          54.92    76.82    41.79     17.59    -5.70
40          51.79    68.56    40.53     15.31    -3.87
48          51.72    66.40    40.84     9.72     -7.13
64          51.11    63.94    41.10     5.93     -8.82
80          46.51    59.50    39.80     9.33     -4.18
96          47.72    57.75    39.84     4.20     -7.62
128         54.35    58.95    40.66     3.24     -8.63
________________________________________________________
BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%

______________ numtxqs=16, vhosts=5  ___________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
________________________________________________________
1           -1.43    -3.52    1.55      0          3.03
2           33.09     21.63   20.12    -10.00     -9.52
4           67.17     94.60   44.28     19.51     -11.80
8           75.72     108.14  49.15     25.00     -10.71
16          80.34     101.77  52.94     25.93     -4.49
24          70.84     93.12   43.62     27.63     -5.03
32          69.01     94.16   47.33     29.68     -1.51
40          58.56     63.47   25.91    -3.92      -25.85
48          61.16     74.70   34.88     .89       -22.08
64          54.37     69.09   26.80    -6.68      -30.04
80          36.22     22.73   -2.97    -8.25      -27.23
96          41.51     50.59   13.24     9.84      -16.77
128         48.98     38.15   6.41     -.33       -22.80
________________________________________________________
BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%

______________ numtxqs=32, vhosts=5  ___________________
#sessions    BW%       CPU%    RCPU%    SD%     RSD%
________________________________________________________
1            7.62     -38.03   -26.26  -50.00   -33.33
2            28.95     20.46    21.62   0       -7.14
4            84.05     60.79    45.74  -2.43    -12.42
8            86.43     79.57    50.32   15.85   -3.10
16           88.63     99.48    58.17   9.47    -13.10
24           74.65     80.87    41.99  -1.81    -22.89
32           63.86     59.21    23.58  -18.13   -36.37
40           64.79     60.53    22.23  -15.77   -35.84
48           49.68     26.93    .51    -36.40   -49.61
64           54.69     36.50    5.41   -26.59   -43.23
80           45.06     12.72   -13.25  -37.79   -52.08
96           40.21    -3.16    -24.53  -39.92   -52.97
128          36.33    -33.19   -43.66  -5.68    -20.49
________________________________________________________
BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%


Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---


* [v3 RFC PATCH 1/4] Change virtqueue structure
From: Krishna Kumar @ 2010-10-20  8:54 UTC
  To: rusty, davem, mst
  Cc: eric.dumazet, kvm, netdev, arnd, avi, anthony, Krishna Kumar

Move queue_index from virtio_pci_vq_info to virtqueue.  This
allows callback handlers to figure out the queue number for
the vq that needs attention.
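
For illustration (not part of this patch), a per-queue send callback can
then be written as below; the virtio-net patch (2/4) does exactly this
in skb_xmit_done():

	/* sketch: derive the TX queue number directly from the vq */
	static void example_xmit_done(struct virtqueue *svq)
	{
		struct virtnet_info *vi = svq->vdev->priv;
		int qnum = svq->queue_index - 1;	/* vq 0 is the RX vq */

		virtqueue_disable_cb(svq);	/* suppress further interrupts */
		netif_wake_subqueue(vi->dev, qnum);
	}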

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>  
---
 drivers/virtio/virtio_pci.c |   10 +++-------
 include/linux/virtio.h      |    1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org/include/linux/virtio.h new.dynamic.optimize_vhost/include/linux/virtio.h
--- org/include/linux/virtio.h	2010-10-11 10:20:22.000000000 +0530
+++ new.dynamic.optimize_vhost/include/linux/virtio.h	2010-10-15 13:25:42.000000000 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	int queue_index;	/* the index of the queue */
 	void *priv;
 };
 
diff -ruNp org/drivers/virtio/virtio_pci.c new.dynamic.optimize_vhost/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c	2010-10-11 10:20:15.000000000 +0530
+++ new.dynamic.optimize_vhost/drivers/virtio/virtio_pci.c	2010-10-15 13:25:42.000000000 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -185,11 +182,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -385,7 +381,6 @@ static struct virtqueue *setup_vq(struct
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -408,6 +403,7 @@ static struct virtqueue *setup_vq(struct
 		goto out_activate_queue;
 	}
 
+	vq->queue_index = index;
 	vq->priv = info;
 	info->vq = vq;
 
@@ -446,7 +442,7 @@ static void vp_del_vq(struct virtqueue *
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,


* [v3 RFC PATCH 2/4] Changes for virtio-net
From: Krishna Kumar @ 2010-10-20  8:55 UTC
  To: rusty, davem, mst
  Cc: kvm, arnd, netdev, avi, anthony, eric.dumazet, Krishna Kumar

Implement the mq virtio-net driver.

Though struct virtio_net_config changes, it works with old
qemus since the last element is not accessed unless qemu
sets VIRTIO_NET_F_NUMTXQS.  The patch also adds a macro for
the maximum number of TX vq's (VIRTIO_MAX_SQ) that the user
can specify.
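
A minimal sketch of how the driver consumes the new config field
(this mirrors the probe-time code later in this patch):

	u16 numtxqs;

	/* Read numtxqs only if the host set VIRTIO_NET_F_NUMTXQS;
	 * otherwise fall back to a single TX queue. */
	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
				offsetof(struct virtio_net_config, numtxqs),
				&numtxqs);
	if (err || !numtxqs)
		numtxqs = 1;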
        
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---     
 drivers/net/virtio_net.c   |  234 ++++++++++++++++++++++++++---------
 include/linux/virtio_net.h |    6 
 2 files changed, 185 insertions(+), 55 deletions(-)

diff -ruNp org/include/linux/virtio_net.h new.dynamic.optimize_vhost/include/linux/virtio_net.h
--- org/include/linux/virtio_net.h	2010-10-11 10:20:22.000000000 +0530
+++ new.dynamic.optimize_vhost/include/linux/virtio_net.h	2010-10-19 13:24:38.000000000 +0530
@@ -7,6 +7,9 @@
 #include <linux/virtio_config.h>
 #include <linux/if_ether.h>
 
+/* Maximum number of TX queues supported */
+#define VIRTIO_MAX_SQ 32
+
 /* The feature bitmap for virtio net */
 #define VIRTIO_NET_F_CSUM	0	/* Host handles pkts w/ partial csum */
 #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
@@ -26,6 +29,7 @@
 #define VIRTIO_NET_F_CTRL_RX	18	/* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queue */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -34,6 +38,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* number of transmit queues */
+	__u16 numtxqs;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org/drivers/net/virtio_net.c new.dynamic.optimize_vhost/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2010-10-11 10:20:02.000000000 +0530
+++ new.dynamic.optimize_vhost/drivers/net/virtio_net.c	2010-10-19 17:01:53.000000000 +0530
@@ -40,11 +40,24 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
+/* Our representation of a send virtqueue */
+struct send_queue {
+	struct virtqueue *svq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+};
+
 struct virtnet_info {
+	struct send_queue **sq;
+	struct napi_struct napi ____cacheline_aligned_in_smp;
+
+	/* read-mostly variables */
+	int numtxqs ____cacheline_aligned_in_smp;
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	struct virtqueue *rvq;
+	struct virtqueue *cvq;
 	struct net_device *dev;
-	struct napi_struct napi;
 	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
@@ -62,9 +75,8 @@ struct virtnet_info {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* fragments + linear part + virtio header */
+	/* RX: fragments + linear part + virtio header */
 	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -120,12 +132,13 @@ static struct page *get_a_page(struct vi
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
+	int qnum = svq->queue_index - 1;	/* 0 is RX vq */
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, qnum);
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -495,12 +508,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
@@ -510,7 +524,8 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *svq, struct scatterlist *tx_sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -548,12 +563,12 @@ static int xmit_skb(struct virtnet_info 
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
 					0, skb);
 }
 
@@ -561,31 +576,34 @@ static netdev_tx_t start_xmit(struct sk_
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *svq = vi->sq[qnum]->svq;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, svq, vi->sq[qnum]->tx_sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (net_ratelimit()) {
 			if (likely(capacity == -ENOMEM)) {
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					 "TXQ (%d) failure: out of memory\n",
+					 qnum);
 			} else {
 				dev->stats.tx_fifo_errors++;
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					 "Unexpected TXQ (%d) failure: %d\n",
+					 qnum, capacity);
 			}
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -594,13 +612,13 @@ static netdev_tx_t start_xmit(struct sk_
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -871,10 +889,10 @@ static void virtnet_update_status(struct
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 }
 
@@ -885,18 +903,122 @@ static void virtnet_config_changed(struc
 	virtnet_update_status(vi);
 }
 
+#define MAX_DEVICE_NAME		16
+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;
+
+	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
+	if (!vi->sq)
+		goto out;
+	for (i = 0; i < numtxqs; i++) {
+		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
+		if (!vi->sq[i])
+			goto out;
+	}
+
+	/* setup initial send queue parameters */
+	for (i = 0; i < numtxqs; i++)
+		sg_init_table(vi->sq[i]->tx_sg, ARRAY_SIZE(vi->sq[i]->tx_sg));
+
+	/*
+	 * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
+	 * optionally one control virtqueue.
+	 */
+	totalvqs = 1 + numtxqs +
+		   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Setup parameters for find_vqs */
+	vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto free_mem;
+
+	/* Parameters for recv virtqueue */
+	callbacks[0] = skb_recv_done;
+	names[0] = "input";
+
+	/* Parameters for send virtqueues */
+	for (i = 1; i <= numtxqs; i++) {
+		callbacks[i] = skb_xmit_done;
+		names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
+				   GFP_KERNEL);
+		if (!names[i])
+			goto free_mem;
+		sprintf(names[i], "output.%d", i - 1);
+	}
+
+	/* Parameters for control virtqueue, if any */
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "control";
+	}
+
+	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
+					 (const char **)names);
+	if (err)
+		goto free_mem;
+
+	vi->rvq = vqs[0];
+	for (i = 0; i < numtxqs; i++)
+		vi->sq[i]->svq = vqs[i + 1];
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		vi->cvq = vqs[i + 1];
+
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	}
+
+free_mem:
+	if (names) {
+		for (i = 1; i <= numtxqs; i++)
+			kfree(names[i]);
+		kfree(names);
+	}
+
+	kfree(callbacks);
+	kfree(vqs);
+
+out:
+	if (err) {
+		for (i = 0; i < numtxqs; i++)
+			kfree(vi->sq[i]);
+		kfree(vi->sq);
+	}
+
+	return err;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
+	u16 numtxqs;
 	struct net_device *dev;
 	struct virtnet_info *vi;
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs;
+
+	/*
+	 * Find if host passed the number of transmit queues supported
+	 * by the device
+	 */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
+				offsetof(struct virtio_net_config, numtxqs),
+				&numtxqs);
+
+	/* We need atleast one txq */
+	if (err || !numtxqs)
+		numtxqs = 1;
+
+	if (numtxqs > VIRTIO_MAX_SQ)
+		return -EINVAL;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -940,9 +1062,9 @@ static int virtnet_probe(struct virtio_d
 	vi->vdev = vdev;
 	vdev->priv = vi;
 	vi->pages = NULL;
+	vi->numtxqs = numtxqs;
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -953,23 +1075,10 @@ static int virtnet_probe(struct virtio_d
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
-	err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+	/* Initialize our rx/tx queue parameters, and invoke find_vqs */
+	err = initialize_vqs(vi, numtxqs);
 	if (err)
-		goto free;
-
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
-
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
-
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			dev->features |= NETIF_F_HW_VLAN_FILTER;
-	}
+		goto free_netdev;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -986,6 +1095,9 @@ static int virtnet_probe(struct virtio_d
 		goto unregister;
 	}
 
+	dev_info(&dev->dev, "(virtio-net) Allocated 1 RX and %d TX vq's\n",
+		 numtxqs);
+
 	vi->status = VIRTIO_NET_S_LINK_UP;
 	virtnet_update_status(vi);
 	netif_carrier_on(dev);
@@ -998,7 +1110,10 @@ unregister:
 	cancel_delayed_work_sync(&vi->refill);
 free_vqs:
 	vdev->config->del_vqs(vdev);
-free:
+	for (i = 0; i < numtxqs; i++)
+		kfree(vi->sq[i]);
+	kfree(vi->sq);
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
@@ -1006,12 +1121,21 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i]->svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
+		kfree(vi->sq[i]);
 	}
+	kfree(vi->sq);
+
 	while (1) {
 		buf = virtqueue_detach_unused_buf(vi->rvq);
 		if (!buf)
@@ -1059,7 +1183,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
-	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
 };
 
 static struct virtio_driver virtio_net_driver = {


* [v3 RFC PATCH 3/4] Changes for vhost
From: Krishna Kumar @ 2010-10-20  8:55 UTC
  To: rusty, davem, mst
  Cc: arnd, eric.dumazet, netdev, avi, anthony, kvm, Krishna Kumar

Changes for mq vhost.

vhost_net_open is changed to only allocate a vhost_net and
return.  The remaining initializations are delayed till
SET_OWNER.  SET_OWNER is changed so that its argument is
used to determine how many txqs to use.  Unmodified qemus
will pass NULL, which is recognized and handled as
numtxqs=1.
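
A minimal sketch of the userspace side (mirroring the qemu change in
patch 4/4; the value arrives in the kernel as the ioctl 'arg'):

	/* request 'numtxqs' TX queues; 0/NULL keeps single-queue behaviour */
	r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);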

Besides changing handle_tx to use 'vq', this patch also
changes handle_rx to take a vq as parameter.  The mq RX
patch requires this change, but till then it is consistent
(and less confusing) to keep the interfaces for handling
rx and tx similar.

vhost thread handling for RX and TX is as follows.  The first
vhost thread handles RX traffic, while the remaining threads
handle TX.  The number of TX threads is capped at
MAX_TXQ_THREADS (4), so threads handle more than one txq when
#txqs is more than MAX_TXQ_THREADS.  When the guest is started
with >1 txqs and there is only one stream of traffic from the
guest, that is recognized and handled such that vhost[0]
processes both RX and TX.  This can change dynamically.
vhost_poll has a new element, find_vq(), which allows optimizing
some code for cases where numtxqs=1 or a packet on vhost[0]
needs processing.
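
As a worked example of the resulting vq-to-thread mapping (derived from
get_nvhosts() and vhost_get_buddy_thread() in this patch), a guest
started with numtxqs=8 gives nvqs=9 and nvhosts=5:

	vq[0] (RX)         -> vhost thread 0
	vq[1..4] (TX 0-3)  -> vhost threads 1..4 (one thread each)
	vq[5..8] (TX 4-7)  -> vhost threads 1..4 (shared via (i-1) % MAX_TXQ_THREADS + 1)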

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/vhost/net.c   |  284 ++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.c |  275 ++++++++++++++++++++++++++++----------
 drivers/vhost/vhost.h |   42 +++++
 3 files changed, 430 insertions(+), 171 deletions(-)

diff -ruNp org/drivers/vhost/vhost.h new/drivers/vhost/vhost.h
--- org/drivers/vhost/vhost.h	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/vhost.h	2010-10-20 14:11:23.000000000 +0530
@@ -35,11 +35,13 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_virtqueue	  *(*find_vq)(struct vhost_poll *poll);
+	struct vhost_virtqueue	  *vq;  /* points back to vq */
 };
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue *vq,
+		     int single_queue);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -108,6 +110,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	struct task_struct *worker; /* vhost for this vq, can be shared */
+	spinlock_t *work_lock;
+	struct list_head *work_list;
+	int qnum;		/* 0 for RX, 1 -> n-1 for TX */
 };
 
 struct vhost_dev {
@@ -119,15 +125,39 @@ struct vhost_dev {
 	struct mutex mutex;
 	unsigned acked_features;
 	struct vhost_virtqueue *vqs;
+	unsigned long *jiffies;
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	spinlock_t *work_lock;
+	struct list_head *work_list;
 };
 
-long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
+/*
+ * Define maximum number of TX threads, and use that to have a maximum
+ * number of vhost threads to handle RX & TX. First thread handles RX.
+ * If guest is started with #txqs=1, only one vhost thread is started.
+ * Else, upto MAX_VHOST_THREADS are started where th[0] handles RX and
+ * remaining handles TX. However, vhost_poll_queue has an optimization
+ * where th[0] is selected for both RX & TX if there is only one flow.
+ */
+#define MAX_TXQ_THREADS		4
+#define MAX_VHOST_THREADS	(MAX_TXQ_THREADS + 1)
+
+static inline int get_nvhosts(int nvqs)
+{
+	int num_vhosts = nvqs - 1;
+
+	if (nvqs > 2)
+		num_vhosts = min_t(int, nvqs, MAX_VHOST_THREADS);
+
+	return num_vhosts;
+}
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
+void vhost_free_vqs(struct vhost_dev *dev);
+long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs,
+		    int nvhosts);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
 void vhost_dev_cleanup(struct vhost_dev *);
diff -ruNp org/drivers/vhost/net.c new/drivers/vhost/net.c
--- org/drivers/vhost/net.c	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/net.c	2010-10-20 14:20:10.000000000 +0530
@@ -33,12 +33,6 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
-enum {
-	VHOST_NET_VQ_RX = 0,
-	VHOST_NET_VQ_TX = 1,
-	VHOST_NET_VQ_MAX = 2,
-};
-
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -47,12 +41,13 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
-	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct vhost_virtqueue *vqs;
+	struct vhost_poll *poll;
+	struct socket **socks;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
-	enum vhost_net_poll_state tx_poll_state;
+	enum vhost_net_poll_state *tx_poll_state;
 };
 
 /* Pop first len bytes from iovec. Return number of segments used. */
@@ -92,28 +87,28 @@ static void copy_iovec_hdr(const struct 
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
 {
-	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+	if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
 		return;
-	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+	vhost_poll_stop(&net->poll[qnum]);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
 {
-	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+	if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
 		return;
-	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-	net->tx_poll_state = VHOST_NET_POLL_STARTED;
+	vhost_poll_start(&net->poll[qnum], sock->file);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -134,7 +129,7 @@ static void handle_tx(struct vhost_net *
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
 		mutex_lock(&vq->mutex);
-		tx_poll_start(net, sock);
+		tx_poll_start(net, sock, vq->qnum);
 		mutex_unlock(&vq->mutex);
 		return;
 	}
@@ -144,7 +139,7 @@ static void handle_tx(struct vhost_net *
 	vhost_disable_notify(vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
-		tx_poll_stop(net);
+		tx_poll_stop(net, vq->qnum);
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
@@ -159,7 +154,7 @@ static void handle_tx(struct vhost_net *
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
+				tx_poll_start(net, sock, vq->qnum);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
@@ -189,7 +184,7 @@ static void handle_tx(struct vhost_net *
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			tx_poll_start(net, sock, vq->qnum);
 			break;
 		}
 		if (err != len)
@@ -282,9 +277,9 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
+static void handle_rx_big(struct vhost_virtqueue *vq,
+			  struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned out, in, log, s;
 	int head;
 	struct vhost_log *vq_log;
@@ -393,9 +388,9 @@ static void handle_rx_big(struct vhost_n
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_mergeable(struct vhost_net *net)
+static void handle_rx_mergeable(struct vhost_virtqueue *vq,
+				struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -500,96 +495,184 @@ static void handle_rx_mergeable(struct v
 	unuse_mm(net->dev.mm);
 }
 
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_virtqueue *vq)
 {
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
-		handle_rx_mergeable(net);
+		handle_rx_mergeable(vq, net);
 	else
-		handle_rx_big(net);
+		handle_rx_big(vq, net);
 }
 
 static void handle_tx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_tx(net);
+	handle_tx(vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(vq);
 }
 
 static void handle_tx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_tx(vq);
 }
 
 static void handle_rx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_rx(vq);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
+void vhost_free_vqs(struct vhost_dev *dev)
 {
-	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+
+	kfree(dev->work_list);
+	kfree(dev->work_lock);
+	kfree(dev->jiffies);
+	kfree(n->socks);
+	kfree(n->tx_poll_state);
+	kfree(n->poll);
+	kfree(n->vqs);
+
+	/*
+	 * Reset so that vhost_net_release (after vhost_dev_set_owner call)
+	 * will notice.
+	 */
+	n->vqs = NULL;
+	n->poll = NULL;
+	n->socks = NULL;
+	n->tx_poll_state = NULL;
+	dev->jiffies = NULL;
+	dev->work_lock = NULL;
+	dev->work_list = NULL;
+}
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
+{
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+	int nvhosts;
+	int i, nvqs;
+	int ret;
+
+	if (numtxqs < 0 || numtxqs > VIRTIO_MAX_SQ)
+		return -EINVAL;
+
+	if (numtxqs == 0) {
+		/* Old qemu doesn't pass arguments to set_owner, use 1 txq */
+		numtxqs = 1;
+	}
+
+	/* Get total number of virtqueues */
+	nvqs = numtxqs + 1;
+
+	/* Get total number of vhost threads */
+	nvhosts = get_nvhosts(nvqs);
+
+	n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
+	n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
+	n->socks = kmalloc(nvqs * sizeof(*n->socks), GFP_KERNEL);
+	n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
+				   GFP_KERNEL);
+	dev->jiffies = kzalloc(numtxqs * sizeof(*dev->jiffies), GFP_KERNEL);
+	dev->work_lock = kmalloc(nvhosts * sizeof(*dev->work_lock),
+				 GFP_KERNEL);
+	dev->work_list = kmalloc(nvhosts * sizeof(*dev->work_list),
+				 GFP_KERNEL);
+
+	if (!n->vqs || !n->poll || !n->socks || !n->tx_poll_state ||
+	    !dev->jiffies || !dev->work_lock || !dev->work_list) {
+		ret = -ENOMEM;
+		goto err;
+	}
 
-	if (!n)
-		return -ENOMEM;
+	/* 1 RX, followed by 'numtxqs' TX queues */
+	n->vqs[0].handle_kick = handle_rx_kick;
 
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
-	}
+	for (i = 1; i < nvqs; i++)
+		n->vqs[i].handle_kick = handle_tx_kick;
+
+	ret = vhost_dev_init(dev, n->vqs, nvqs, nvhosts);
+	if (ret < 0)
+		goto err;
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	vhost_poll_init(&n->poll[0], handle_rx_net, POLLIN, &n->vqs[0], 1);
 
-	f->private_data = n;
+	for (i = 1; i < nvqs; i++) {
+		vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
+				&n->vqs[i], (nvqs == 2));
+		n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
+	}
 
 	return 0;
+
+err:
+	/* Free all pointers that may have been allocated */
+	vhost_free_vqs(dev);
+
+	return ret;
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int ret = ENOMEM;
+
+	if (n) {
+		struct vhost_dev *dev = &n->dev;
+
+		f->private_data = n;
+		mutex_init(&dev->mutex);
+
+		/* Defer all other initialization till user does SET_OWNER */
+		ret = 0;
+	}
+
+	return ret;
 }
 
 static void vhost_net_disable_vq(struct vhost_net *n,
 				 struct vhost_virtqueue *vq)
 {
+	int qnum = vq->qnum;
+
 	if (!vq->private_data)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		tx_poll_stop(n);
-		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	if (qnum) {	/* TX */
+		tx_poll_stop(n, qnum);
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(&n->poll[qnum]);
 }
 
 static void vhost_net_enable_vq(struct vhost_net *n,
 				struct vhost_virtqueue *vq)
 {
 	struct socket *sock = vq->private_data;
+	int qnum = vq->qnum;
+
 	if (!sock)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
-		tx_poll_start(n, sock);
+
+	if (qnum) {	/* TX */
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
+		tx_poll_start(n, sock, qnum);
 	} else
-		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+		vhost_poll_start(&n->poll[qnum], sock->file);
 }
 
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
@@ -605,11 +688,12 @@ static struct socket *vhost_net_stop_vq(
 	return sock;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		n->socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
 }
 
 static void vhost_net_flush_vq(struct vhost_net *n, int index)
@@ -620,26 +704,33 @@ static void vhost_net_flush_vq(struct vh
 
 static void vhost_net_flush(struct vhost_net *n)
 {
-	vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
-	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		vhost_net_flush_vq(n, i);
 }
 
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
+	struct vhost_dev *dev = &n->dev;
+	int i;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
-	vhost_dev_cleanup(&n->dev);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	vhost_dev_cleanup(dev);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+
+	/* Free all old pointers */
+	vhost_free_vqs(dev);
+
 	kfree(n);
 	return 0;
 }
@@ -717,7 +808,7 @@ static long vhost_net_set_backend(struct
 	if (r)
 		goto err;
 
-	if (index >= VHOST_NET_VQ_MAX) {
+	if (index >= n->dev.nvqs) {
 		r = -ENOBUFS;
 		goto err;
 	}
@@ -738,9 +829,9 @@ static long vhost_net_set_backend(struct
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock != oldsock) {
-                vhost_net_disable_vq(n, vq);
-                rcu_assign_pointer(vq->private_data, sock);
-                vhost_net_enable_vq(n, vq);
+		vhost_net_disable_vq(n, vq);
+		rcu_assign_pointer(vq->private_data, sock);
+		vhost_net_enable_vq(n, vq);
 	}
 
 	mutex_unlock(&vq->mutex);
@@ -762,22 +853,25 @@ err:
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
 	long err;
+	int i;
+
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
-	if (err)
-		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	if (err) {
+		mutex_unlock(&n->dev.mutex);
+		return err;
+	}
+
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
-done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	return err;
 }
 
@@ -806,7 +900,7 @@ static int vhost_net_set_features(struct
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+	for (i = 0; i < n->dev.nvqs; ++i) {
 		mutex_lock(&n->vqs[i].mutex);
 		n->vqs[i].vhost_hlen = vhost_hlen;
 		n->vqs[i].sock_hlen = sock_hlen;
diff -ruNp org/drivers/vhost/vhost.c new/drivers/vhost/vhost.c
--- org/drivers/vhost/vhost.c	2010-10-11 10:21:14.000000000 +0530
+++ new/drivers/vhost/vhost.c	2010-10-20 14:20:04.000000000 +0530
@@ -69,16 +69,70 @@ static void vhost_work_init(struct vhost
 	work->queue_seq = work->done_seq = 0;
 }
 
+/*
+ * __vhost_sq_find_vq: This is the poll->find_vq() handler for cases:
+ *	- #numtxqs == 1; or
+ *	- this is an RX vq
+ */
+static struct vhost_virtqueue *__vhost_sq_find_vq(struct vhost_poll *poll)
+{
+	return poll->vq;
+}
+
+/* Define how recently a txq was used, beyond this it is considered unused */
+#define RECENTLY_USED  5
+
+/*
+ * __vhost_mq_find_vq: This is the poll->find_vq() handler for cases:
+ *	- #numtxqs > 1, and
+ *	- this is a TX vq
+ *
+ * Algorithm for selecting vq:
+ *
+ *	Condition:					Return:
+ *	If all txqs unused				vq[0]
+ *	If one txq used, and new txq is same		vq[0]
+ *	If one txq used, and new txq is different	vq[vq->qnum]
+ *	If > 1 txqs used				vq[vq->qnum]
+ * Where "used" means the txq was used in the last RECENTLY_USED jiffies.
+ *
+ * Note: locking is not required as an update race will only result in
+ * a different worker being woken up.
+ */
+static struct vhost_virtqueue *__vhost_mq_find_vq(struct vhost_poll *poll)
+{
+	struct vhost_dev *dev = poll->vq->dev;
+	struct vhost_virtqueue *vq = &dev->vqs[0];
+	unsigned long max_time = jiffies - RECENTLY_USED;
+	unsigned long *table = dev->jiffies;
+	int i, used = 0;
+
+	for (i = 0; i < dev->nvqs - 1; i++) {
+		if (time_after_eq(table[i], max_time) && ++used > 1) {
+			vq = poll->vq;
+			break;
+		}
+	}
+
+	table[poll->vq->qnum - 1] = jiffies;
+	return vq;
+}
+
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq,
+		     int single_queue)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
+	if (single_queue)
+		poll->find_vq = __vhost_sq_find_vq;
+	else
+		poll->find_vq = __vhost_mq_find_vq;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -98,25 +152,25 @@ void vhost_poll_stop(struct vhost_poll *
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
 
-static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_flush(struct vhost_poll *poll, struct vhost_work *work)
 {
 	unsigned seq;
 	int left;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	wait_event(work->done, ({
-		   spin_lock_irq(&dev->work_lock);
+		   spin_lock_irq(poll->vq->work_lock);
 		   left = seq - work->done_seq <= 0;
-		   spin_unlock_irq(&dev->work_lock);
+		   spin_unlock_irq(poll->vq->work_lock);
 		   left;
 	}));
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	BUG_ON(flushing < 0);
 }
 
@@ -124,26 +178,28 @@ static void vhost_work_flush(struct vhos
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_flush(poll, &poll->work);
 }
 
-static inline void vhost_work_queue(struct vhost_dev *dev,
+static inline void vhost_work_queue(struct vhost_virtqueue *vq,
 				    struct vhost_work *work)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(vq->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, vq->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(vq->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(vq->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	struct vhost_virtqueue *vq = poll->find_vq(poll);
+
+	vhost_work_queue(vq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -174,7 +230,7 @@ static void vhost_vq_reset(struct vhost_
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_virtqueue *vq = data;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -182,7 +238,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(vq->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -190,18 +246,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(vq->work_lock);
 			__set_current_state(TASK_RUNNING);
 			return 0;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(vq->work_list)) {
+			work = list_first_entry(vq->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(vq->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -251,8 +307,19 @@ static void vhost_dev_free_iovecs(struct
 	}
 }
 
+/* Get index of an existing thread that will handle this txq */
+static int vhost_get_buddy_thread(int index, int nvhosts)
+{
+	int buddy = 0;
+
+	if (nvhosts > 1)
+		buddy = (index - 1) % MAX_TXQ_THREADS + 1;
+
+	return buddy;
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
-		    struct vhost_virtqueue *vqs, int nvqs)
+		    struct vhost_virtqueue *vqs, int nvqs, int nvhosts)
 {
 	int i;
 
@@ -263,20 +330,37 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].log = NULL;
-		dev->vqs[i].indirect = NULL;
-		dev->vqs[i].heads = NULL;
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+		int single_queue = (!i || dev->nvqs == 2);
+
+		if (i < nvhosts) {
+			spin_lock_init(&dev->work_lock[i]);
+			INIT_LIST_HEAD(&dev->work_list[i]);
+
+			vq->work_lock = &dev->work_lock[i];
+			vq->work_list = &dev->work_list[i];
+		} else {
+			/* Share work with another thread */
+			int j = vhost_get_buddy_thread(i, nvhosts);
+
+			vq->work_lock = &dev->work_lock[j];
+			vq->work_list = &dev->work_list[j];
+		}
+
+		vq->worker = NULL;
+		vq->qnum = i;
+		vq->log = NULL;
+		vq->indirect = NULL;
+		vq->heads = NULL;
+		vq->dev = dev;
+		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
+		if (vq->handle_kick)
+			vhost_poll_init(&vq->poll,
+					vq->handle_kick, POLLIN, vq,
+					single_queue);
 	}
 
 	return 0;
@@ -290,61 +374,116 @@ long vhost_dev_check_owner(struct vhost_
 }
 
 struct vhost_attach_cgroups_struct {
-        struct vhost_work work;
-        struct task_struct *owner;
-        int ret;
+	struct vhost_work work;
+	struct task_struct *owner;
+	int ret;
 };
 
 static void vhost_attach_cgroups_work(struct vhost_work *work)
 {
-        struct vhost_attach_cgroups_struct *s;
-        s = container_of(work, struct vhost_attach_cgroups_struct, work);
-        s->ret = cgroup_attach_task_all(s->owner, current);
+	struct vhost_attach_cgroups_struct *s;
+	s = container_of(work, struct vhost_attach_cgroups_struct, work);
+	s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-        struct vhost_attach_cgroups_struct attach;
-        attach.owner = current;
-        vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-        vhost_work_queue(dev, &attach.work);
-        vhost_work_flush(dev, &attach.work);
-        return attach.ret;
+static int vhost_attach_cgroups(struct vhost_virtqueue *vq)
+{
+	struct vhost_attach_cgroups_struct attach;
+	attach.owner = current;
+	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
+	vhost_work_queue(vq, &attach.work);
+	vhost_work_flush(&vq->poll, &attach.work);
+	return attach.ret;
+}
+
+static void __vhost_stop_workers(struct vhost_dev *dev, int nvhosts)
+{
+	int i;
+
+	for (i = 0; i < nvhosts; i++) {
+		WARN_ON(!list_empty(dev->vqs[i].work_list));
+		if (dev->vqs[i].worker) {
+			kthread_stop(dev->vqs[i].worker);
+			dev->vqs[i].worker = NULL;
+		}
+	}
+}
+
+static void vhost_stop_workers(struct vhost_dev *dev)
+{
+	__vhost_stop_workers(dev, get_nvhosts(dev->nvqs));
+}
+
+static int vhost_start_workers(struct vhost_dev *dev)
+{
+	int nvhosts = get_nvhosts(dev->nvqs);
+	int i, err;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		if (i < nvhosts) {
+			/* Start a new thread */
+			vq->worker = kthread_create(vhost_worker, vq,
+						    "vhost-%d-%d",
+						    current->pid, i);
+			if (IS_ERR(vq->worker)) {
+				i--;	/* no thread to clean at this index */
+				err = PTR_ERR(vq->worker);
+				goto err;
+			}
+
+			wake_up_process(vq->worker);
+
+			/* avoid contributing to loadavg */
+			err = vhost_attach_cgroups(vq);
+			if (err)
+				goto err;
+		} else {
+			/* Share work with an existing thread */
+			int j = vhost_get_buddy_thread(i, nvhosts);
+			struct vhost_virtqueue *share_vq = &dev->vqs[j];
+
+			vq->worker = share_vq->worker;
+		}
+	}
+	return 0;
+
+err:
+	__vhost_stop_workers(dev, i);
+	return err;
 }
 
 /* Caller should have device mutex */
-static long vhost_dev_set_owner(struct vhost_dev *dev)
+static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
 {
-	struct task_struct *worker;
 	int err;
 	/* Is there an owner already? */
 	if (dev->mm) {
 		err = -EBUSY;
 		goto err_mm;
 	}
+
+	err = vhost_setup_vqs(dev, numtxqs);
+	if (err)
+		goto err_mm;
+
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
-	}
-
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
 
-	err = vhost_attach_cgroups(dev);
+	/* Start threads */
+	err =  vhost_start_workers(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_worker;
 
 	err = vhost_dev_alloc_iovecs(dev);
 	if (err)
-		goto err_cgroup;
+		goto err_iovec;
 
 	return 0;
-err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
+err_iovec:
+	vhost_stop_workers(dev);
+	vhost_free_vqs(dev);
 err_worker:
 	if (dev->mm)
 		mmput(dev->mm);
@@ -405,11 +544,7 @@ void vhost_dev_cleanup(struct vhost_dev 
 		mmput(dev->mm);
 	dev->mm = NULL;
 
-	WARN_ON(!list_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+	vhost_stop_workers(dev);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -760,7 +895,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
-		r = vhost_dev_set_owner(d);
+		r = vhost_dev_set_owner(d, arg);
 		goto done;
 	}
 


* [v3 RFC PATCH 4/4] qemu changes
From: Krishna Kumar @ 2010-10-20  8:55 UTC
  To: rusty, davem, mst
  Cc: eric.dumazet, kvm, netdev, arnd, avi, anthony, Krishna Kumar

Changes in qemu to support mq TX.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---     
 hw/vhost.c      |    7 ++++--
 hw/vhost.h      |    2 -
 hw/vhost_net.c  |   16 +++++++++----
 hw/vhost_net.h  |    2 -
 hw/virtio-net.c |   53 ++++++++++++++++++++++++++++++++--------------
 hw/virtio-net.h |    2 +
 hw/virtio-pci.c |    2 +
 net.c           |   17 ++++++++++++++
 net.h           |    1 
 net/tap.c       |   34 ++++++++++++++++++++++++++---
 10 files changed, 107 insertions(+), 29 deletions(-)

diff -ruNp org3/hw/vhost.c new3/hw/vhost.c
--- org3/hw/vhost.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/vhost.c	2010-10-20 12:44:21.000000000 +0530
@@ -580,7 +580,7 @@ static void vhost_virtqueue_cleanup(stru
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs)
 {
     uint64_t features;
     int r;
@@ -592,11 +592,14 @@ int vhost_dev_init(struct vhost_dev *hde
             return -errno;
         }
     }
-    r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+
+    r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
     if (r < 0) {
         goto fail;
     }
 
+    hdev->nvqs = numtxqs + 1;
+
     r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
     if (r < 0) {
         goto fail;
diff -ruNp org3/hw/vhost.h new3/hw/vhost.h
--- org3/hw/vhost.h	2010-07-01 11:42:09.000000000 +0530
+++ new3/hw/vhost.h	2010-10-20 12:47:10.000000000 +0530
@@ -40,7 +40,7 @@ struct vhost_dev {
     unsigned long long log_size;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff -ruNp org3/hw/vhost_net.c new3/hw/vhost_net.c
--- org3/hw/vhost_net.c	2010-09-28 10:07:31.000000000 +0530
+++ new3/hw/vhost_net.c	2010-10-19 19:46:52.000000000 +0530
@@ -36,7 +36,8 @@
 
 struct vhost_net {
     struct vhost_dev dev;
-    struct vhost_virtqueue vqs[2];
+    struct vhost_virtqueue *vqs;
+    int nvqs;
     int backend;
     VLANClientState *vc;
 };
@@ -81,7 +82,8 @@ static int vhost_net_get_fd(VLANClientSt
     }
 }
 
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int numtxqs)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -98,10 +100,14 @@ struct vhost_net *vhost_net_init(VLANCli
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
 
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, numtxqs);
     if (r < 0) {
         goto fail;
     }
+
+    net->nvqs = numtxqs + 1;
+    net->vqs = qemu_malloc(net->nvqs * (sizeof *net->vqs));
+
     if (!tap_has_vnet_hdr_len(backend,
                               sizeof(struct virtio_net_hdr_mrg_rxbuf))) {
         net->dev.features &= ~(1 << VIRTIO_NET_F_MRG_RXBUF);
@@ -131,7 +137,6 @@ int vhost_net_start(struct vhost_net *ne
                              sizeof(struct virtio_net_hdr_mrg_rxbuf));
     }
 
-    net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
     r = vhost_dev_start(&net->dev, dev);
     if (r < 0) {
@@ -188,7 +193,8 @@ void vhost_net_cleanup(struct vhost_net 
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int nvqs)
 {
 	return NULL;
 }
diff -ruNp org3/hw/vhost_net.h new3/hw/vhost_net.h
--- org3/hw/vhost_net.h	2010-07-01 11:42:09.000000000 +0530
+++ new3/hw/vhost_net.h	2010-10-19 19:46:52.000000000 +0530
@@ -6,7 +6,7 @@
 struct vhost_net;
 typedef struct vhost_net VHostNetState;
 
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, int nvqs);
 
 int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
diff -ruNp org3/hw/virtio-net.c new3/hw/virtio-net.c
--- org3/hw/virtio-net.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/virtio-net.c	2010-10-19 21:02:33.000000000 +0530
@@ -32,7 +32,7 @@ typedef struct VirtIONet
     uint8_t mac[ETH_ALEN];
     uint16_t status;
     VirtQueue *rx_vq;
-    VirtQueue *tx_vq;
+    VirtQueue **tx_vq;
     VirtQueue *ctrl_vq;
     NICState *nic;
     QEMUTimer *tx_timer;
@@ -65,6 +65,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint16_t numtxqs;
 } VirtIONet;
 
 /* TODO
@@ -82,6 +83,7 @@ static void virtio_net_get_config(VirtIO
     struct virtio_net_config netcfg;
 
     netcfg.status = n->status;
+    netcfg.numtxqs = n->numtxqs;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -196,6 +198,8 @@ static uint32_t virtio_net_get_features(
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    if (n->numtxqs > 1)
+        features |= (1 << VIRTIO_NET_F_NUMTXQS);
 
     if (peer_has_vnet_hdr(n)) {
         tap_using_vnet_hdr(n->nic->nc.peer, 1);
@@ -659,13 +663,16 @@ static void virtio_net_tx_complete(VLANC
 {
     VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    /*
+     * If this function executes, we are single TX and hence use only txq[0]
+     */
+    virtqueue_push(n->tx_vq[0], &n->async_tx.elem, n->async_tx.len);
+    virtio_notify(&n->vdev, n->tx_vq[0]);
 
     n->async_tx.elem.out_num = n->async_tx.len = 0;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 /* TX */
@@ -679,7 +686,7 @@ static int32_t virtio_net_flush_tx(VirtI
     }
 
     if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         return num_packets;
     }
 
@@ -714,7 +721,7 @@ static int32_t virtio_net_flush_tx(VirtI
         ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
                                       virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
+            virtio_queue_set_notification(n->tx_vq[0], 0);
             n->async_tx.elem = elem;
             n->async_tx.len  = len;
             return -EBUSY;
@@ -771,8 +778,8 @@ static void virtio_net_tx_timer(void *op
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 static void virtio_net_tx_bh(void *opaque)
@@ -786,7 +793,7 @@ static void virtio_net_tx_bh(void *opaqu
     if (unlikely(!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)))
         return;
 
-    ret = virtio_net_flush_tx(n, n->tx_vq);
+    ret = virtio_net_flush_tx(n, n->tx_vq[0]);
     if (ret == -EBUSY) {
         return; /* Notification re-enable handled by tx_complete */
     }
@@ -802,9 +809,9 @@ static void virtio_net_tx_bh(void *opaqu
     /* If less than a full burst, re-enable notification and flush
      * anything that may have come in while we weren't looking.  If
      * we find something, assume the guest is still active and reschedule */
-    virtio_queue_set_notification(n->tx_vq, 1);
-    if (virtio_net_flush_tx(n, n->tx_vq) > 0) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    if (virtio_net_flush_tx(n, n->tx_vq[0]) > 0) {
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         qemu_bh_schedule(n->tx_bh);
         n->tx_waiting = 1;
     }
@@ -820,6 +827,7 @@ static void virtio_net_save(QEMUFile *f,
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
+    qemu_put_be16(f, n->numtxqs);
     qemu_put_be32(f, n->tx_waiting);
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
@@ -849,6 +857,7 @@ static int virtio_net_load(QEMUFile *f, 
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
+    n->numtxqs = qemu_get_be16(f);
     n->tx_waiting = qemu_get_be32(f);
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
@@ -966,11 +975,14 @@ VirtIODevice *virtio_net_init(DeviceStat
                               virtio_net_conf *net)
 {
     VirtIONet *n;
+    int i;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
                                         sizeof(VirtIONet));
 
+    n->numtxqs = conf->peer->numtxqs;
+
     n->vdev.get_config = virtio_net_get_config;
     n->vdev.set_config = virtio_net_set_config;
     n->vdev.get_features = virtio_net_get_features;
@@ -978,8 +990,8 @@ VirtIODevice *virtio_net_init(DeviceStat
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
-    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
 
+    n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
     if (net->tx && strcmp(net->tx, "timer") && strcmp(net->tx, "bh")) {
         fprintf(stderr, "virtio-net: "
                 "Unknown option tx=%s, valid options: \"timer\" \"bh\"\n",
@@ -987,12 +999,21 @@ VirtIODevice *virtio_net_init(DeviceStat
         fprintf(stderr, "Defaulting to \"bh\"\n");
     }
 
+    /* Allocate per tx vq's */
+    n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
+    for (i = 0; i < n->numtxqs; i++) {
+        if (net->tx && !strcmp(net->tx, "timer")) {
+            n->tx_vq[i] = virtio_add_queue(&n->vdev, 256,
+                                           virtio_net_handle_tx_timer);
+        } else {
+            n->tx_vq[i] = virtio_add_queue(&n->vdev, 256,
+                                           virtio_net_handle_tx_bh);
+        }
+    }
     if (net->tx && !strcmp(net->tx, "timer")) {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_timer);
         n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
         n->tx_timeout = net->txtimer;
     } else {
-        n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx_bh);
         n->tx_bh = qemu_bh_new(virtio_net_tx_bh, n);
     }
     n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
diff -ruNp org3/hw/virtio-net.h new3/hw/virtio-net.h
--- org3/hw/virtio-net.h	2010-09-28 10:07:31.000000000 +0530
+++ new3/hw/virtio-net.h	2010-10-19 19:46:52.000000000 +0530
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS    21      /* Supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -72,6 +73,7 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+    uint16_t numtxqs;	/* number of transmit queues */
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org3/hw/virtio-pci.c new3/hw/virtio-pci.c
--- org3/hw/virtio-pci.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/hw/virtio-pci.c	2010-10-19 19:46:52.000000000 +0530
@@ -99,6 +99,7 @@ typedef struct {
     uint32_t addr;
     uint32_t class_code;
     uint32_t nvectors;
+    uint32_t mq;
     BlockConf block;
     NICConf nic;
     uint32_t host_features;
@@ -788,6 +789,7 @@ static PCIDeviceInfo virtio_info[] = {
         .romfile    = "pxe-virtio.bin",
         .qdev.props = (Property[]) {
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+	    DEFINE_PROP_UINT32("mq", VirtIOPCIProxy, mq, 1),
             DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
             DEFINE_PROP_UINT32("x-txtimer", VirtIOPCIProxy,
diff -ruNp org3/net/tap.c new3/net/tap.c
--- org3/net/tap.c	2010-09-28 10:07:31.000000000 +0530
+++ new3/net/tap.c	2010-10-20 12:39:56.000000000 +0530
@@ -320,13 +320,14 @@ static NetClientInfo net_tap_info = {
 static TAPState *net_tap_fd_init(VLANState *vlan,
                                  const char *model,
                                  const char *name,
-                                 int fd,
+                                 int fd, int numtxqs,
                                  int vnet_hdr)
 {
     VLANClientState *nc;
     TAPState *s;
 
     nc = qemu_new_net_client(&net_tap_info, vlan, NULL, model, name);
+    nc->numtxqs = numtxqs;
 
     s = DO_UPCAST(TAPState, nc, nc);
 
@@ -424,6 +425,27 @@ int net_init_tap(QemuOpts *opts, Monitor
 {
     TAPState *s;
     int fd, vnet_hdr = 0;
+    int vhost;
+    int numtxqs = 1;
+
+    vhost = qemu_opt_get_bool(opts, "vhost", 0);
+
+    /*
+     * We support multiple tx queues if:
+     *      1. smp > 1
+     *      2. vhost=on
+     *      3. mq=on
+     * In this case, #txqueues = #cpus. This value can be changed by
+     * using the "numtxqs" option.
+     */
+    if (vhost && smp_cpus > 1) {
+        if (qemu_opt_get_bool(opts, "mq", 0)) {
+#define VIRTIO_MAX_TXQS         32
+            int dflt = MIN(smp_cpus, VIRTIO_MAX_TXQS);
+
+            numtxqs = qemu_opt_get_number(opts, "numtxqs", dflt);
+        }
+    }
 
     if (qemu_opt_get(opts, "fd")) {
         if (qemu_opt_get(opts, "ifname") ||
@@ -457,7 +479,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
+    s = net_tap_fd_init(vlan, "tap", name, fd, numtxqs, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
@@ -486,7 +508,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+    if (vhost) {
         int vhostfd, r;
         if (qemu_opt_get(opts, "vhostfd")) {
             r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
@@ -497,9 +519,13 @@ int net_init_tap(QemuOpts *opts, Monitor
         } else {
             vhostfd = -1;
         }
-        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd, numtxqs);
         if (!s->vhost_net) {
             error_report("vhost-net requested but could not be initialized");
+            if (numtxqs > 1) {
+                error_report("Need vhost support for numtxqs > 1, exiting...");
+                exit(1);
+            }
             return -1;
         }
     } else if (qemu_opt_get(opts, "vhostfd")) {
diff -ruNp org3/net.c new3/net.c
--- org3/net.c	2010-10-19 19:38:11.000000000 +0530
+++ new3/net.c	2010-10-19 19:46:52.000000000 +0530
@@ -849,6 +849,15 @@ static int net_init_nic(QemuOpts *opts,
         return -1;
     }
 
+    if (nd->netdev->numtxqs > 1 && nd->nvectors == DEV_NVECTORS_UNSPECIFIED) {
+        /*
+         * User specified mq for guest, but no "vectors=", tune
+         * it automatically to 'numtxqs' TX + 1 RX + 1 controlq.
+         */
+        nd->nvectors = nd->netdev->numtxqs + 1 + 1;
+        monitor_printf(mon, "nvectors tuned to %d\n", nd->nvectors);
+    }
+
     nd->used = 1;
     nb_nics++;
 
@@ -992,6 +1001,14 @@ static const struct {
             },
 #ifndef _WIN32
             {
+                .name = "mq",
+                .type = QEMU_OPT_BOOL,
+                .help = "enable multiqueue on network i/f",
+            }, {
+                .name = "numtxqs",
+                .type = QEMU_OPT_NUMBER,
+                .help = "optional number of TX queues, if mq is enabled",
+            }, {
                 .name = "fd",
                 .type = QEMU_OPT_STRING,
                 .help = "file descriptor of an already opened tap",
diff -ruNp org3/net.h new3/net.h
--- org3/net.h	2010-10-19 19:38:11.000000000 +0530
+++ new3/net.h	2010-10-19 19:46:52.000000000 +0530
@@ -62,6 +62,7 @@ struct VLANClientState {
     struct VLANState *vlan;
     VLANClientState *peer;
     NetQueue *send_queue;
+    int numtxqs;
     char *model;
     char *name;
     char info_str[256];

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-20  8:54 [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (3 preceding siblings ...)
  2010-10-20  8:55 ` [v3 RFC PATCH 4/4] qemu changes Krishna Kumar
@ 2010-10-25 15:50 ` Krishna Kumar2
  2010-10-25 16:17   ` Michael S. Tsirkin
  2010-10-26  8:57   ` Michael S. Tsirkin
  2011-02-22  7:47 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Simon Horman
  5 siblings, 2 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-25 15:50 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, mst, netdev, rusty

> Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:

Any feedback, comments, objections, issues or bugs about the
patches? Please let me know if something needs to be done.

Some more test results:
_____________________________________________________
         Host->Guest BW (numtxqs=2)
#       BW%     CPU%    RCPU%   SD%     RSD%
_____________________________________________________
1       5.53    .31     .67     -5.88   0
2       -2.11   -1.01   -2.08   4.34    0
4       13.53   10.77   13.87   -1.96   0
8       34.22   22.80   30.53   -8.46   -2.50
16      30.89   24.06   35.17   -5.20   3.20
24      33.22   26.30   43.39   -5.17   7.58
32      30.85   27.27   47.74   -.59    15.51
40      33.80   27.33   48.00   -7.42   7.59
48      45.93   26.33   45.46   -12.24  1.10
64      33.51   27.11   45.00   -3.27   10.30
80      39.28   29.21   52.33   -4.88   12.17
96      32.05   31.01   57.72   -1.02   19.05
128     35.66   32.04   60.00   -.66    20.41
_____________________________________________________
BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%

____________________________________________________
Guest->Host 512 byte (numtxqs=2):
#       BW%     CPU%    RCPU%   SD%     RSD%
_____________________________________________________
1       3.02    -3.84   -4.76   -12.50  -7.69
2       52.77   -15.73  -8.66   -45.31  -40.33
4       -23.14  13.84   7.50    50.58   40.81
8       -21.44  28.08   16.32   63.06   47.43
16      33.53   46.50   27.19   7.61    -6.60
24      55.77   42.81   30.49   -8.65   -16.48
32      52.59   38.92   29.08   -9.18   -15.63
40      50.92   36.11   28.92   -10.59  -15.30
48      46.63   34.73   28.17   -7.83   -12.32
64      45.56   37.12   28.81   -5.05   -10.80
80      44.55   36.60   28.45   -4.95   -10.61
96      43.02   35.97   28.89   -.11    -5.31
128     38.54   33.88   27.19   -4.79   -9.54
_____________________________________________________
BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%


Thanks,

- KK



> [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> Following set of patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes.  MQ is disabled by default unless
> qemu specifies it.
>
>                   Changes from rev2:
>                   ------------------
> 1. Define (in virtio_net.h) the maximum send txqs; and use in
>    virtio-net and vhost-net.
> 2. vi->sq[i] is allocated individually, resulting in cache line
>    aligned sq[0] to sq[n].  Another option was to define
>    'send_queue' as:
>        struct send_queue {
>                struct virtqueue *svq;
>                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>        } ____cacheline_aligned_in_smp;
>    and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
>    the submitted method is preferable.
> 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
>    handles TX[0-n].
> 4. Further change TX handling such that vhost[0] handles both RX/TX
>    for single stream case.
>
>                   Enabling MQ on virtio:
>                   -----------------------
> When following options are passed to qemu:
>         - smp > 1
>         - vhost=on
>         - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using an
> optional 'numtxqs' option.  e.g. for a smp=4 guest:
>         vhost=on                   ->   #txqueues = 1
>         vhost=on,mq=on             ->   #txqueues = 4
>         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
>         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
>
>
>                    Performance (guest -> local host):
>                    -----------------------------------
> System configuration:
>         Host:  8 Intel Xeon, 8 GB memory
>         Guest: 4 cpus, 2 GB memory
> Test: Each test case runs for 60 secs, sum over three runs (except
> when number of netperf sessions is 1, which has 10 runs of 12 secs
> each).  No tuning (default netperf) other than taskset vhost's to
> cpus 0-3.  numtxqs=32 gave the best results though the guest had
> only 4 vcpus (I haven't tried beyond that).
>
> ______________ numtxqs=2, vhosts=3  ____________________
> #sessions  BW%      CPU%    RCPU%    SD%      RSD%
> ________________________________________________________
> 1          4.46    -1.96     .19     -12.50   -6.06
> 2          4.93    -1.16    2.10      0       -2.38
> 4          46.17    64.77   33.72     19.51   -2.48
> 8          47.89    70.00   36.23     41.46    13.35
> 16         48.97    80.44   40.67     21.11   -5.46
> 24         49.03    78.78   41.22     20.51   -4.78
> 32         51.11    77.15   42.42     15.81   -6.87
> 40         51.60    71.65   42.43     9.75    -8.94
> 48         50.10    69.55   42.85     11.80   -5.81
> 64         46.24    68.42   42.67     14.18   -3.28
> 80         46.37    63.13   41.62     7.43    -6.73
> 96         46.40    63.31   42.20     9.36    -4.78
> 128        50.43    62.79   42.16     13.11   -1.23
> ________________________________________________________
> BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
>
> ______________ numtxqs=8, vhosts=5  ____________________
> #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> ________________________________________________________
> 1           -.76    -1.56     2.33      0        3.03
> 2           17.41    11.11    11.41     0       -4.76
> 4           42.12    55.11    30.20     19.51    .62
> 8           54.69    80.00    39.22     24.39    -3.88
> 16          54.77    81.62    40.89     20.34    -6.58
> 24          54.66    79.68    41.57     15.49    -8.99
> 32          54.92    76.82    41.79     17.59    -5.70
> 40          51.79    68.56    40.53     15.31    -3.87
> 48          51.72    66.40    40.84     9.72     -7.13
> 64          51.11    63.94    41.10     5.93     -8.82
> 80          46.51    59.50    39.80     9.33     -4.18
> 96          47.72    57.75    39.84     4.20     -7.62
> 128         54.35    58.95    40.66     3.24     -8.63
> ________________________________________________________
> BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
>
> ______________ numtxqs=16, vhosts=5  ___________________
> #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> ________________________________________________________
> 1           -1.43    -3.52    1.55      0          3.03
> 2           33.09     21.63   20.12    -10.00     -9.52
> 4           67.17     94.60   44.28     19.51     -11.80
> 8           75.72     108.14  49.15     25.00     -10.71
> 16          80.34     101.77  52.94     25.93     -4.49
> 24          70.84     93.12   43.62     27.63     -5.03
> 32          69.01     94.16   47.33     29.68     -1.51
> 40          58.56     63.47   25.91    -3.92      -25.85
> 48          61.16     74.70   34.88     .89       -22.08
> 64          54.37     69.09   26.80    -6.68      -30.04
> 80          36.22     22.73   -2.97    -8.25      -27.23
> 96          41.51     50.59   13.24     9.84      -16.77
> 128         48.98     38.15   6.41     -.33       -22.80
> ________________________________________________________
> BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%
>
> ______________ numtxqs=32, vhosts=5  ___________________
> #            BW%       CPU%    RCPU%    SD%     RSD%
> ________________________________________________________
> 1            7.62     -38.03   -26.26  -50.00   -33.33
> 2            28.95     20.46    21.62   0       -7.14
> 4            84.05     60.79    45.74  -2.43    -12.42
> 8            86.43     79.57    50.32   15.85   -3.10
> 16           88.63     99.48    58.17   9.47    -13.10
> 24           74.65     80.87    41.99  -1.81    -22.89
> 32           63.86     59.21    23.58  -18.13   -36.37
> 40           64.79     60.53    22.23  -15.77   -35.84
> 48           49.68     26.93    .51    -36.40   -49.61
> 64           54.69     36.50    5.41   -26.59   -43.23
> 80           45.06     12.72   -13.25  -37.79   -52.08
> 96           40.21    -3.16    -24.53  -39.92   -52.97
> 128          36.33    -33.19   -43.66  -5.68    -20.49
> ________________________________________________________
> BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%
>
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-25 15:50 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar2
@ 2010-10-25 16:17   ` Michael S. Tsirkin
  2010-10-26  5:10     ` Krishna Kumar2
       [not found]     ` <OF5C53E9CF.FFDF2CE7-ON652577C8.00191D14-652577C8.001C2154@LocalDomain>
  2010-10-26  8:57   ` Michael S. Tsirkin
  1 sibling, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-25 16:17 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> 
> Any feedback, comments, objections, issues or bugs about the
> patches? Please let me know if something needs to be done.

I am trying to wrap my head around kernel/user interface here.
E.g., will we need another incompatible change when we add multiple RX
queues? Also need to think about how robust our single stream heuristic is,
e.g. what are the chances it will misdetect a bidirectional
UDP stream as a single TCP?

> Some more test results:
> _____________________________________________________
>          Host->Guest BW (numtxqs=2)
> #       BW%     CPU%    RCPU%   SD%     RSD%
> _____________________________________________________
> 1       5.53    .31     .67     -5.88   0
> 2       -2.11   -1.01   -2.08   4.34    0
> 4       13.53   10.77   13.87   -1.96   0
> 8       34.22   22.80   30.53   -8.46   -2.50
> 16      30.89   24.06   35.17   -5.20   3.20
> 24      33.22   26.30   43.39   -5.17   7.58
> 32      30.85   27.27   47.74   -.59    15.51
> 40      33.80   27.33   48.00   -7.42   7.59
> 48      45.93   26.33   45.46   -12.24  1.10
> 64      33.51   27.11   45.00   -3.27   10.30
> 80      39.28   29.21   52.33   -4.88   12.17
> 96      32.05   31.01   57.72   -1.02   19.05
> 128     35.66   32.04   60.00   -.66    20.41
> _____________________________________________________
> BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%
> 
> ____________________________________________________
> Guest->Host 512 byte (numtxqs=2):
> #       BW%     CPU%    RCPU%   SD%     RSD%
> _____________________________________________________
> 1       3.02    -3.84   -4.76   -12.50  -7.69
> 2       52.77   -15.73  -8.66   -45.31  -40.33
> 4       -23.14  13.84   7.50    50.58   40.81
> 8       -21.44  28.08   16.32   63.06   47.43
> 16      33.53   46.50   27.19   7.61    -6.60
> 24      55.77   42.81   30.49   -8.65   -16.48
> 32      52.59   38.92   29.08   -9.18   -15.63
> 40      50.92   36.11   28.92   -10.59  -15.30
> 48      46.63   34.73   28.17   -7.83   -12.32
> 64      45.56   37.12   28.81   -5.05   -10.80
> 80      44.55   36.60   28.45   -4.95   -10.61
> 96      43.02   35.97   28.89   -.11    -5.31
> 128     38.54   33.88   27.19   -4.79   -9.54
> _____________________________________________________
> BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%
> 
> 
> Thanks,
> 
> - KK
> 
> 
> 
> > [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > Following set of patches implement transmit MQ in virtio-net.  Also
> > included is the user qemu changes.  MQ is disabled by default unless
> > qemu specifies it.
> >
> >                   Changes from rev2:
> >                   ------------------
> > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> >    virtio-net and vhost-net.
> > 2. vi->sq[i] is allocated individually, resulting in cache line
> >    aligned sq[0] to sq[n].  Another option was to define
> >    'send_queue' as:
> >        struct send_queue {
> >                struct virtqueue *svq;
> >                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> >        } ____cacheline_aligned_in_smp;
> >    and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> >    the submitted method is preferable.
> > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> >    handles TX[0-n].
> > 4. Further change TX handling such that vhost[0] handles both RX/TX
> >    for single stream case.
> >
> >                   Enabling MQ on virtio:
> >                   -----------------------
> > When following options are passed to qemu:
> >         - smp > 1
> >         - vhost=on
> >         - mq=on (new option, default:off)
> > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> >         vhost=on                   ->   #txqueues = 1
> >         vhost=on,mq=on             ->   #txqueues = 4
> >         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> >         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> >
> >
> >                    Performance (guest -> local host):
> >                    -----------------------------------
> > System configuration:
> >         Host:  8 Intel Xeon, 8 GB memory
> >         Guest: 4 cpus, 2 GB memory
> > Test: Each test case runs for 60 secs, sum over three runs (except
> > when number of netperf sessions is 1, which has 10 runs of 12 secs
> > each).  No tuning (default netperf) other than taskset vhost's to
> > cpus 0-3.  numtxqs=32 gave the best results though the guest had
> > only 4 vcpus (I haven't tried beyond that).
> >
> > ______________ numtxqs=2, vhosts=3  ____________________
> > #sessions  BW%      CPU%    RCPU%    SD%      RSD%
> > ________________________________________________________
> > 1          4.46    -1.96     .19     -12.50   -6.06
> > 2          4.93    -1.16    2.10      0       -2.38
> > 4          46.17    64.77   33.72     19.51   -2.48
> > 8          47.89    70.00   36.23     41.46    13.35
> > 16         48.97    80.44   40.67     21.11   -5.46
> > 24         49.03    78.78   41.22     20.51   -4.78
> > 32         51.11    77.15   42.42     15.81   -6.87
> > 40         51.60    71.65   42.43     9.75    -8.94
> > 48         50.10    69.55   42.85     11.80   -5.81
> > 64         46.24    68.42   42.67     14.18   -3.28
> > 80         46.37    63.13   41.62     7.43    -6.73
> > 96         46.40    63.31   42.20     9.36    -4.78
> > 128        50.43    62.79   42.16     13.11   -1.23
> > ________________________________________________________
> > BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> >
> > ______________ numtxqs=8, vhosts=5  ____________________
> > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > ________________________________________________________
> > 1           -.76    -1.56     2.33      0        3.03
> > 2           17.41    11.11    11.41     0       -4.76
> > 4           42.12    55.11    30.20     19.51    .62
> > 8           54.69    80.00    39.22     24.39    -3.88
> > 16          54.77    81.62    40.89     20.34    -6.58
> > 24          54.66    79.68    41.57     15.49    -8.99
> > 32          54.92    76.82    41.79     17.59    -5.70
> > 40          51.79    68.56    40.53     15.31    -3.87
> > 48          51.72    66.40    40.84     9.72     -7.13
> > 64          51.11    63.94    41.10     5.93     -8.82
> > 80          46.51    59.50    39.80     9.33     -4.18
> > 96          47.72    57.75    39.84     4.20     -7.62
> > 128         54.35    58.95    40.66     3.24     -8.63
> > ________________________________________________________
> > BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
> >
> > ______________ numtxqs=16, vhosts=5  ___________________
> > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > ________________________________________________________
> > 1           -1.43    -3.52    1.55      0          3.03
> > 2           33.09     21.63   20.12    -10.00     -9.52
> > 4           67.17     94.60   44.28     19.51     -11.80
> > 8           75.72     108.14  49.15     25.00     -10.71
> > 16          80.34     101.77  52.94     25.93     -4.49
> > 24          70.84     93.12   43.62     27.63     -5.03
> > 32          69.01     94.16   47.33     29.68     -1.51
> > 40          58.56     63.47   25.91    -3.92      -25.85
> > 48          61.16     74.70   34.88     .89       -22.08
> > 64          54.37     69.09   26.80    -6.68      -30.04
> > 80          36.22     22.73   -2.97    -8.25      -27.23
> > 96          41.51     50.59   13.24     9.84      -16.77
> > 128         48.98     38.15   6.41     -.33       -22.80
> > ________________________________________________________
> > BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%
> >
> > ______________ numtxqs=32, vhosts=5  ___________________
> > #            BW%       CPU%    RCPU%    SD%     RSD%
> > ________________________________________________________
> > 1            7.62     -38.03   -26.26  -50.00   -33.33
> > 2            28.95     20.46    21.62   0       -7.14
> > 4            84.05     60.79    45.74  -2.43    -12.42
> > 8            86.43     79.57    50.32   15.85   -3.10
> > 16           88.63     99.48    58.17   9.47    -13.10
> > 24           74.65     80.87    41.99  -1.81    -22.89
> > 32           63.86     59.21    23.58  -18.13   -36.37
> > 40           64.79     60.53    22.23  -15.77   -35.84
> > 48           49.68     26.93    .51    -36.40   -49.61
> > 64           54.69     36.50    5.41   -26.59   -43.23
> > 80           45.06     12.72   -13.25  -37.79   -52.08
> > 96           40.21    -3.16    -24.53  -39.92   -52.97
> > 128          36.33    -33.19   -43.66  -5.68    -20.49
> > ________________________________________________________
> > BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%
> >
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-25 16:17   ` Michael S. Tsirkin
@ 2010-10-26  5:10     ` Krishna Kumar2
       [not found]     ` <OF5C53E9CF.FFDF2CE7-ON652577C8.00191D14-652577C8.001C2154@LocalDomain>
  1 sibling, 0 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-26  5:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/25/2010 09:47:18 PM:

> > Any feedback, comments, objections, issues or bugs about the
> > patches? Please let me know if something needs to be done.
>
> I am trying to wrap my head around kernel/user interface here.
> E.g., will we need another incompatible change when we add multiple RX
> queues?

Though I added a 'mq' option to qemu, there shouldn't be
any incompatibility between old and new qemu versions wrt vhost
and virtio-net drivers. So the old qemu will run new host
and new guest without issues, and new qemu can also run
old host and old guest. Multiple RXQ will also not add
any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea
from David Stevens).  The idea is: Guest sends out packets
on, say TXQ#2, vhost#2 processes the packets but packets
going out from host to guest might be sent out on a
different RXQ, say RXQ#4.  Guest receives the packet on
RXQ#4, and all future responses on that connection are sent
on TXQ#4.  Now vhost#4 processes both RX and TX packets for
this connection.  Without needing to hash on the connection,
guest can make sure that the same vhost thread will handle
a single connection.
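
A rough sketch of that idea (hypothetical names and table size, not
the code in these patches): the guest records the RX queue a flow was
last seen on, and sends further packets of that flow on the TX queue
with the same index, so one vhost thread ends up handling both
directions of the connection.

    #include <stdint.h>

    #define NUM_QUEUES    8
    #define FLOW_TABLE    1024

    /* Last RX queue index seen for each flow hash bucket. */
    static uint16_t flow_to_queue[FLOW_TABLE];

    /* On receive: remember which queue this flow arrived on. */
    static void note_rx_queue(uint32_t flow_hash, uint16_t rxq)
    {
        flow_to_queue[flow_hash % FLOW_TABLE] = rxq;
    }

    /* On transmit: follow the recorded RX queue.  Flows not yet
     * seen on RX simply fall back to queue 0 in this sketch. */
    static uint16_t pick_tx_queue(uint32_t flow_hash)
    {
        return flow_to_queue[flow_hash % FLOW_TABLE] % NUM_QUEUES;
    }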

> Also need to think about how robust our single stream heuristic is,
> e.g. what are the chances it will misdetect a bidirectional
> UDP stream as a single TCP?

I think it should not happen. The heuristic code gets
called for handling just the transmit packets, packets
that vhost sends out to the guest skip this path.

I tested unidirectional and bidirectional UDP to confirm:

8 iterations of iperf tests, each iteration of 15 secs,
result is the sum of all 8 iterations in Gbits/sec
__________________________________________
Uni-directional          Bi-directional
  Org      New             Org      New
__________________________________________
  71.78    71.77           71.74   72.07
__________________________________________

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-25 15:50 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar2
  2010-10-25 16:17   ` Michael S. Tsirkin
@ 2010-10-26  8:57   ` Michael S. Tsirkin
  2010-11-09  4:38     ` Krishna Kumar2
  1 sibling, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-26  8:57 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> 
> Any feedback, comments, objections, issues or bugs about the
> patches? Please let me know if something needs to be done.
> 
> Some more test results:
> _____________________________________________________
>          Host->Guest BW (numtxqs=2)
> #       BW%     CPU%    RCPU%   SD%     RSD%
> _____________________________________________________

I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming a MQ NIC card.

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]     ` <OF5C53E9CF.FFDF2CE7-ON652577C8.00191D14-652577C8.001C2154@LocalDomain>
@ 2010-10-26  9:08       ` Krishna Kumar2
  2010-10-26  9:38         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-26  9:08 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, Michael S. Tsirkin,
	netdev, rusty

Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM:

> > I am trying to wrap my head around kernel/user interface here.
> > E.g., will we need another incompatible change when we add multiple RX
> > queues?
>
> Though I added a 'mq' option to qemu, there shouldn't be
> any incompatibility between old and new qemu versions wrt vhost
> and virtio-net drivers. So the old qemu will run new host
> and new guest without issues, and new qemu can also run
> old host and old guest. Multiple RXQ will also not add
> any incompatibility.
>
> With MQ RX, I will be able to remove the heuristic (idea
> from David Stevens).  The idea is: Guest sends out packets
> on, say TXQ#2, vhost#2 processes the packets but packets
> going out from host to guest might be sent out on a
> different RXQ, say RXQ#4.  Guest receives the packet on
> RXQ#4, and all future responses on that connection are sent
> on TXQ#4.  Now vhost#4 processes both RX and TX packets for
> this connection.  Without needing to hash on the connection,
> guest can make sure that the same vhost thread will handle
> a single connection.
>
> > Also need to think about how robust our single stream heuristic is,
> > e.g. what are the chances it will misdetect a bidirectional
> > UDP stream as a single TCP?

> I think it should not happen. The heuristic code gets
> called for handling just the transmit packets, packets
> that vhost sends out to the guest skip this path.
>
> I tested unidirectional and bidirectional UDP to confirm:
>
> 8 iterations of iperf tests, each iteration of 15 secs,
> result is the sum of all 8 iterations in Gbits/sec
> __________________________________________
> Uni-directional          Bi-directional
>   Org      New             Org      New
> __________________________________________
>   71.78    71.77           71.74   72.07
> __________________________________________


Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):

------ numtxqs=8, vhosts=5 ---------
#     BW%    CPU%    SD%
------------------------------------
1     .49    1.07     0
2    23.51   52.51    26.66
4    75.17   72.43    8.57
8    86.54   80.21    27.85
16   92.37   85.99    6.27
24   91.37   84.91    8.41
32   89.78   82.90    3.31
48   89.85   79.95   -3.57
64   85.83   80.28    2.22
80   88.90   79.47   -23.18
96   90.12   79.98    14.71
128  86.13   80.60    4.42
------------------------------------
BW: 71.3%, CPU: 80.4%, SD: 1.2%


------ numtxqs=16, vhosts=5 --------
#    BW%      CPU%     SD%
------------------------------------
1    1.80     0        0
2    19.81    50.68    26.66
4    57.31    52.77    8.57
8    108.44   88.19   -5.21
16   106.09   85.03   -4.44
24   102.34   84.23   -.82
32   102.77   82.71   -5.81
48   100.00   79.62   -7.29
64   96.86    79.75   -6.10
80   99.26    79.82   -27.34
96   94.79    80.02   -5.08
128  98.14    81.15   -15.25
------------------------------------
BW: 77.9%,  CPU: 80.4%,  SD: -13.6%

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-26  9:08       ` Krishna Kumar2
@ 2010-10-26  9:38         ` Michael S. Tsirkin
  2010-10-26 10:01           ` Krishna Kumar2
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-26  9:38 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> Results for UDP BW tests (unidirectional, sum across
> 3 iterations, each iteration of 45 seconds, default
> netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-26  9:38         ` Michael S. Tsirkin
@ 2010-10-26 10:01           ` Krishna Kumar2
  2010-10-26 11:09             ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-26 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty

> "Michael S. Tsirkin" <mst@redhat.com>
>
> On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> > Results for UDP BW tests (unidirectional, sum across
> > 3 iterations, each iteration of 45 seconds, default
> > netperf, vhosts bound to cpus 0-3; no other tuning):
>
> Is binding vhost threads to CPUs really required?
> What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding. I started binding vhost threads
after Avi suggested it in response to my v1 patch (he
suggested some more that I haven't done), and have been
doing only this tuning ever since. This is part of his
mail for the tuning:

> 		 vhost:
> 		 		 thread #0:  CPU0
> 		 		 thread #1:  CPU1
> 		 		 thread #2:  CPU2
> 		 		 thread #3:  CPU3

I simply bound each thread to CPU0-3 instead.
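
In C terms, that binding amounts to something like the helper below
(a hypothetical sketch, not part of the patches; it does for a given
tid what taskset does from the command line):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Pin a task (e.g. a vhost worker thread's tid) to a single CPU,
     * much like "taskset -p -c <cpu> <tid>". */
    static int bind_task_to_cpu(pid_t tid, int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);

        if (sched_setaffinity(tid, sizeof(set), &set) < 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }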

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-26 10:01           ` Krishna Kumar2
@ 2010-10-26 11:09             ` Michael S. Tsirkin
  2010-10-28  5:14               ` Krishna Kumar2
       [not found]               ` <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>
  0 siblings, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-26 11:09 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty

On Tue, Oct 26, 2010 at 03:31:39PM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com>
> >
> > On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> > > Results for UDP BW tests (unidirectional, sum across
> > > 3 iterations, each iteration of 45 seconds, default
> > > netperf, vhosts bound to cpus 0-3; no other tuning):
> >
> > Is binding vhost threads to CPUs really required?
> > What happens if we let the scheduler do its job?
> 
> Nothing drastic, I remember BW% and SD% both improved a
> bit as a result of binding.

If there's a significant improvement this would mean that
we need to rethink the vhost-net interaction with the scheduler.

> I started binding vhost threads
> after Avi suggested it in response to my v1 patch (he
> suggested some more that I haven't done), and have been
> doing only this tuning ever since. This is part of his
> mail for the tuning:
> 
> > 		 vhost:
> > 		 		 thread #0:  CPU0
> > 		 		 thread #1:  CPU1
> > 		 		 thread #2:  CPU2
> > 		 		 thread #3:  CPU3
> 
> I simply bound each thread to CPU0-3 instead.
> 
> Thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-26 11:09             ` Michael S. Tsirkin
@ 2010-10-28  5:14               ` Krishna Kumar2
  2010-10-28  5:50                 ` Michael S. Tsirkin
       [not found]               ` <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>
  1 sibling, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-28  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

> I think we discussed the need for external to guest testing
> over 10G. For large messages we should not see any change
> but you should be able to get better numbers for small messages
> assuming a MQ NIC card.

For external host, there is a contention among different
queues (vhosts) when packets are processed in tun/bridge,
unless I implement MQ TX for macvtap (tun/bridge?).  So
my testing shows a small improvement (1 to 1.5% average)
in BW and a rise in SD (between 10-15%).  For remote host,
I think tun/macvtap needs MQ TX support?

> > > > Results for UDP BW tests (unidirectional, sum across
> > > > 3 iterations, each iteration of 45 seconds, default
> > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > >
> > > Is binding vhost threads to CPUs really required?
> > > What happens if we let the scheduler do its job?
> >
> > Nothing drastic, I remember BW% and SD% both improved a
> > bit as a result of binding.
>
> If there's a significant improvement this would mean that
> we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the
results later today.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-28  5:14               ` Krishna Kumar2
@ 2010-10-28  5:50                 ` Michael S. Tsirkin
  2010-10-28  6:12                   ` Krishna Kumar2
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-28  5:50 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 04:39:13 PM:
> 
> (merging two posts into one)
> 
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> For external host, there is a contention among different
> queues (vhosts) when packets are processed in tun/bridge,
> unless I implement MQ TX for macvtap (tun/bridge?).  So
> my testing shows a small improvement (1 to 1.5% average)
> in BW and a rise in SD (between 10-15%).  For remote host,
> I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap?
bridge does not do any queueing AFAIK ...
I think we need to fix the contention. With migration what was guest to
host a minute ago might become guest to external now ...

> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
> 
> I will get a test run with and without binding and post the
> results later today.
> 
> Thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-28  5:50                 ` Michael S. Tsirkin
@ 2010-10-28  6:12                   ` Krishna Kumar2
  2010-10-28  6:18                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-28  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty

> "Michael S. Tsirkin" <mst@redhat.com>

> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > For external host, there is a contention among different
> > queues (vhosts) when packets are processed in tun/bridge,
> > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > my testing shows a small improvement (1 to 1.5% average)
> > in BW and a rise in SD (between 10-15%).  For remote host,
> > I think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap?
> bridge does not do any queueing AFAIK ...
> I think we need to fix the contention. With migration what was guest to
> host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?
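
As a rough illustration of what that selection amounts to for a
multiqueue device (a simplified sketch, not the actual dev_queue_xmit
path), the flow hash is scaled onto one of the device's real TX
queues:

    #include <stdint.h>

    /* Map a flow hash onto one of 'num_tx_queues' queues, in the
     * same spirit as the kernel's per-flow TX queue selection. */
    static uint16_t select_txq(uint32_t flow_hash, uint16_t num_tx_queues)
    {
        return (uint16_t)(((uint64_t)flow_hash * num_tx_queues) >> 32);
    }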

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-28  6:12                   ` Krishna Kumar2
@ 2010-10-28  6:18                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-28  6:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty

On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com>
> 
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > For external host, there is a contention among different
> > > queues (vhosts) when packets are processed in tun/bridge,
> > > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > > my testing shows a small improvement (1 to 1.5% average)
> > > in BW and a rise in SD (between 10-15%).  For remote host,
> > > I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap?
> > bridge does not do any queueing AFAIK ...
> > I think we need to fix the contention. With migration what was guest to
> > host a minute ago might become guest to external now ...
> 
> Macvtap RX is MQ but not TX. I don't think MQ TX support is
> required for macvtap, though. Is it enough for existing
> macvtap sendmsg to work, since it calls dev_queue_xmit
> which selects the txq for the outgoing device?
> 
> Thanks,
> 
> - KK

I think there would be an issue with using a single poll notifier and
contention on send buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]               ` <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>
@ 2010-10-28  7:18                 ` Krishna Kumar2
  2010-10-29 11:26                   ` Michael S. Tsirkin
  2010-11-03  7:01                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2010-10-28  7:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, Michael S. Tsirkin,
	netdev, rusty

> Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
>
> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
>
> I will get a test run with and without binding and post the
> results later today.

Correction: The result with binding is much better for
SD/CPU compared to without-binding:

_____________________________________________________
     numtxqs=8,vhosts=5, Bind vs No-bind
#     BW%     CPU%     RCPU%     SD%       RSD%
_____________________________________________________
1     11.25     10.77    1.89     0        -6.06
2     18.66     7.20     7.20    -14.28    -7.40
4     4.24     -1.27     1.56    -2.70     -.98
8     14.91    -3.79     5.46    -12.19    -3.76
16    12.32    -8.67     4.63    -35.97    -26.66
24    11.68    -7.83     5.10    -40.73    -32.37
32    13.09    -10.51    6.57    -51.52    -42.28
40    11.04    -4.12     11.23   -50.69    -42.81
48    8.61     -10.30    6.04    -62.38    -55.54
64    7.55     -6.05     6.41    -61.20    -56.04
80    8.74     -11.45    6.29    -72.65    -67.17
96    9.84     -6.01     9.87    -69.89    -64.78
128   5.57     -6.23     8.99    -75.03    -70.97
_____________________________________________________
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
    1.  All my earlier test results were with vhost bound
        to cpus 0-3 for both the org and new kernels.
    2.  I am not using MST's use_mq patch, only the mainline
        kernel. However, I reported earlier that I got
        better results with that patch. The result for
        MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-28  7:18                 ` Krishna Kumar2
@ 2010-10-29 11:26                   ` Michael S. Tsirkin
  2010-10-29 14:57                     ` linux_kvm
  2010-11-03  7:01                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-10-29 11:26 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> >
> > > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > > >
> > > > > Is binding vhost threads to CPUs really required?
> > > > > What happens if we let the scheduler do its job?
> > > >
> > > > Nothing drastic, I remember BW% and SD% both improved a
> > > > bit as a result of binding.
> > >
> > > If there's a significant improvement this would mean that
> > > we need to rethink the vhost-net interaction with the scheduler.
> >
> > I will get a test run with and without binding and post the
> > results later today.
> 
> Correction: The result with binding is much better for
> SD/CPU compared to without-binding:

Can you please try finding out why that is?  Is some thread bouncing between
CPUs?  Does a wrong numa node get picked up?
In practice users are very unlikely to pin threads to CPUs.

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-29 11:26                   ` Michael S. Tsirkin
@ 2010-10-29 14:57                     ` linux_kvm
  0 siblings, 0 replies; 35+ messages in thread
From: linux_kvm @ 2010-10-29 14:57 UTC (permalink / raw)
  To: Michael S. Tsirkin

On Fri, 29 Oct 2010 13:26 +0200, "Michael S. Tsirkin" <mst@redhat.com>
wrote:
> On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> In practice users are very unlikely to pin threads to CPUs.

I may be misunderstanding what you're referring to. It caught my
attention since I'm working on a configuration to do what you say is
unlikely, so I'll chime in for what it's worth.

An option in Vyatta allows assigning CPU affinity to network adapters,
since apparently separate L2 caches can have a significant impact on
throughput.

Although much of their focus seems to be on commercial virtualization
platforms, I do see quite a few forum posts with regard to KVM.
Maybe this still qualifies as an edge case, but as for virtualized
routing theirs seems to offer the most functionality.

http://www.vyatta.org/forum/viewtopic.php?t=2697

-cb

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-28  7:18                 ` Krishna Kumar2
  2010-10-29 11:26                   ` Michael S. Tsirkin
@ 2010-11-03  7:01                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-03  7:01 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> >
> > > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > > >
> > > > > Is binding vhost threads to CPUs really required?
> > > > > What happens if we let the scheduler do its job?
> > > >
> > > > Nothing drastic, I remember BW% and SD% both improved a
> > > > bit as a result of binding.
> > >
> > > If there's a significant improvement this would mean that
> > > we need to rethink the vhost-net interaction with the scheduler.
> >
> > I will get a test run with and without binding and post the
> > results later today.
> 
> Correction: The result with binding is much better for
> SD/CPU compared to without-binding:

Something that was suggested to me off-list is
trying to set SMP affinity for the NIC: in the host-to-guest
case probably virtio-net, and for external-to-guest
the host NIC as well.
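
For completeness, setting a NIC interrupt's SMP affinity comes down to
writing a CPU mask into procfs.  A minimal sketch, illustrative only;
the IRQ number is a placeholder that has to be looked up in
/proc/interrupts for the actual device:

    #include <stdio.h>

    /* Illustrative only: steer one NIC interrupt to the CPUs in
     * cpu_mask by writing the mask to /proc/irq/<irq>/smp_affinity.
     */
    static int set_irq_affinity(int irq, unsigned int cpu_mask)
    {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%x\n", cpu_mask);
            fclose(f);
            return 0;
    }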

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-26  8:57   ` Michael S. Tsirkin
@ 2010-11-09  4:38     ` Krishna Kumar2
  2010-11-09 13:22       ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-11-09  4:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 02:27:09 PM:

> Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> >
> > Any feedback, comments, objections, issues or bugs about the
> > patches? Please let me know if something needs to be done.
> >
> > Some more test results:
> > _____________________________________________________
> >          Host->Guest BW (numtxqs=2)
> > #       BW%     CPU%    RCPU%   SD%     RSD%
> > _____________________________________________________
>
> I think we discussed the need for external to guest testing
> over 10G. For large messages we should not see any change
> but you should be able to get better numbers for small messages
> assuming a MQ NIC card.

I had to make a few changes to qemu (and a minor change in the macvtap
driver) to get multiple TXQ support using macvtap working. The NIC
is an ixgbe card.

__________________________________________________________________________
            Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
#      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
__________________________________________________________________________
1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495 (-10.4)
48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538 (-9.5)
64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403 (-5.1)
80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419 (-8.7)
96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463 (-8.8)
128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337 (-14.6)
__________________________________________________________________________
Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)

__________________________________________________________________________
            Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
#      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
__________________________________________________________________________
1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403 (-10.9)
64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824 (-15.4)
96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131 (-11.4)
128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873 (-18.8)
__________________________________________________________________________
Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)

Is there anything else you would like me to test/change, or shall
I submit the next version (with the above macvtap changes)?

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-11-09  4:38     ` Krishna Kumar2
@ 2010-11-09 13:22       ` Michael S. Tsirkin
  2010-11-09 15:28         ` Krishna Kumar2
       [not found]         ` <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@LocalDomain>
  0 siblings, 2 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-09 13:22 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Tue, Nov 09, 2010 at 10:08:21AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 02:27:09 PM:
> 
> > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > >
> > > Any feedback, comments, objections, issues or bugs about the
> > > patches? Please let me know if something needs to be done.
> > >
> > > Some more test results:
> > > _____________________________________________________
> > >          Host->Guest BW (numtxqs=2)
> > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > _____________________________________________________
> >
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> I had to make a few changes to qemu (and a minor change in macvtap
> driver) to get multiple TXQ support using macvtap working. The NIC
> is a ixgbe card.
> 
> __________________________________________________________________________
>             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> __________________________________________________________________________
> 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495 (-10.4)
> 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538 (-9.5)
> 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403 (-5.1)
> 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419 (-8.7)
> 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463 (-8.8)
> 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337 (-14.6)
> __________________________________________________________________________
> Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> 
> __________________________________________________________________________
>             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> __________________________________________________________________________
> 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403 (-10.9)
> 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824 (-15.4)
> 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131 (-11.4)
> 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873 (-18.8)
> __________________________________________________________________________
> Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> 
> Is there anything else you would like me to test/change, or shall
> I submit the next version (with the above macvtap changes)?
> 
> Thanks,
> 
> - KK

Something strange here, right?
1. You are consistently getting >10G/s here, and even with a single stream?
2. With 2 streams is where we originally get < 10G/s.  Instead of
   doubling that, we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.

Is your card MQ?

-- 
MST

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-11-09 13:22       ` Michael S. Tsirkin
@ 2010-11-09 15:28         ` Krishna Kumar2
  2010-11-09 15:33           ` Michael S. Tsirkin
       [not found]         ` <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@LocalDomain>
  1 sibling, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-11-09 15:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:

> > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > >
> > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > >
> > > > Any feedback, comments, objections, issues or bugs about the
> > > > patches? Please let me know if something needs to be done.
> > > >
> > > > Some more test results:
> > > > _____________________________________________________
> > > >          Host->Guest BW (numtxqs=2)
> > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > _____________________________________________________
> > >
> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > I had to make a few changes to qemu (and a minor change in macvtap
> > driver) to get multiple TXQ support using macvtap working. The NIC
> > is a ixgbe card.
> >
> > __________________________________________________________________________
> >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> > __________________________________________________________________________
> > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495 (-10.4)
> > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538 (-9.5)
> > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403 (-5.1)
> > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419 (-8.7)
> > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463 (-8.8)
> > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337 (-14.6)
> > __________________________________________________________________________
> > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> >
> > __________________________________________________________________________
> >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> > __________________________________________________________________________
> > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403 (-10.9)
> > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824 (-15.4)
> > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131 (-11.4)
> > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873 (-18.8)
> > __________________________________________________________________________
> > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> >
> > Is there anything else you would like me to test/change, or shall
> > I submit the next version (with the above macvtap changes)?
> >
> > Thanks,
> >
> > - KK
>
> Something strange here, right?
> 1. You are consistently getting >10G/s here, and even with a single stream?

Sorry, I should have mentioned this, though I had stated it in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1, for which I do 10 iterations
(sum across the 10 iterations).  I started doing many more iterations
for 1 netperf after finding the issue earlier with a single stream.
So the BW is only 4.5-7 Gbps.

> 2. With 2 streams, is where we get < 10G/s originally. Instead of
>    doubling that we get a marginal improvement with 2 queues and
>    about 30% worse with 1 queue.

(Doubling happens consistently for guest -> host, but never for
remote host.)  I tried 512/txqs=2 and 1024/txqs=8 to get varied
testing scenarios. In the first case, there is a slight improvement in
BW and a good reduction in SD. In the second case, only SD improves
(though BW drops for 2 streams for some reason).  In both cases,
BW and SD improve as the number of sessions increases.

> Is your card MQ?

Yes, the card is MQ. ixgbe 10g card.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-11-09 15:28         ` Krishna Kumar2
@ 2010-11-09 15:33           ` Michael S. Tsirkin
  2010-11-09 17:24             ` Krishna Kumar2
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-09 15:33 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:
> 
> > > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > > >
> > > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > > >
> > > > > Any feedback, comments, objections, issues or bugs about the
> > > > > patches? Please let me know if something needs to be done.
> > > > >
> > > > > Some more test results:
> > > > > _____________________________________________________
> > > > >          Host->Guest BW (numtxqs=2)
> > > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > > _____________________________________________________
> > > >
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > I had to make a few changes to qemu (and a minor change in macvtap
> > > driver) to get multiple TXQ support using macvtap working. The NIC
> > > is a ixgbe card.
> > >
> > > __________________________________________________________________________
> > >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> > > __________________________________________________________________________
> > > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495 (-10.4)
> > > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538 (-9.5)
> > > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403 (-5.1)
> > > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419 (-8.7)
> > > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463 (-8.8)
> > > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337 (-14.6)
> > > __________________________________________________________________________
> > > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> > >
> > > __________________________________________________________________________
> > >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> > > __________________________________________________________________________
> > > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403 (-10.9)
> > > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824 (-15.4)
> > > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131 (-11.4)
> > > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873 (-18.8)
> > > __________________________________________________________________________
> > > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> > >
> > > Is there anything else you would like me to test/change, or shall
> > > I submit the next version (with the above macvtap changes)?
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > Something strange here, right?
> > 1. You are consistently getting >10G/s here, and even with a single stream?
> 
> Sorry, I should have mentioned this though I had stated in my
> earlier mails. Each test result has two iterations, each of 60
> seconds, except when #netperfs is 1 for which I do 10 iteration
> (sum across 10 iterations).

So need to divide the number by 10?

>  I started doing many more iterations
> for 1 netperf after finding the issue earlier with single stream.
> So the BW is only 4.5-7 Gbps.
> 
> > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> >    doubling that we get a marginal improvement with 2 queues and
> >    about 30% worse with 1 queue.
> 
> (doubling happens consistently for guest -> host, but never for
> remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> testing scenario. In first case, there is a slight improvement in
> BW and good reduction in SD. In the second case, only SD improves
> (though BW drops for 2 stream for some reason).  In both cases,
> BW and SD improves as the number of sessions increase.

I guess this is another indication that something's wrong.
We are quite far from line rate, the fact BW does not scale
means there's some contention in the code.

> > Is your card MQ?
> 
> Yes, the card is MQ. ixgbe 10g card.
> 
> Thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-11-09 15:33           ` Michael S. Tsirkin
@ 2010-11-09 17:24             ` Krishna Kumar2
  2010-11-10 16:16               ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2010-11-09 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:

> > > Something strange here, right?
> > > 1. You are consistently getting >10G/s here, and even with a single stream?
> >
> > Sorry, I should have mentioned this though I had stated in my
> > earlier mails. Each test result has two iterations, each of 60
> > seconds, except when #netperfs is 1 for which I do 10 iteration
> > (sum across 10 iterations).
>
> So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

> >  I started doing many more iterations
> > for 1 netperf after finding the issue earlier with single stream.
> > So the BW is only 4.5-7 Gbps.
> >
> > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > >    doubling that we get a marginal improvement with 2 queues and
> > >    about 30% worse with 1 queue.
> >
> > (doubling happens consistently for guest -> host, but never for
> > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > testing scenario. In first case, there is a slight improvement in
> > BW and good reduction in SD. In the second case, only SD improves
> > (though BW drops for 2 stream for some reason).  In both cases,
> > BW and SD improves as the number of sessions increase.
>
> I guess this is another indication that something's wrong.

The patch (both virtio-net and vhost-net) doesn't add any
locking, mutexes or other synchronization. The guest -> host
performance improvement of up to 100% shows the patch is not
doing anything wrong.

> We are quite far from line rate, the fact BW does not scale
> means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue,
unrelated to my patch specifically. IMHO, if review finds nothing
wrong in the code and it is accepted, that will help, since
others can then also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest->remote-host.

PS: bare-metal performance for host->remote-host with the same card
    is also 2.7 Gbps and 2.8 Gbps for 512/1024 byte I/O.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-11-09 17:24             ` Krishna Kumar2
@ 2010-11-10 16:16               ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-10 16:16 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty

On Tue, Nov 09, 2010 at 10:54:57PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 09:03:25 PM:
> 
> > > > Something strange here, right?
> > > > 1. You are consistently getting >10G/s here, and even with a single stream?
> > >
> > > Sorry, I should have mentioned this though I had stated in my
> > > earlier mails. Each test result has two iterations, each of 60
> > > seconds, except when #netperfs is 1 for which I do 10 iteration
> > > (sum across 10 iterations).
> >
> > So need to divide the number by 10?
> 
> Yes, that is what I get with 512/1K macvtap I/O size :)
> 
> > >  I started doing many more iterations
> > > for 1 netperf after finding the issue earlier with single stream.
> > > So the BW is only 4.5-7 Gbps.
> > >
> > > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > > >    doubling that we get a marginal improvement with 2 queues and
> > > >    about 30% worse with 1 queue.
> > >
> > > (doubling happens consistently for guest -> host, but never for
> > > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > > testing scenario. In first case, there is a slight improvement in
> > > BW and good reduction in SD. In the second case, only SD improves
> > > (though BW drops for 2 stream for some reason).  In both cases,
> > > BW and SD improves as the number of sessions increase.
> >
> > I guess this is another indication that something's wrong.
> 
> The patch - both virtio-net and vhost-net, doesn't have any
> locking/mutex's/ or any synchronization method. Guest -> host
> performance improvement of upto 100% shows the patch is not
> doing anything wrong.

My concern is this: we don't seem to do anything in tap or macvtap to
help packets from separate virtio queues get to separate queues in the
hardware device and to avoid reordering when we do this.

- the skb_tx_hash calculation will get different results
- the hash math that e.g. TCP does runs in the guest and seems to be discarded

etc.

Maybe it's as simple as some tap/macvtap ioctls to set up the queue number
in skbs. Or maybe we need to pass the skb hash from guest to host.
It's this last option that should make us especially cautious, as it'll
affect the guest/host interface.

Also see d5a9e24afb4ab38110ebb777588ea0bd0eacbd0a: if we have
hardware which records an RX queue, it appears important to
pass that info to the guest and to use it in selecting the TX queue.
Of course we won't see this in netperf runs, but this needs to
be given thought too: supporting it seems to suggest either
sticking the hash in the virtio net header for both tx and rx,
or using multiple RX queues.
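
To make the direction concrete, here is a minimal, purely illustrative
sketch (not taken from the posted patches) of the "record the queue in
the skb" idea, reusing the mechanism that d5a9e24 added for hardware RX
queues; the function name and the guest_txq parameter are invented for
the example:

    #include <linux/skbuff.h>

    /* Illustrative only: when the host side (tap/macvtap/vhost) pulls a
     * packet off a particular guest TX queue, record that queue index in
     * the skb.  skb_tx_hash() honours a recorded queue, so all packets
     * of a flow keep mapping to the same host TX queue instead of being
     * spread by a recomputed hash that differs from the guest's.
     */
    static void tag_skb_with_guest_queue(struct sk_buff *skb, u16 guest_txq)
    {
            skb_record_rx_queue(skb, guest_txq);
    }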

> > We are quite far from line rate, the fact BW does not scale
> > means there's some contention in the code.
> 
> Attaining line speed with macvtap seems to be a generic issue
> and unrelated to my patch specifically. IMHO if there is nothing
> wrong in the code (review) and is accepted, it will benefit as
> others can also help to find what needs to be implemented in
> vhost/macvtap/qemu to get line speed for guest->remote-host.

No problem, I will queue these patches in some branch
to help enable cooperation, as well as help you
iterate with incremental patches instead of resending it all each time.


> PS: bare-metal performance for host->remote-host is also
>     2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.
> 
> Thanks,

You mean native Linux BW does not scale for your host with
the # of connections either? I guess this just means we need another
setup for testing?

> - KK

^ permalink raw reply	[flat|nested] 35+ messages in thread

* MQ performance on other cards (cxgb3)
       [not found]         ` <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@LocalDomain>
@ 2010-11-16  7:25           ` Krishna Kumar2
  0 siblings, 0 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2010-11-16  7:25 UTC (permalink / raw)
  To: davem, Divy Le Ray
  Cc: anthony, arnd, avi, eric.dumazet, kvm, Michael S. Tsirkin, netdev, rusty

I had sent this mail to Michael last week - he agrees that I should
share this information on the list:

On latest net-next-2.6, virtio-net (guest->host) results are:
______________________________________________________________
                         SQ vs MQ (#txqs=8)
#      BW1      BW2 (%)          CPU1     CPU2 (%)   RCPU1   RCPU2 (%)
_______________________________________________________________
1      105774  112256 (6.1)   257      255 (-.7)     532     549 (3.1)
2      20842   30674 (47.1)   107      150 (40.1)    208     279 (34.1)
4      22500   31953 (42.0)   241      409 (69.7)    467     619 (32.5)
8      22416   44507 (98.5)   477      1039 (117.8)  960     1459 (51.9)
16     22605   45372 (100.7)  905      2060 (127.6)  1895    2962 (56.3)
24     23192   44201 (90.5)   1360     3028 (122.6)  2833    4437 (56.6)
32     23158   43394 (87.3)   1811     3957 (118.4)  3770    5936 (57.4)
40     23322   42550 (82.4)   2276     4986 (119.0)  4711    7417 (57.4)
48     23564   41931 (77.9)   2757     5966 (116.3)  5653    8896 (57.3)
64     23949   41092 (71.5)   3788     7898 (108.5)  7609    11826 (55.4)
80     23256   41343 (77.7)   4597     9887 (115.0)  9503    14801 (55.7)
96     23310   40645 (74.3)   5588     11758 (110.4) 11381   17761 (56.0)
128    24095   41082 (70.5)   7587     15574 (105.2) 15029   23716 (57.8)
______________________________________________________________
Avg:      BW: (58.3)      CPU: (110.8)      RCPU: (55.9)

It's true that the average CPU% increase on the guest is almost double
the BW improvement. But I don't think this is due to the patch (the
driver does no synchronization, etc). To compare MQ vs SQ on a 10G
card, I ran the same test from host to remote host across cxgb3. The
results are somewhat similar:

(I changed cxgb_open on the client system to:
	netif_set_real_num_tx_queues(dev, 1);
	err = netif_set_real_num_rx_queues(dev, 1);
to simulate single queue (SQ))
_____________________________________________________
            cxgb3 SQ vs cxgb3 MQ
#      BW1     BW2 (%)        CPU1    CPU2 (%)
_____________________________________________________
1      8301    8315  (.1)     5       4.66 (-6.6)
2      9395    9380  (-.1)    16      16   (0)
4      9411    9414  (0)      33      26   (-21.2)
8      9411    9398  (-.1)    60      62   (3.3)
16     9412    9413  (0)      116     117  (.8)
24     9442    9963  (5.5)    179     198  (10.6)
32     10031   10025 (0)      230     249  (8.2)
40     9953    10024 (.7)     300     312  (4.0)
48     10002   10015 (.1)     351     376  (7.1)
64     10022   10024 (0)      494     515  (4.2)
80     8894    10011 (12.5)   537     630  (17.3)
96     8465    9907  (17.0)   612     749  (22.3)
128    7541    9617  (27.5)   760     989  (30.1)
_____________________________________________________
Avg:      BW: (3.8)      CPU: (14.8)

(Each case runs once for 60 secs)

The BW increased modestly but CPU increased much more. I assume
the change I made above to convert the driver from MQ to SQ is not
incorrect.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-20  8:54 [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (4 preceding siblings ...)
  2010-10-25 15:50 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar2
@ 2011-02-22  7:47 ` Simon Horman
  2011-02-23  5:22   ` Krishna Kumar2
  5 siblings, 1 reply; 35+ messages in thread
From: Simon Horman @ 2011-02-22  7:47 UTC (permalink / raw)
  To: Krishna Kumar
  Cc: rusty, davem, mst, arnd, eric.dumazet, netdev, avi, anthony, kvm

On Wed, Oct 20, 2010 at 02:24:52PM +0530, Krishna Kumar wrote:
> Following set of patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes.  MQ is disabled by default unless
> qemu specifies it.

Hi Krishna,

I have a few questions about the results below:

1. Are the (%) comparisons between non-mq and mq virtio?
2. Was UDP or TCP used?
3. What was the transmit size (-m option to netperf)?

Also, I'm interested to know what the status of these patches is.
Are you planing a fresh series?

> 
>                   Changes from rev2:
>                   ------------------
> 1. Define (in virtio_net.h) the maximum send txqs; and use in
>    virtio-net and vhost-net.
> 2. vi->sq[i] is allocated individually, resulting in cache line
>    aligned sq[0] to sq[n].  Another option was to define
>    'send_queue' as:
>        struct send_queue {
>                struct virtqueue *svq;
>                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>        } ____cacheline_aligned_in_smp;
>    and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
>    the submitted method is preferable.
> 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
>    handles TX[0-n].
> 4. Further change TX handling such that vhost[0] handles both RX/TX
>    for single stream case.
> 
>                   Enabling MQ on virtio:
>                   -----------------------
> When following options are passed to qemu:
>         - smp > 1
>         - vhost=on
>         - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using an
> optional 'numtxqs' option.  e.g. for a smp=4 guest:
>         vhost=on                   ->   #txqueues = 1
>         vhost=on,mq=on             ->   #txqueues = 4
>         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
>         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> 
> 
>                    Performance (guest -> local host):
>                    -----------------------------------
> System configuration:
>         Host:  8 Intel Xeon, 8 GB memory
>         Guest: 4 cpus, 2 GB memory
> Test: Each test case runs for 60 secs, sum over three runs (except
> when number of netperf sessions is 1, which has 10 runs of 12 secs
> each).  No tuning (default netperf) other than taskset vhost's to
> cpus 0-3.  numtxqs=32 gave the best results though the guest had
> only 4 vcpus (I haven't tried beyond that).
> 
> ______________ numtxqs=2, vhosts=3  ____________________
> #sessions  BW%      CPU%    RCPU%    SD%      RSD%
> ________________________________________________________
> 1          4.46    -1.96     .19     -12.50   -6.06
> 2          4.93    -1.16    2.10      0       -2.38
> 4          46.17    64.77   33.72     19.51   -2.48
> 8          47.89    70.00   36.23     41.46    13.35
> 16         48.97    80.44   40.67     21.11   -5.46
> 24         49.03    78.78   41.22     20.51   -4.78
> 32         51.11    77.15   42.42     15.81   -6.87
> 40         51.60    71.65   42.43     9.75    -8.94
> 48         50.10    69.55   42.85     11.80   -5.81
> 64         46.24    68.42   42.67     14.18   -3.28
> 80         46.37    63.13   41.62     7.43    -6.73
> 96         46.40    63.31   42.20     9.36    -4.78
> 128        50.43    62.79   42.16     13.11   -1.23
> ________________________________________________________
> BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> 
> ______________ numtxqs=8, vhosts=5  ____________________
> #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> ________________________________________________________
> 1           -.76    -1.56     2.33      0        3.03
> 2           17.41    11.11    11.41     0       -4.76
> 4           42.12    55.11    30.20     19.51    .62
> 8           54.69    80.00    39.22     24.39    -3.88
> 16          54.77    81.62    40.89     20.34    -6.58
> 24          54.66    79.68    41.57     15.49    -8.99
> 32          54.92    76.82    41.79     17.59    -5.70
> 40          51.79    68.56    40.53     15.31    -3.87
> 48          51.72    66.40    40.84     9.72     -7.13
> 64          51.11    63.94    41.10     5.93     -8.82
> 80          46.51    59.50    39.80     9.33     -4.18
> 96          47.72    57.75    39.84     4.20     -7.62
> 128         54.35    58.95    40.66     3.24     -8.63
> ________________________________________________________
> BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
> 
> ______________ numtxqs=16, vhosts=5  ___________________
> #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> ________________________________________________________
> 1           -1.43    -3.52    1.55      0          3.03
> 2           33.09     21.63   20.12    -10.00     -9.52
> 4           67.17     94.60   44.28     19.51     -11.80
> 8           75.72     108.14  49.15     25.00     -10.71
> 16          80.34     101.77  52.94     25.93     -4.49
> 24          70.84     93.12   43.62     27.63     -5.03
> 32          69.01     94.16   47.33     29.68     -1.51
> 40          58.56     63.47   25.91    -3.92      -25.85
> 48          61.16     74.70   34.88     .89       -22.08
> 64          54.37     69.09   26.80    -6.68      -30.04
> 80          36.22     22.73   -2.97    -8.25      -27.23
> 96          41.51     50.59   13.24     9.84      -16.77
> 128         48.98     38.15   6.41     -.33       -22.80
> ________________________________________________________
> BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%
> 
> ______________ numtxqs=32, vhosts=5  ___________________
> #            BW%       CPU%    RCPU%    SD%     RSD%
> ________________________________________________________
> 1            7.62     -38.03   -26.26  -50.00   -33.33
> 2            28.95     20.46    21.62   0       -7.14
> 4            84.05     60.79    45.74  -2.43    -12.42
> 8            86.43     79.57    50.32   15.85   -3.10
> 16           88.63     99.48    58.17   9.47    -13.10
> 24           74.65     80.87    41.99  -1.81    -22.89
> 32           63.86     59.21    23.58  -18.13   -36.37
> 40           64.79     60.53    22.23  -15.77   -35.84
> 48           49.68     26.93    .51    -36.40   -49.61
> 64           54.69     36.50    5.41   -26.59   -43.23
> 80           45.06     12.72   -13.25  -37.79   -52.08
> 96           40.21    -3.16    -24.53  -39.92   -52.97
> 128          36.33    -33.19   -43.66  -5.68    -20.49
> ________________________________________________________
> BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%
> 
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-22  7:47 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Simon Horman
@ 2011-02-23  5:22   ` Krishna Kumar2
  2011-02-23  6:39     ` Michael S. Tsirkin
  2011-02-23 22:59     ` Simon Horman
  0 siblings, 2 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2011-02-23  5:22 UTC (permalink / raw)
  To: Simon Horman
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, mst, netdev, rusty

Simon Horman <horms@verge.net.au> wrote on 02/22/2011 01:17:09 PM:

Hi Simon,


> I have a few questions about the results below:
>
> 1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

> 2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but I don't have
the results now as they are really old. I will be running
it again.

> 3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The
script does:

netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

> Also, I'm interested to know what the status of these patches is.
> Are you planing a fresh series?

Yes. Michael Tsirkin had wanted to see what the MQ RX patch
would look like, so I was in the process of getting the two
working together. The patch is ready and is being tested.
Should I send an RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help
host->guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements. Remote testing is still to be done.

Thanks,

- KK

> >                   Changes from rev2:
> >                   ------------------
> > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> >    virtio-net and vhost-net.
> > 2. vi->sq[i] is allocated individually, resulting in cache line
> >    aligned sq[0] to sq[n].  Another option was to define
> >    'send_queue' as:
> >        struct send_queue {
> >                struct virtqueue *svq;
> >                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> >        } ____cacheline_aligned_in_smp;
> >    and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> >    the submitted method is preferable.
> > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> >    handles TX[0-n].
> > 4. Further change TX handling such that vhost[0] handles both RX/TX
> >    for single stream case.
> >
> >                   Enabling MQ on virtio:
> >                   -----------------------
> > When following options are passed to qemu:
> >         - smp > 1
> >         - vhost=on
> >         - mq=on (new option, default:off)
> > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> >         vhost=on                   ->   #txqueues = 1
> >         vhost=on,mq=on             ->   #txqueues = 4
> >         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> >         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> >
> >
> >                    Performance (guest -> local host):
> >                    -----------------------------------
> > System configuration:
> >         Host:  8 Intel Xeon, 8 GB memory
> >         Guest: 4 cpus, 2 GB memory
> > Test: Each test case runs for 60 secs, sum over three runs (except
> > when number of netperf sessions is 1, which has 10 runs of 12 secs
> > each).  No tuning (default netperf) other than taskset vhost's to
> > cpus 0-3.  numtxqs=32 gave the best results though the guest had
> > only 4 vcpus (I haven't tried beyond that).
> >
> > ______________ numtxqs=2, vhosts=3  ____________________
> > #sessions  BW%      CPU%    RCPU%    SD%      RSD%
> > ________________________________________________________
> > 1          4.46    -1.96     .19     -12.50   -6.06
> > 2          4.93    -1.16    2.10      0       -2.38
> > 4          46.17    64.77   33.72     19.51   -2.48
> > 8          47.89    70.00   36.23     41.46    13.35
> > 16         48.97    80.44   40.67     21.11   -5.46
> > 24         49.03    78.78   41.22     20.51   -4.78
> > 32         51.11    77.15   42.42     15.81   -6.87
> > 40         51.60    71.65   42.43     9.75    -8.94
> > 48         50.10    69.55   42.85     11.80   -5.81
> > 64         46.24    68.42   42.67     14.18   -3.28
> > 80         46.37    63.13   41.62     7.43    -6.73
> > 96         46.40    63.31   42.20     9.36    -4.78
> > 128        50.43    62.79   42.16     13.11   -1.23
> > ________________________________________________________
> > BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> >
> > ______________ numtxqs=8, vhosts=5  ____________________
> > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > ________________________________________________________
> > 1           -.76    -1.56     2.33      0        3.03
> > 2           17.41    11.11    11.41     0       -4.76
> > 4           42.12    55.11    30.20     19.51    .62
> > 8           54.69    80.00    39.22     24.39    -3.88
> > 16          54.77    81.62    40.89     20.34    -6.58
> > 24          54.66    79.68    41.57     15.49    -8.99
> > 32          54.92    76.82    41.79     17.59    -5.70
> > 40          51.79    68.56    40.53     15.31    -3.87
> > 48          51.72    66.40    40.84     9.72     -7.13
> > 64          51.11    63.94    41.10     5.93     -8.82
> > 80          46.51    59.50    39.80     9.33     -4.18
> > 96          47.72    57.75    39.84     4.20     -7.62
> > 128         54.35    58.95    40.66     3.24     -8.63
> > ________________________________________________________
> > BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
> >
> > ______________ numtxqs=16, vhosts=5  ___________________
> > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > ________________________________________________________
> > 1           -1.43    -3.52    1.55      0          3.03
> > 2           33.09     21.63   20.12    -10.00     -9.52
> > 4           67.17     94.60   44.28     19.51     -11.80
> > 8           75.72     108.14  49.15     25.00     -10.71
> > 16          80.34     101.77  52.94     25.93     -4.49
> > 24          70.84     93.12   43.62     27.63     -5.03
> > 32          69.01     94.16   47.33     29.68     -1.51
> > 40          58.56     63.47   25.91    -3.92      -25.85
> > 48          61.16     74.70   34.88     .89       -22.08
> > 64          54.37     69.09   26.80    -6.68      -30.04
> > 80          36.22     22.73   -2.97    -8.25      -27.23
> > 96          41.51     50.59   13.24     9.84      -16.77
> > 128         48.98     38.15   6.41     -.33       -22.80
> > ________________________________________________________
> > BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%
> >
> > ______________ numtxqs=32, vhosts=5  ___________________
> > #            BW%       CPU%    RCPU%    SD%     RSD%
> > ________________________________________________________
> > 1            7.62     -38.03   -26.26  -50.00   -33.33
> > 2            28.95     20.46    21.62   0       -7.14
> > 4            84.05     60.79    45.74  -2.43    -12.42
> > 8            86.43     79.57    50.32   15.85   -3.10
> > 16           88.63     99.48    58.17   9.47    -13.10
> > 24           74.65     80.87    41.99  -1.81    -22.89
> > 32           63.86     59.21    23.58  -18.13   -36.37
> > 40           64.79     60.53    22.23  -15.77   -35.84
> > 48           49.68     26.93    .51    -36.40   -49.61
> > 64           54.69     36.50    5.41   -26.59   -43.23
> > 80           45.06     12.72   -13.25  -37.79   -52.08
> > 96           40.21    -3.16    -24.53  -39.92   -52.97
> > 128          36.33    -33.19   -43.66  -5.68    -20.49
> > ________________________________________________________
> > BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-23  5:22   ` Krishna Kumar2
@ 2011-02-23  6:39     ` Michael S. Tsirkin
  2011-02-23  6:48       ` Krishna Kumar2
  2011-02-23 22:59     ` Simon Horman
  1 sibling, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2011-02-23  6:39 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Simon Horman, anthony, arnd, avi, davem, eric.dumazet, kvm,
	netdev, rusty

On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman <horms@verge.net.au> wrote on 02/22/2011 01:17:09 PM:
> 
> Hi Simon,
> 
> 
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
> 
> Yes - mainline kernel with transmit-only MQ patch.
> 
> > 2. Was UDP or TCP used?
> 
> TCP. I had done some initial testing on UDP, but don't have
> the results now as it is really old. But I will be running
> it again.
> 
> > 3. What was the transmit size (-m option to netperf)?
> 
> I didn't use the -m option, so it defaults to 16K. The
> script does:
> 
> netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
> 
> > Also, I'm interested to know what the status of these patches is.
> > Are you planing a fresh series?
> 
> Yes. Michael Tsirkin had wanted to see how the MQ RX patch
> would look like, so I was in the process of getting the two
> working together. The patch is ready and is being tested.
> Should I send a RFC patch at this time?

Yes, please do.

> The TX-only patch helped the guest TX path but didn't help
> host->guest much (as tested using TCP_MAERTS from the guest).
> But with the TX+RX patch, both directions are getting
> improvements.

Also, my hope is that with appropriate queue mapping,
we might be able to do away with the heuristics that the
TX-only code needs to detect single-stream load.
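
As a purely illustrative sketch (not the posted code), hash-based queue
selection in the guest driver would look something like the function
below; the name is invented, and the point is only that a single flow
then always maps to one queue without any special-casing:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Illustrative only: pick the TX queue from the flow hash so that a
     * single stream stays on one queue and multiple streams spread out,
     * with no separate single-stream heuristic.
     */
    static u16 virtnet_sketch_select_queue(struct net_device *dev,
                                           struct sk_buff *skb)
    {
            return skb_tx_hash(dev, skb);
    }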

> Remote testing is still to be done.

Others might be able to help here once you post the patch.

> Thanks,
> 
> - KK
> 
> > >                   Changes from rev2:
> > >                   ------------------
> > > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> > >    virtio-net and vhost-net.
> > > 2. vi->sq[i] is allocated individually, resulting in cache line
> > >    aligned sq[0] to sq[n].  Another option was to define
> > >    'send_queue' as:
> > >        struct send_queue {
> > >                struct virtqueue *svq;
> > >                struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> > >        } ____cacheline_aligned_in_smp;
> > >    and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> > >    the submitted method is preferable.
> > > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> > >    handles TX[0-n].
> > > 4. Further change TX handling such that vhost[0] handles both RX/TX
> > >    for single stream case.
> > >
> > >                   Enabling MQ on virtio:
> > >                   -----------------------
> > > When following options are passed to qemu:
> > >         - smp > 1
> > >         - vhost=on
> > >         - mq=on (new option, default:off)
> > > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> > >         vhost=on                   ->   #txqueues = 1
> > >         vhost=on,mq=on             ->   #txqueues = 4
> > >         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> > >         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> > >
> > >
> > >                    Performance (guest -> local host):
> > >                    -----------------------------------
> > > System configuration:
> > >         Host:  8 Intel Xeon, 8 GB memory
> > >         Guest: 4 cpus, 2 GB memory
> > > Test: Each test case runs for 60 secs, sum over three runs (except
> > > when number of netperf sessions is 1, which has 10 runs of 12 secs
> > > each).  No tuning (default netperf) other than taskset vhost's to
> > > cpus 0-3.  numtxqs=32 gave the best results though the guest had
> > > only 4 vcpus (I haven't tried beyond that).
> > >
> > > ______________ numtxqs=2, vhosts=3  ____________________
> > > #sessions  BW%      CPU%    RCPU%    SD%      RSD%
> > > ________________________________________________________
> > > 1          4.46    -1.96     .19     -12.50   -6.06
> > > 2          4.93    -1.16    2.10      0       -2.38
> > > 4          46.17    64.77   33.72     19.51   -2.48
> > > 8          47.89    70.00   36.23     41.46    13.35
> > > 16         48.97    80.44   40.67     21.11   -5.46
> > > 24         49.03    78.78   41.22     20.51   -4.78
> > > 32         51.11    77.15   42.42     15.81   -6.87
> > > 40         51.60    71.65   42.43     9.75    -8.94
> > > 48         50.10    69.55   42.85     11.80   -5.81
> > > 64         46.24    68.42   42.67     14.18   -3.28
> > > 80         46.37    63.13   41.62     7.43    -6.73
> > > 96         46.40    63.31   42.20     9.36    -4.78
> > > 128        50.43    62.79   42.16     13.11   -1.23
> > > ________________________________________________________
> > > BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> > >
> > > ______________ numtxqs=8, vhosts=5  ____________________
> > > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > > ________________________________________________________
> > > 1           -.76    -1.56     2.33      0        3.03
> > > 2           17.41    11.11    11.41     0       -4.76
> > > 4           42.12    55.11    30.20     19.51    .62
> > > 8           54.69    80.00    39.22     24.39    -3.88
> > > 16          54.77    81.62    40.89     20.34    -6.58
> > > 24          54.66    79.68    41.57     15.49    -8.99
> > > 32          54.92    76.82    41.79     17.59    -5.70
> > > 40          51.79    68.56    40.53     15.31    -3.87
> > > 48          51.72    66.40    40.84     9.72     -7.13
> > > 64          51.11    63.94    41.10     5.93     -8.82
> > > 80          46.51    59.50    39.80     9.33     -4.18
> > > 96          47.72    57.75    39.84     4.20     -7.62
> > > 128         54.35    58.95    40.66     3.24     -8.63
> > > ________________________________________________________
> > > BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
> > >
> > > ______________ numtxqs=16, vhosts=5  ___________________
> > > #sessions   BW%      CPU%     RCPU%     SD%      RSD%
> > > ________________________________________________________
> > > 1           -1.43    -3.52    1.55      0          3.03
> > > 2           33.09     21.63   20.12    -10.00     -9.52
> > > 4           67.17     94.60   44.28     19.51     -11.80
> > > 8           75.72     108.14  49.15     25.00     -10.71
> > > 16          80.34     101.77  52.94     25.93     -4.49
> > > 24          70.84     93.12   43.62     27.63     -5.03
> > > 32          69.01     94.16   47.33     29.68     -1.51
> > > 40          58.56     63.47   25.91    -3.92      -25.85
> > > 48          61.16     74.70   34.88     .89       -22.08
> > > 64          54.37     69.09   26.80    -6.68      -30.04
> > > 80          36.22     22.73   -2.97    -8.25      -27.23
> > > 96          41.51     50.59   13.24     9.84      -16.77
> > > 128         48.98     38.15   6.41     -.33       -22.80
> > > ________________________________________________________
> > > BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%
> > >
> > > ______________ numtxqs=32, vhosts=5  ___________________
> > > #            BW%       CPU%    RCPU%    SD%     RSD%
> > > ________________________________________________________
> > > 1            7.62     -38.03   -26.26  -50.00   -33.33
> > > 2            28.95     20.46    21.62   0       -7.14
> > > 4            84.05     60.79    45.74  -2.43    -12.42
> > > 8            86.43     79.57    50.32   15.85   -3.10
> > > 16           88.63     99.48    58.17   9.47    -13.10
> > > 24           74.65     80.87    41.99  -1.81    -22.89
> > > 32           63.86     59.21    23.58  -18.13   -36.37
> > > 40           64.79     60.53    22.23  -15.77   -35.84
> > > 48           49.68     26.93    .51    -36.40   -49.61
> > > 64           54.69     36.50    5.41   -26.59   -43.23
> > > 80           45.06     12.72   -13.25  -37.79   -52.08
> > > 96           40.21    -3.16    -24.53  -39.92   -52.97
> > > 128          36.33    -33.19   -43.66  -5.68    -20.49
> > > ________________________________________________________
> > > BW: 49.3%,  CPU/RCPU: 15.5%,-8.2%,  SD/RSD: -22.2%,-37.0%

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-23  6:39     ` Michael S. Tsirkin
@ 2011-02-23  6:48       ` Krishna Kumar2
  2011-02-23 15:55         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Krishna Kumar2 @ 2011-02-23  6:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, Simon Horman, kvm,
	netdev, rusty

> "Michael S. Tsirkin" <mst@redhat.com> wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

> > Yes. Michael Tsirkin had wanted to see how the MQ RX patch
> > would look like, so I was in the process of getting the two
> > working together. The patch is ready and is being tested.
> > Should I send a RFC patch at this time?
>
> Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

> > The TX-only patch helped the guest TX path but didn't help
> > host->guest much (as tested using TCP_MAERTS from the guest).
> > But with the TX+RX patch, both directions are getting
> > improvements.
>
> Also, my hope is that with appropriate queue mapping,
> we might be able to do away with heuristics to detect
> single stream load that TX only code needs.

Yes, that whole stuff is removed, and the TX/RX path is
unchanged with this patch (thankfully :)
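
(For illustration, a minimal sketch of the kind of "appropriate queue
mapping" Michael mentions above - this is an assumption, not code from
the posted patches, and the name virtnet_select_queue is hypothetical.
Hashing each flow to a fixed TX queue means a single stream always
lands on the same queue, so no separate single-stream heuristic is
needed.)

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical .ndo_select_queue hook for a multiqueue virtio-net
     * device: pick the TX queue from the skb's flow hash.  A single
     * TCP stream keeps hashing to the same txq, which is what lets
     * the explicit single-stream detection be dropped. */
    static u16 virtnet_select_queue(struct net_device *dev,
                                    struct sk_buff *skb)
    {
            /* skb_tx_hash() spreads flows over dev->real_num_tx_queues */
            return skb_tx_hash(dev, skb);
    }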

> > Remote testing is still to be done.
>
> Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks,

- KK



* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-23  6:48       ` Krishna Kumar2
@ 2011-02-23 15:55         ` Michael S. Tsirkin
  2011-02-24 11:48           ` Krishna Kumar2
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2011-02-23 15:55 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, Simon Horman, kvm,
	netdev, rusty

On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote on 02/23/2011 12:09:15 PM:
> 
> Hi Michael,
> 
> > > Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> > > would look like, so I was in the process of getting the two
> > > working together. The patch is ready and is being tested.
> > > Should I send an RFC patch at this time?
> >
> > Yes, please do.
> 
> Sure, will get a build/test on latest bits and send in 1-2 days.
> 
> > > The TX-only patch helped the guest TX path but didn't help
> > > host->guest much (as tested using TCP_MAERTS from the guest).
> > > But with the TX+RX patch, both directions are getting
> > > improvements.
> >
> > Also, my hope is that with appropriate queue mapping,
> > we might be able to do away with heuristics to detect
> > single stream load that TX only code needs.
> 
> Yes, that whole stuff is removed, and the TX/RX path is
> unchanged with this patch (thankfully :)

Cool. I was wondering whether, in that case, we could
do without host kernel changes at all
and use a separate fd for each TX/RX pair.
The advantage of that approach is that
the max fd limit naturally sets an upper bound
on the amount of resources userspace can use up.

Thoughts?

In any case, pls don't let the above delay
sending an RFC.
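
(A rough userspace sketch of the per-queue-pair fd idea above - again
an assumption for illustration, not qemu code: open one /dev/vhost-net
fd per TX/RX pair, so RLIMIT_NOFILE naturally caps how many vhost
instances one process can hold.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(int argc, char **argv)
    {
            int nqueues = argc > 1 ? atoi(argv[1]) : 4;
            struct rlimit rl;
            int i;

            getrlimit(RLIMIT_NOFILE, &rl);
            printf("per-process fd limit: %llu\n",
                   (unsigned long long)rl.rlim_cur);

            for (i = 0; i < nqueues; i++) {
                    /* one vhost-net instance per TX/RX pair; open()
                     * fails with EMFILE once the fd limit is hit */
                    int fd = open("/dev/vhost-net", O_RDWR);
                    if (fd < 0) {
                            perror("open /dev/vhost-net");
                            return 1;
                    }
                    printf("queue pair %d -> vhost fd %d\n", i, fd);
            }
            return 0;
    }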

> > > Remote testing is still to be done.
> >
> > Others might be able to help here once you post the patch.
> 
> That's great, will appreciate any help.
> 
> Thanks,
> 
> - KK


* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-23  5:22   ` Krishna Kumar2
  2011-02-23  6:39     ` Michael S. Tsirkin
@ 2011-02-23 22:59     ` Simon Horman
  1 sibling, 0 replies; 35+ messages in thread
From: Simon Horman @ 2011-02-23 22:59 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, mst, netdev, rusty

On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman <horms@verge.net.au> wrote on 02/22/2011 01:17:09 PM:
> 
> Hi Simon,
> 
> 
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
> 
> Yes - mainline kernel with transmit-only MQ patch.
> 
> > 2. Was UDP or TCP used?
> 
> TCP. I had done some initial testing on UDP, but I don't have
> those results now as they are really old. I will be running
> the UDP tests again.
> 
> > 3. What was the transmit size (-m option to netperf)?
> 
> I didn't use the -m option, so it defaults to 16K. The
> script does:
> 
> netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
> 
> > Also, I'm interested to know what the status of these patches is.
> > Are you planing a fresh series?
> 
> Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> would look like, so I was in the process of getting the two
> working together. The patch is ready and is being tested.
> Should I send an RFC patch at this time?
> 
> The TX-only patch helped the guest TX path but didn't help
> host->guest much (as tested using TCP_MAERTS from the guest).
> But with the TX+RX patch, both directions are getting
> improvements. Remote testing is still to be done.

Hi Krishna,

thanks for clarifying the test results.
I'm looking forward to the forthcoming RFC patches.


* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
  2011-02-23 15:55         ` Michael S. Tsirkin
@ 2011-02-24 11:48           ` Krishna Kumar2
  0 siblings, 0 replies; 35+ messages in thread
From: Krishna Kumar2 @ 2011-02-24 11:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, Simon Horman, kvm,
	netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 02/23/2011 09:25:34 PM:

> > Sure, will get a build/test on latest bits and send in 1-2 days.
> >
> > > > The TX-only patch helped the guest TX path but didn't help
> > > > host->guest much (as tested using TCP_MAERTS from the guest).
> > > > But with the TX+RX patch, both directions are getting
> > > > improvements.
> > >
> > > Also, my hope is that with appropriate queue mapping,
> > > we might be able to do away with heuristics to detect
> > > single stream load that TX only code needs.
> >
> > Yes, that whole stuff is removed, and the TX/RX path is
> > unchanged with this patch (thankfully :)
>
> Cool. I was wondering whether, in that case, we could
> do without host kernel changes at all
> and use a separate fd for each TX/RX pair.
> The advantage of that approach is that
> the max fd limit naturally sets an upper bound
> on the amount of resources userspace can use up.
>
> Thoughts?
>
> In any case, pls don't let the above delay
> sending an RFC.

I will look into this also.

Please excuse the delay in getting the patch out - my
bits are a little old, so it is taking some time to move to
the latest kernel and get some initial TCP/UDP test results.
I should have it ready by tomorrow.

Thanks,

- KK



end of thread

Thread overview: 35+ messages
2010-10-20  8:54 [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-10-20  8:54 ` [v3 RFC PATCH 1/4] Change virtqueue structure Krishna Kumar
2010-10-20  8:55 ` [v3 RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
2010-10-20  8:55 ` [v3 RFC PATCH 3/4] Changes for vhost Krishna Kumar
2010-10-20  8:55 ` [v3 RFC PATCH 4/4] qemu changes Krishna Kumar
2010-10-25 15:50 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar2
2010-10-25 16:17   ` Michael S. Tsirkin
2010-10-26  5:10     ` Krishna Kumar2
     [not found]     ` <OF5C53E9CF.FFDF2CE7-ON652577C8.00191D14-652577C8.001C2154@LocalDomain>
2010-10-26  9:08       ` Krishna Kumar2
2010-10-26  9:38         ` Michael S. Tsirkin
2010-10-26 10:01           ` Krishna Kumar2
2010-10-26 11:09             ` Michael S. Tsirkin
2010-10-28  5:14               ` Krishna Kumar2
2010-10-28  5:50                 ` Michael S. Tsirkin
2010-10-28  6:12                   ` Krishna Kumar2
2010-10-28  6:18                     ` Michael S. Tsirkin
     [not found]               ` <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>
2010-10-28  7:18                 ` Krishna Kumar2
2010-10-29 11:26                   ` Michael S. Tsirkin
2010-10-29 14:57                     ` linux_kvm
2010-11-03  7:01                   ` Michael S. Tsirkin
2010-10-26  8:57   ` Michael S. Tsirkin
2010-11-09  4:38     ` Krishna Kumar2
2010-11-09 13:22       ` Michael S. Tsirkin
2010-11-09 15:28         ` Krishna Kumar2
2010-11-09 15:33           ` Michael S. Tsirkin
2010-11-09 17:24             ` Krishna Kumar2
2010-11-10 16:16               ` Michael S. Tsirkin
     [not found]         ` <OF24E08752.2087FFA4-ON652577D6.00532DF1-652577D6.0054B291@LocalDomain>
2010-11-16  7:25           ` MQ performance on other cards (cxgb3) Krishna Kumar2
2011-02-22  7:47 ` [v3 RFC PATCH 0/4] Implement multiqueue virtio-net Simon Horman
2011-02-23  5:22   ` Krishna Kumar2
2011-02-23  6:39     ` Michael S. Tsirkin
2011-02-23  6:48       ` Krishna Kumar2
2011-02-23 15:55         ` Michael S. Tsirkin
2011-02-24 11:48           ` Krishna Kumar2
2011-02-23 22:59     ` Simon Horman
