* [RFC PATCH 0/4] Implement multiqueue virtio-net
@ 2010-09-08  7:28 Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:28 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, kvm, anthony, Krishna Kumar, mst

The following patches implement transmit mq in virtio-net.  The
corresponding userspace qemu changes are also included.

1. This feature was first implemented with a single vhost.
   Testing showed a 3-8% performance gain for up to 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, implementing a per-txq vhost improved
   BW significantly all the way to 128 sessions.
2. For this mq TX patch, 1 daemon is created for RX and 'n'
   daemons for the 'n' TXQ's, for a total of (n+1) daemons
   (see the layout sketch below).  The (subsequent) RX mq
   patch changes that to a total of 'n' daemons, where RX
   and TX vq's share 1 daemon.
3. Service Demand increases for TCP, but improves
   significantly for UDP.
4. Interoperability: many combinations of qemu, host and
   guest, though not all, were tested together.
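
The vq <-> daemon layout with this patch is as follows (worker
names follow the "vhost-<pid>-<vq#>" naming used in patch 3/4):

        vq[0]:  RX        ->  vhost-<pid>-0
        vq[1]:  TXQ #0    ->  vhost-<pid>-1
        ...
        vq[n]:  TXQ #n-1  ->  vhost-<pid>-n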


                  Enabling mq on virtio:
                  -----------------------

When the following options are passed to qemu:
        - smp > 1
        - vhost=on
        - mq=on (new option, default: off)
then #txqueues = #cpus.  The #txqueues can be changed using the
optional 'numtxqs' option, e.g. for an smp=4 guest:
        vhost=on,mq=on             ->   #txqueues = 4
        vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
        vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
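
For illustration only, a hypothetical invocation could look like
the following (exact placement of the new options is determined by
the qemu changes in patch 4/4):

        qemu-system-x86_64 ... -smp 4 \
            -netdev tap,id=hostnet0,vhost=on,mq=on,numtxqs=8 \
            -device virtio-net-pci,netdev=hostnet0 ...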


                   Performance (guest -> local host):
                   -----------------------------------

System configuration:
        Host:  8 Intel Xeon, 8 GB memory
        Guest: 4 cpus, 2 GB memory
All testing was done without any tuning; TCP netperf used 64K I/O
_______________________________________________________________________________
                           TCP (#numtxqs=2)
N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
_______________________________________________________________________________
4       26387   40716 (54.30)   20      28   (40.00)    86      85     (-1.16)
8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
_______________________________________________________________________________
                       UDP (#numtxqs=8)
N#      BW1     BW2   (%)      SD1     SD2   (%)
__________________________________________________________
4       29836   56761 (90.24)   67      63    (-5.97)
8       27666   63767 (130.48)  326     265   (-18.71)
16      25452   60665 (138.35)  1396    1269  (-9.09)
32      26172   63491 (142.59)  5617    4202  (-25.19)
48      26146   64629 (147.18)  12813   9316  (-27.29)
64      25575   65448 (155.90)  23063   16346 (-29.12)
128     26454   63772 (141.06)  91054   85051 (-6.59)
__________________________________________________________
N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
              SD for original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
              SD for new code. e.g. BW2=40716 means average BW2 was
              20358 mbps.


                       Next steps:
                       -----------

1. The mq RX patch is also complete - I plan to submit it once TX is OK.
2. Cache-align data structures: I didn't see any BW/SD improvement
   after making the sq's (and similarly for vhost) cache-aligned
   statically:
        struct virtnet_info {
                ...
                struct send_queue sq[16] ____cacheline_aligned_in_smp;
                ...
        };

Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts 
      CPU0     CPU1     CPU2    CPU3       
40:   0        0        0       0        PCI-MSI-edge  virtio0-config
41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

Review/feedback appreciated.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---


* [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
@ 2010-09-08  7:29 ` Krishna Kumar
  2010-09-09  3:49   ` Rusty Russell
  2010-09-08  7:29 ` [RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:29 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, Krishna Kumar, anthony, kvm, mst

Add virtio_get_queue_index() to get the queue index of a
vq.  This is needed by the callback handler to locate the
queue that should be processed.
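
For example, the virtio-net TX completion callback (patch 2/4) uses
it to find which subqueue to wake; the single RX vq is index 0, so
TX vq's start at index 1:

	static void skb_xmit_done(struct virtqueue *svq)
	{
		struct virtnet_info *vi = svq->vdev->priv;
		int qnum = virtio_get_queue_index(svq) - 1;	/* 0 is RX vq */

		/* Suppress further interrupts. */
		virtqueue_disable_cb(svq);

		/* We were probably waiting for more output buffers. */
		netif_wake_subqueue(vi->dev, qnum);
	}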

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/virtio/virtio_pci.c |    9 +++++++++
 include/linux/virtio.h      |    3 +++
 2 files changed, 12 insertions(+)

diff -ruNp org/include/linux/virtio.h tx_only/include/linux/virtio.h
--- org/include/linux/virtio.h	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/include/linux/virtio.h	2010-09-08 10:23:36.000000000 +0530
@@ -136,4 +136,7 @@ struct virtio_driver {
 
 int register_virtio_driver(struct virtio_driver *drv);
 void unregister_virtio_driver(struct virtio_driver *drv);
+
+/* return the internal queue index associated with the virtqueue */
+extern int virtio_get_queue_index(struct virtqueue *vq);
 #endif /* _LINUX_VIRTIO_H */
diff -ruNp org/drivers/virtio/virtio_pci.c tx_only/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/virtio/virtio_pci.c	2010-09-08 10:23:16.000000000 +0530
@@ -359,6 +359,15 @@ static int vp_request_intx(struct virtio
 	return err;
 }
 
+/* Return the internal queue index associated with the virtqueue */
+int virtio_get_queue_index(struct virtqueue *vq)
+{
+	struct virtio_pci_vq_info *info = vq->priv;
+
+	return info->queue_index;
+}
+EXPORT_SYMBOL(virtio_get_queue_index);
+
 static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 				  void (*callback)(struct virtqueue *vq),
 				  const char *name,


* [RFC PATCH 2/4] Changes for virtio-net
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
@ 2010-09-08  7:29 ` Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 3/4] Changes for vhost Krishna Kumar
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:29 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, mst, anthony, Krishna Kumar, kvm

Implement the mq virtio-net driver.

Though struct virtio_net_config changes, it works with old
qemus since the last element is not accessed unless qemu
sets VIRTIO_NET_F_NUMTXQS.
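
The probe path only touches the new config field when the feature
bit is offered (condensed from virtnet_probe below):

	u16 numtxqs = 1;	/* default: single TX queue */

	if (virtio_has_feature(vdev, VIRTIO_NET_F_NUMTXQS))
		vdev->config->get(vdev,
			  offsetof(struct virtio_net_config, numtxqs),
			  &numtxqs, sizeof(numtxqs));

	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);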

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/virtio_net.c   |  213 ++++++++++++++++++++++++++---------
 include/linux/virtio_net.h |    6 
 2 files changed, 166 insertions(+), 53 deletions(-)

diff -ruNp org/include/linux/virtio_net.h tx_only/include/linux/virtio_net.h
--- org/include/linux/virtio_net.h	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/include/linux/virtio_net.h	2010-09-08 10:39:22.000000000 +0530
@@ -7,6 +7,9 @@
 #include <linux/virtio_config.h>
 #include <linux/if_ether.h>
 
+/* The maximum number of transmit queues supported */
+#define VIRTIO_MAX_TXQS		16
+
 /* The feature bitmap for virtio net */
 #define VIRTIO_NET_F_CSUM	0	/* Host handles pkts w/ partial csum */
 #define VIRTIO_NET_F_GUEST_CSUM	1	/* Guest handles pkts w/ partial csum */
@@ -26,6 +29,7 @@
 #define VIRTIO_NET_F_CTRL_RX	18	/* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -34,6 +38,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* number of transmit queues */
+	__u16 numtxqs;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org/drivers/net/virtio_net.c tx_only/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/net/virtio_net.c	2010-09-08 12:14:19.000000000 +0530
@@ -40,9 +40,20 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
+/* Our representation of a send virtqueue */
+struct send_queue {
+	struct virtqueue *svq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	int numtxqs;			/* Number of tx queues */
+	struct send_queue *sq;
+	struct virtqueue *rvq;
+	struct virtqueue *cvq;
 	struct net_device *dev;
 	struct napi_struct napi;
 	unsigned int status;
@@ -62,9 +73,8 @@ struct virtnet_info {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* fragments + linear part + virtio header */
+	/* RX: fragments + linear part + virtio header */
 	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -120,12 +130,13 @@ static struct page *get_a_page(struct vi
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
+	int qnum = virtio_get_queue_index(svq) - 1;	/* 0 is RX vq */
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, qnum);
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -495,12 +506,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
@@ -510,7 +522,8 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *svq, struct scatterlist *tx_sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -548,12 +561,12 @@ static int xmit_skb(struct virtnet_info 
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
 					0, skb);
 }
 
@@ -561,31 +574,34 @@ static netdev_tx_t start_xmit(struct sk_
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *svq = vi->sq[qnum].svq;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, svq, vi->sq[qnum].tx_sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (net_ratelimit()) {
 			if (likely(capacity == -ENOMEM)) {
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					 "TXQ (%d) failure: out of memory\n",
+					 qnum);
 			} else {
 				dev->stats.tx_fifo_errors++;
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					 "Unexpected TXQ (%d) failure: %d\n",
+					 qnum, capacity);
 			}
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -594,13 +610,13 @@ static netdev_tx_t start_xmit(struct sk_
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -871,10 +887,10 @@ static void virtnet_update_status(struct
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 }
 
@@ -885,18 +901,112 @@ static void virtnet_config_changed(struc
 	virtnet_update_status(vi);
 }
 
+#define MAX_DEVICE_NAME		16
+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;
+
+	/* Allocate send queues */
+	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
+	if (!vi->sq)
+		goto out;
+
+	/* setup initial send queue parameters */
+	for (i = 0; i < numtxqs; i++)
+		sg_init_table(vi->sq[i].tx_sg, ARRAY_SIZE(vi->sq[i].tx_sg));
+
+	/*
+	 * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
+	 * optionally one control virtqueue.
+	 */
+	totalvqs = 1 + numtxqs +
+		   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Setup parameters for find_vqs */
+	vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto free_mem;
+
+	/* Parameters for recv virtqueue */
+	callbacks[0] = skb_recv_done;
+	names[0] = "input";
+
+	/* Parameters for send virtqueues */
+	for (i = 1; i <= numtxqs; i++) {
+		callbacks[i] = skb_xmit_done;
+		names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
+				   GFP_KERNEL);
+		if (!names[i])
+			goto free_mem;
+		sprintf(names[i], "output.%d", i - 1);
+	}
+
+	/* Parameters for control virtqueue, if any */
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "control";
+	}
+
+	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
+					 (const char **)names);
+	if (err)
+		goto free_mem;
+
+	vi->rvq = vqs[0];
+	for (i = 0; i < numtxqs; i++)
+		vi->sq[i].svq = vqs[i + 1];
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		vi->cvq = vqs[i + 1];
+
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	}
+
+free_mem:
+	if (names) {
+		for (i = 1; i <= numtxqs; i++)
+			kfree(names[i]);
+		kfree(names);
+	}
+
+	kfree(callbacks);
+	kfree(vqs);
+
+	if (err)
+		kfree(vi->sq);
+
+out:
+	return err;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int err;
+	u16 numtxqs = 1;
 	struct net_device *dev;
 	struct virtnet_info *vi;
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs;
+
+	/* Find how many transmit queues the device supports */
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_NUMTXQS)) {
+		vdev->config->get(vdev,
+			  offsetof(struct virtio_net_config, numtxqs),
+			  &numtxqs, sizeof(numtxqs));
+		if (numtxqs < 1 || numtxqs > VIRTIO_MAX_TXQS) {
+			printk(KERN_WARNING "%s: Invalid numtxqs: %d\n",
+			       __func__, numtxqs);
+			return -EINVAL;
+		}
+	}
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -940,9 +1050,9 @@ static int virtnet_probe(struct virtio_d
 	vi->vdev = vdev;
 	vdev->priv = vi;
 	vi->pages = NULL;
+	vi->numtxqs = numtxqs;
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -953,23 +1063,10 @@ static int virtnet_probe(struct virtio_d
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
-	err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+	/* Initialize our rx/tx queue parameters, and invoke find_vqs */
+	err = initialize_vqs(vi, numtxqs);
 	if (err)
-		goto free;
-
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
-
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
-
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			dev->features |= NETIF_F_HW_VLAN_FILTER;
-	}
+		goto free_netdev;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -986,6 +1083,9 @@ static int virtnet_probe(struct virtio_d
 		goto unregister;
 	}
 
+	dev_info(&dev->dev, "(virtio-net) Allocated 1 RX and %d TX vq's\n",
+		 numtxqs);
+
 	vi->status = VIRTIO_NET_S_LINK_UP;
 	virtnet_update_status(vi);
 	netif_carrier_on(dev);
@@ -998,7 +1098,8 @@ unregister:
 	cancel_delayed_work_sync(&vi->refill);
 free_vqs:
 	vdev->config->del_vqs(vdev);
-free:
+	kfree(vi->sq);
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
@@ -1006,11 +1107,17 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i].svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
 	}
 	while (1) {
 		buf = virtqueue_detach_unused_buf(vi->rvq);
@@ -1059,7 +1166,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
-	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
 };
 
 static struct virtio_driver virtio_net_driver = {


* [RFC PATCH 3/4] Changes for vhost
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
@ 2010-09-08  7:29 ` Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 4/4] qemu changes Krishna Kumar
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:29 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, kvm, Krishna Kumar, anthony, mst

Changes for mq vhost.

vhost_net_open is changed to allocate a vhost_net and
return.  The remaining initialization is delayed until
SET_OWNER.  SET_OWNER is changed so that its argument
is used to figure out how many txqs to use.  Unmodified
qemus will pass NULL, which is recognized and handled
as numtxqs=1.

Besides changing handle_tx to use 'vq', this patch also
changes handle_rx to take the vq as a parameter.  The mq RX
patch requires this change, but until then it is consistent
(and less confusing) to make the interfaces for handling
rx and tx similar.
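
The corresponding userspace side (patch 4/4) simply passes the
count as the SET_OWNER ioctl argument:

	/* new qemu: request 'numtxqs' TX vq's (plus 1 RX vq) */
	r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);

	/* old qemu: passes NULL, treated by vhost as numtxqs = 1 */
	r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);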

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/vhost/net.c   |  272 ++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.c |  152 ++++++++++++++--------
 drivers/vhost/vhost.h |   15 +-
 3 files changed, 289 insertions(+), 150 deletions(-)

diff -ruNp org/drivers/vhost/net.c tx_only/drivers/vhost/net.c
--- org/drivers/vhost/net.c	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/net.c	2010-09-08 10:20:54.000000000 +0530
@@ -33,12 +33,6 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
-enum {
-	VHOST_NET_VQ_RX = 0,
-	VHOST_NET_VQ_TX = 1,
-	VHOST_NET_VQ_MAX = 2,
-};
-
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -47,12 +41,12 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
-	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct vhost_virtqueue *vqs;
+	struct vhost_poll *poll;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
-	enum vhost_net_poll_state tx_poll_state;
+	enum vhost_net_poll_state *tx_poll_state;
 };
 
 /* Pop first len bytes from iovec. Return number of segments used. */
@@ -92,28 +86,28 @@ static void copy_iovec_hdr(const struct 
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
 {
-	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+	if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
 		return;
-	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+	vhost_poll_stop(&net->poll[qnum]);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
 {
-	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+	if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
 		return;
-	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-	net->tx_poll_state = VHOST_NET_POLL_STARTED;
+	vhost_poll_start(&net->poll[qnum], sock->file);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -134,7 +128,7 @@ static void handle_tx(struct vhost_net *
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
 		mutex_lock(&vq->mutex);
-		tx_poll_start(net, sock);
+		tx_poll_start(net, sock, vq->qnum);
 		mutex_unlock(&vq->mutex);
 		return;
 	}
@@ -144,7 +138,7 @@ static void handle_tx(struct vhost_net *
 	vhost_disable_notify(vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
-		tx_poll_stop(net);
+		tx_poll_stop(net, vq->qnum);
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
@@ -159,7 +153,7 @@ static void handle_tx(struct vhost_net *
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
+				tx_poll_start(net, sock, vq->qnum);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
@@ -189,7 +183,7 @@ static void handle_tx(struct vhost_net *
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			tx_poll_start(net, sock, vq->qnum);
 			break;
 		}
 		if (err != len)
@@ -282,9 +276,9 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
+static void handle_rx_big(struct vhost_virtqueue *vq,
+			  struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned out, in, log, s;
 	int head;
 	struct vhost_log *vq_log;
@@ -393,9 +387,9 @@ static void handle_rx_big(struct vhost_n
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_mergeable(struct vhost_net *net)
+static void handle_rx_mergeable(struct vhost_virtqueue *vq,
+				struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -500,96 +494,179 @@ static void handle_rx_mergeable(struct v
 	unuse_mm(net->dev.mm);
 }
 
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_virtqueue *vq)
 {
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
-		handle_rx_mergeable(net);
+		handle_rx_mergeable(vq, net);
 	else
-		handle_rx_big(net);
+		handle_rx_big(vq, net);
 }
 
 static void handle_tx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_tx(net);
+	handle_tx(vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(vq);
 }
 
 static void handle_tx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_tx(vq);
 }
 
 static void handle_rx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_rx(vq);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
+void vhost_free_vqs(struct vhost_dev *dev)
 {
-	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
 
-	if (!n)
-		return -ENOMEM;
+	kfree(dev->work_list);
+	kfree(dev->work_lock);
+	kfree(n->tx_poll_state);
+	kfree(n->poll);
+	kfree(n->vqs);
 
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
+	/*
+	 * Reset so that vhost_net_release (after vhost_dev_set_owner call)
+	 * will notice.
+	 */
+	n->vqs = NULL;
+	n->poll = NULL;
+	n->tx_poll_state = NULL;
+	dev->work_lock = NULL;
+	dev->work_list = NULL;
+}
+
+/* Upper limit of how many vq's we support - 1 RX and VIRTIO_MAX_TXQS TX vq's */
+#define MAX_VQS		(1 + VIRTIO_MAX_TXQS)
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
+{
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+	int i, nvqs;
+	int ret;
+
+	if (numtxqs < 0 || numtxqs > VIRTIO_MAX_TXQS)
+		return -EINVAL;
+
+	if (numtxqs == 0) {
+		/* Old qemu doesn't pass arguments to set_owner, use 1 txq */
+		numtxqs = 1;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	/* Total number of virtqueues is numtxqs + 1 */
+	nvqs = numtxqs + 1;
+
+	n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
+	n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
+
+	/* Allocate 1 more tx_poll_state than required for convenience */
+	n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
+				   GFP_KERNEL);
+	dev->work_lock = kmalloc(nvqs * sizeof(*dev->work_lock),
+				 GFP_KERNEL);
+	dev->work_list = kmalloc(nvqs * sizeof(*dev->work_list),
+				 GFP_KERNEL);
+
+	if (!n->vqs || !n->poll || !n->tx_poll_state || !dev->work_lock ||
+	    !dev->work_list) {
+		ret = -ENOMEM;
+		goto err;
+	}
 
-	f->private_data = n;
+	/* 1 RX, followed by 'numtxqs' TX queues */
+	n->vqs[0].handle_kick = handle_rx_kick;
+
+	for (i = 1; i < nvqs; i++)
+		n->vqs[i].handle_kick = handle_tx_kick;
+
+	ret = vhost_dev_init(dev, n->vqs, nvqs);
+	if (ret < 0)
+		goto err;
+
+	vhost_poll_init(&n->poll[0], handle_rx_net, POLLIN, &n->vqs[0]);
+
+	for (i = 1; i < nvqs; i++) {
+		vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
+				&n->vqs[i]);
+		n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
+	}
 
 	return 0;
+
+err:
+	/* Free all pointers that may have been allocated */
+	vhost_free_vqs(dev);
+
+	return ret;
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int ret = -ENOMEM;
+
+	if (n) {
+		struct vhost_dev *dev = &n->dev;
+
+		f->private_data = n;
+		mutex_init(&dev->mutex);
+
+		/* Defer all other initialization till user does SET_OWNER */
+		ret = 0;
+	}
+
+	return ret;
 }
 
 static void vhost_net_disable_vq(struct vhost_net *n,
 				 struct vhost_virtqueue *vq)
 {
+	int qnum = vq->qnum;
+
 	if (!vq->private_data)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		tx_poll_stop(n);
-		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	if (qnum) {	/* TX */
+		tx_poll_stop(n, qnum);
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(&n->poll[qnum]);
 }
 
 static void vhost_net_enable_vq(struct vhost_net *n,
 				struct vhost_virtqueue *vq)
 {
 	struct socket *sock = vq->private_data;
+	int qnum = vq->qnum;
+
 	if (!sock)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
-		tx_poll_start(n, sock);
+
+	if (qnum) {	/* TX */
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
+		tx_poll_start(n, sock, qnum);
 	} else
-		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+		vhost_poll_start(&n->poll[qnum], sock->file);
 }
 
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
@@ -605,11 +682,12 @@ static struct socket *vhost_net_stop_vq(
 	return sock;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n, struct socket **socks)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
 }
 
 static void vhost_net_flush_vq(struct vhost_net *n, int index)
@@ -620,26 +698,34 @@ static void vhost_net_flush_vq(struct vh
 
 static void vhost_net_flush(struct vhost_net *n)
 {
-	vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
-	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		vhost_net_flush_vq(n, i);
 }
 
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
+	struct vhost_dev *dev = &n->dev;
+	struct socket *socks[MAX_VQS];
+	int i;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n, socks);
 	vhost_net_flush(n);
-	vhost_dev_cleanup(&n->dev);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	vhost_dev_cleanup(dev);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (socks[i])
+			fput(socks[i]->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+
+	/* Free all old pointers */
+	vhost_free_vqs(dev);
+
 	kfree(n);
 	return 0;
 }
@@ -717,7 +803,7 @@ static long vhost_net_set_backend(struct
 	if (r)
 		goto err;
 
-	if (index >= VHOST_NET_VQ_MAX) {
+	if (index >= n->dev.nvqs) {
 		r = -ENOBUFS;
 		goto err;
 	}
@@ -762,22 +848,26 @@ err:
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
+	struct socket *socks[MAX_VQS];
 	long err;
+	int i;
+
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
-	if (err)
-		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	if (err) {
+		mutex_unlock(&n->dev.mutex);
+		return err;
+	}
+
+	vhost_net_stop(n, socks);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
-done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (socks[i])
+			fput(socks[i]->file);
+
 	return err;
 }
 
@@ -806,7 +896,7 @@ static int vhost_net_set_features(struct
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+	for (i = 0; i < n->dev.nvqs; ++i) {
 		mutex_lock(&n->vqs[i].mutex);
 		n->vqs[i].vhost_hlen = vhost_hlen;
 		n->vqs[i].sock_hlen = sock_hlen;
diff -ruNp org/drivers/vhost/vhost.c tx_only/drivers/vhost/vhost.c
--- org/drivers/vhost/vhost.c	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/vhost.c	2010-09-08 10:20:54.000000000 +0530
@@ -62,14 +62,14 @@ static int vhost_poll_wakeup(wait_queue_
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	struct vhost_work *work = &poll->work;
 
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->vq = vq;
 
 	INIT_LIST_HEAD(&work->node);
 	work->fn = fn;
@@ -104,35 +104,35 @@ void vhost_poll_flush(struct vhost_poll 
 	int left;
 	int flushing;
 
-	spin_lock_irq(&poll->dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&poll->dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	wait_event(work->done, ({
-		   spin_lock_irq(&poll->dev->work_lock);
+		   spin_lock_irq(poll->vq->work_lock);
 		   left = seq - work->done_seq <= 0;
-		   spin_unlock_irq(&poll->dev->work_lock);
+		   spin_unlock_irq(poll->vq->work_lock);
 		   left;
 	}));
-	spin_lock_irq(&poll->dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&poll->dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	BUG_ON(flushing < 0);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	struct vhost_dev *dev = poll->dev;
+	struct vhost_virtqueue *vq = poll->vq;
 	struct vhost_work *work = &poll->work;
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(vq->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, vq->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(vq->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(vq->work_lock, flags);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -163,7 +163,7 @@ static void vhost_vq_reset(struct vhost_
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_virtqueue *vq = data;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -171,7 +171,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(vq->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -179,18 +179,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(vq->work_lock);
 			__set_current_state(TASK_RUNNING);
 			return 0;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(vq->work_list)) {
+			work = list_first_entry(vq->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(vq->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -213,17 +213,24 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		spin_lock_init(&dev->work_lock[i]);
+		INIT_LIST_HEAD(&dev->work_list[i]);
+
+		vq->work_lock = &dev->work_lock[i];
+		vq->work_list = &dev->work_list[i];
+
+		vq->worker = NULL;
+		vq->dev = dev;
+		vq->qnum = i;
+		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
+		if (vq->handle_kick)
+			vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN,
+					vq);
 	}
 
 	return 0;
@@ -236,38 +243,76 @@ long vhost_dev_check_owner(struct vhost_
 	return dev->mm == current->mm ? 0 : -EPERM;
 }
 
+static void vhost_stop_workers(struct vhost_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < dev->nvqs; i++) {
+		WARN_ON(!list_empty(dev->vqs[i].work_list));
+		kthread_stop(dev->vqs[i].worker);
+	}
+}
+
+static int vhost_start_workers(struct vhost_dev *dev)
+{
+	int i, err = 0;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		vq->worker = kthread_create(vhost_worker, vq, "vhost-%d-%d",
+					    current->pid, i);
+		if (IS_ERR(vq->worker)) {
+			err = PTR_ERR(vq->worker);
+			i--;	/* no thread to clean up at this index */
+			goto err;
+		}
+
+		/* avoid contributing to loadavg */
+		err = cgroup_attach_task_current_cg(vq->worker);
+		if (err)
+			goto err;
+
+		wake_up_process(vq->worker);
+	}
+
+	return 0;
+
+err:
+	for (; i >= 0; i--)
+		kthread_stop(dev->vqs[i].worker);
+
+	return err;
+}
+
 /* Caller should have device mutex */
-static long vhost_dev_set_owner(struct vhost_dev *dev)
+static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
 {
-	struct task_struct *worker;
 	int err;
 	/* Is there an owner already? */
 	if (dev->mm) {
 		err = -EBUSY;
-		goto err_mm;
-	}
-	/* No owner, become one */
-	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
+	} else {
+		err = vhost_setup_vqs(dev, numtxqs);
+		if (err)
+			goto out;
+
+		/* No owner, become one */
+		dev->mm = get_task_mm(current);
+
+		/* Start daemons */
+		err =  vhost_start_workers(dev);
+
+		if (err) {
+			vhost_free_vqs(dev);
+			if (dev->mm) {
+				mmput(dev->mm);
+				dev->mm = NULL;
+			}
+		}
 	}
 
-	dev->worker = worker;
-	err = cgroup_attach_task_current_cg(worker);
-	if (err)
-		goto err_cgroup;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
-
-	return 0;
-err_cgroup:
-	kthread_stop(worker);
-err_worker:
-	if (dev->mm)
-		mmput(dev->mm);
-	dev->mm = NULL;
-err_mm:
+out:
 	return err;
 }
 
@@ -322,8 +367,7 @@ void vhost_dev_cleanup(struct vhost_dev 
 		mmput(dev->mm);
 	dev->mm = NULL;
 
-	WARN_ON(!list_empty(&dev->work_list));
-	kthread_stop(dev->worker);
+	vhost_stop_workers(dev);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -674,7 +718,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
-		r = vhost_dev_set_owner(d);
+		r = vhost_dev_set_owner(d, arg);
 		goto done;
 	}
 
diff -ruNp org/drivers/vhost/vhost.h tx_only/drivers/vhost/vhost.h
--- org/drivers/vhost/vhost.h	2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/vhost.h	2010-09-08 10:20:54.000000000 +0530
@@ -40,11 +40,11 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_virtqueue	  *vq;  /* points back to vq */
 };
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue *vq);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -110,6 +110,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	struct task_struct *worker; /* vhost for this vq, shared btwn RX/TX */
+	spinlock_t *work_lock;
+	struct list_head *work_list;
+	int qnum;	/* 0 for RX, 1 -> n-1 for TX */
 };
 
 struct vhost_dev {
@@ -124,11 +128,12 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	spinlock_t *work_lock;
+	struct list_head *work_list;
 };
 
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
+void vhost_free_vqs(struct vhost_dev *dev);
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC PATCH 4/4] qemu changes
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (2 preceding siblings ...)
  2010-09-08  7:29 ` [RFC PATCH 3/4] Changes for vhost Krishna Kumar
@ 2010-09-08  7:29 ` Krishna Kumar
  2010-09-08  7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:29 UTC (permalink / raw)
  To: rusty, davem; +Cc: anthony, netdev, Krishna Kumar, kvm, mst

Changes in qemu to support mq TX.
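
The number of TX queues flows through qemu roughly as follows
(a sketch of the call chain touched by this patch):

	tap/netdev options  ->  nc->numtxqs                 (net.c, net/tap.c)
	virtio_net_init()   ->  n->numtxqs = conf->peer->numtxqs;
	                        per-queue tx_vq[]/tx_timer[] arrays
	vhost_net_init()    ->  vhost_dev_init(&net->dev, devfd, numtxqs);
	vhost_dev_init()    ->  ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
	                        hdev->nvqs = numtxqs + 1    (1 RX + numtxqs TX)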

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 hw/vhost.c      |    8 ++-
 hw/vhost.h      |    2 
 hw/vhost_net.c  |   16 +++++--
 hw/vhost_net.h  |    2 
 hw/virtio-net.c |   97 ++++++++++++++++++++++++++++++----------------
 hw/virtio-net.h |    5 ++
 hw/virtio-pci.c |    2 
 net.c           |   17 ++++++++
 net.h           |    1 
 net/tap.c       |   61 +++++++++++++++++++++-------
 10 files changed, 155 insertions(+), 56 deletions(-)

diff -ruNp org/hw/vhost.c new/hw/vhost.c
--- org/hw/vhost.c	2010-08-09 09:51:58.000000000 +0530
+++ new/hw/vhost.c	2010-09-08 12:54:50.000000000 +0530
@@ -599,23 +599,27 @@ static void vhost_virtqueue_cleanup(stru
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs)
 {
     uint64_t features;
     int r;
     if (devfd >= 0) {
         hdev->control = devfd;
+        hdev->nvqs = 2;
     } else {
         hdev->control = open("/dev/vhost-net", O_RDWR);
         if (hdev->control < 0) {
             return -errno;
         }
     }
-    r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+
+    r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
     if (r < 0) {
         goto fail;
     }
 
+    hdev->nvqs = numtxqs + 1;
+
     r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
     if (r < 0) {
         goto fail;
diff -ruNp org/hw/vhost.h new/hw/vhost.h
--- org/hw/vhost.h	2010-07-01 11:42:09.000000000 +0530
+++ new/hw/vhost.h	2010-09-08 12:54:50.000000000 +0530
@@ -40,7 +40,7 @@ struct vhost_dev {
     unsigned long long log_size;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int nvqs);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff -ruNp org/hw/vhost_net.c new/hw/vhost_net.c
--- org/hw/vhost_net.c	2010-08-09 09:51:58.000000000 +0530
+++ new/hw/vhost_net.c	2010-09-08 12:54:50.000000000 +0530
@@ -36,7 +36,8 @@
 
 struct vhost_net {
     struct vhost_dev dev;
-    struct vhost_virtqueue vqs[2];
+    struct vhost_virtqueue *vqs;
+    int nvqs;
     int backend;
     VLANClientState *vc;
 };
@@ -76,7 +77,8 @@ static int vhost_net_get_fd(VLANClientSt
     }
 }
 
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int numtxqs)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -93,10 +95,14 @@ struct vhost_net *vhost_net_init(VLANCli
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
 
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, numtxqs);
     if (r < 0) {
         goto fail;
     }
+
+    net->nvqs = numtxqs + 1;
+    net->vqs = qemu_malloc(net->nvqs * (sizeof *net->vqs));
+
     if (~net->dev.features & net->dev.backend_features) {
         fprintf(stderr, "vhost lacks feature mask %" PRIu64 " for backend\n",
                 (uint64_t)(~net->dev.features & net->dev.backend_features));
@@ -118,7 +124,6 @@ int vhost_net_start(struct vhost_net *ne
     struct vhost_vring_file file = { };
     int r;
 
-    net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
     r = vhost_dev_start(&net->dev, dev);
     if (r < 0) {
@@ -166,7 +171,8 @@ void vhost_net_cleanup(struct vhost_net 
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int nvqs)
 {
 	return NULL;
 }
diff -ruNp org/hw/vhost_net.h new/hw/vhost_net.h
--- org/hw/vhost_net.h	2010-07-01 11:42:09.000000000 +0530
+++ new/hw/vhost_net.h	2010-09-08 12:54:50.000000000 +0530
@@ -6,7 +6,7 @@
 struct vhost_net;
 typedef struct vhost_net VHostNetState;
 
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, int nvqs);
 
 int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
diff -ruNp org/hw/virtio-net.c new/hw/virtio-net.c
--- org/hw/virtio-net.c	2010-07-19 12:41:28.000000000 +0530
+++ new/hw/virtio-net.c	2010-09-08 12:54:50.000000000 +0530
@@ -32,17 +32,17 @@ typedef struct VirtIONet
     uint8_t mac[ETH_ALEN];
     uint16_t status;
     VirtQueue *rx_vq;
-    VirtQueue *tx_vq;
+    VirtQueue **tx_vq;
     VirtQueue *ctrl_vq;
     NICState *nic;
-    QEMUTimer *tx_timer;
-    int tx_timer_active;
+    QEMUTimer **tx_timer;
+    int *tx_timer_active;
     uint32_t has_vnet_hdr;
     uint8_t has_ufo;
     struct {
         VirtQueueElement elem;
         ssize_t len;
-    } async_tx;
+    } *async_tx;
     int mergeable_rx_bufs;
     uint8_t promisc;
     uint8_t allmulti;
@@ -61,6 +61,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint16_t numtxqs;
 } VirtIONet;
 
 /* TODO
@@ -78,6 +79,7 @@ static void virtio_net_get_config(VirtIO
     struct virtio_net_config netcfg;
 
     netcfg.status = n->status;
+    netcfg.numtxqs = n->numtxqs;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -162,6 +164,8 @@ static uint32_t virtio_net_get_features(
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    if (n->numtxqs > 1)
+        features |= (1 << VIRTIO_NET_F_NUMTXQS);
 
     if (peer_has_vnet_hdr(n)) {
         tap_using_vnet_hdr(n->nic->nc.peer, 1);
@@ -625,13 +629,16 @@ static void virtio_net_tx_complete(VLANC
 {
     VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    /*
+     * If this function executes, we are single TX and hence use only txq[0]
+     */
+    virtqueue_push(n->tx_vq[0], &n->async_tx[0].elem, n->async_tx[0].len);
+    virtio_notify(&n->vdev, n->tx_vq[0]);
 
-    n->async_tx.elem.out_num = n->async_tx.len = 0;
+    n->async_tx[0].elem.out_num = n->async_tx[0].len = 0;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 /* TX */
@@ -642,8 +649,8 @@ static void virtio_net_flush_tx(VirtIONe
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    if (n->async_tx[0].elem.out_num) {
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         return;
     }
 
@@ -678,9 +685,9 @@ static void virtio_net_flush_tx(VirtIONe
         ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
                                       virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
-            n->async_tx.elem = elem;
-            n->async_tx.len  = len;
+            virtio_queue_set_notification(n->tx_vq[0], 0);
+            n->async_tx[0].elem = elem;
+            n->async_tx[0].len  = len;
             return;
         }
 
@@ -695,15 +702,15 @@ static void virtio_net_handle_tx(VirtIOD
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
+    if (n->tx_timer_active[0]) {
         virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_timer_active = 0;
+        qemu_del_timer(n->tx_timer[0]);
+        n->tx_timer_active[0] = 0;
         virtio_net_flush_tx(n, vq);
     } else {
-        qemu_mod_timer(n->tx_timer,
+        qemu_mod_timer(n->tx_timer[0],
                        qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-        n->tx_timer_active = 1;
+        n->tx_timer_active[0] = 1;
         virtio_queue_set_notification(vq, 0);
     }
 }
@@ -712,18 +719,19 @@ static void virtio_net_tx_timer(void *op
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
+    n->tx_timer_active[0] = 0;
 
     /* Just in case the driver is not ready on more */
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
+    int i;
     VirtIONet *n = opaque;
 
     if (n->vhost_started) {
@@ -735,7 +743,9 @@ static void virtio_net_save(QEMUFile *f,
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, n->tx_timer_active);
+    qemu_put_be16(f, n->numtxqs);
+    for (i = 0; i < n->numtxqs; i++)
+        qemu_put_be32(f, n->tx_timer_active[i]);
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -764,7 +774,9 @@ static int virtio_net_load(QEMUFile *f, 
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    n->tx_timer_active = qemu_get_be32(f);
+    n->numtxqs = qemu_get_be16(f);
+    for (i = 0; i < n->numtxqs; i++)
+        n->tx_timer_active[i] = qemu_get_be32(f);
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
     if (version_id >= 3)
@@ -840,9 +852,10 @@ static int virtio_net_load(QEMUFile *f, 
     }
     n->mac_table.first_multi = i;
 
-    if (n->tx_timer_active) {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
+    for (i = 0; i < n->numtxqs; i++) {
+        if (n->tx_timer_active[i])
+            qemu_mod_timer(n->tx_timer[i],
+                           qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
     }
     return 0;
 }
@@ -905,12 +918,15 @@ static void virtio_net_vmstate_change(vo
 
 VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
 {
+    int i;
     VirtIONet *n;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
                                         sizeof(VirtIONet));
 
+    n->numtxqs = conf->peer->numtxqs;
+
     n->vdev.get_config = virtio_net_get_config;
     n->vdev.set_config = virtio_net_set_config;
     n->vdev.get_features = virtio_net_get_features;
@@ -918,8 +934,24 @@ VirtIODevice *virtio_net_init(DeviceStat
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
+
     n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+    n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
+    n->tx_timer = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer));
+    n->tx_timer_active = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer_active));
+    n->async_tx = qemu_mallocz(n->numtxqs * sizeof(*n->async_tx));
+
+    /* Allocate per tx vq's */
+    for (i = 0; i < n->numtxqs; i++) {
+        n->tx_vq[i] = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+        /* setup timer per tx vq */
+        n->tx_timer[i] = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
+        n->tx_timer_active[i] = 0;
+    }
+
+    /* Allocate control vq */
     n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
     qemu_macaddr_default_if_unset(&conf->macaddr);
     memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
@@ -929,8 +961,6 @@ VirtIODevice *virtio_net_init(DeviceStat
 
     qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
 
@@ -948,6 +978,7 @@ VirtIODevice *virtio_net_init(DeviceStat
 
 void virtio_net_exit(VirtIODevice *vdev)
 {
+    int i;
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
     qemu_del_vm_change_state_handler(n->vmstate);
 
@@ -962,8 +993,10 @@ void virtio_net_exit(VirtIODevice *vdev)
     qemu_free(n->mac_table.macs);
     qemu_free(n->vlans);
 
-    qemu_del_timer(n->tx_timer);
-    qemu_free_timer(n->tx_timer);
+    for (i = 0; i < n->numtxqs; i++) {
+        qemu_del_timer(n->tx_timer[i]);
+        qemu_free_timer(n->tx_timer[i]);
+    }
 
     virtio_cleanup(&n->vdev);
     qemu_del_vlan_client(&n->nic->nc);
diff -ruNp org/hw/virtio-net.h new/hw/virtio-net.h
--- org/hw/virtio-net.h	2010-07-01 11:42:09.000000000 +0530
+++ new/hw/virtio-net.h	2010-09-08 12:54:50.000000000 +0530
@@ -22,6 +22,9 @@
 
 /* from Linux's virtio_net.h */
 
+/* The maximum number of transmit (& separate receive) queues supported */
+#define VIRTIO_MAX_TXQS		16
+
 /* The ID for virtio_net */
 #define VIRTIO_ID_NET   1
 
@@ -44,6 +47,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS    21      /* Supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -58,6 +62,7 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+    uint16_t numtxqs;	/* number of transmit queues */
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org/hw/virtio-pci.c new/hw/virtio-pci.c
--- org/hw/virtio-pci.c	2010-09-08 12:46:36.000000000 +0530
+++ new/hw/virtio-pci.c	2010-09-08 12:54:50.000000000 +0530
@@ -99,6 +99,7 @@ typedef struct {
     uint32_t addr;
     uint32_t class_code;
     uint32_t nvectors;
+    uint32_t mq;
     BlockConf block;
     NICConf nic;
     uint32_t host_features;
@@ -722,6 +723,7 @@ static PCIDeviceInfo virtio_info[] = {
         .romfile    = "pxe-virtio.bin",
         .qdev.props = (Property[]) {
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+	    DEFINE_PROP_UINT32("mq", VirtIOPCIProxy, mq, 1),
             DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
             DEFINE_PROP_END_OF_LIST(),
diff -ruNp org/net/tap.c new/net/tap.c
--- org/net/tap.c	2010-07-01 11:42:09.000000000 +0530
+++ new/net/tap.c	2010-09-08 12:54:50.000000000 +0530
@@ -249,7 +249,7 @@ void tap_set_offload(VLANClientState *nc
 {
     TAPState *s = DO_UPCAST(TAPState, nc, nc);
 
-    return tap_fd_set_offload(s->fd, csum, tso4, tso6, ecn, ufo);
+    tap_fd_set_offload(s->fd, csum, tso4, tso6, ecn, ufo);
 }
 
 static void tap_cleanup(VLANClientState *nc)
@@ -262,8 +262,9 @@ static void tap_cleanup(VLANClientState 
 
     qemu_purge_queued_packets(nc);
 
-    if (s->down_script[0])
+    if (s->down_script[0]) {
         launch_script(s->down_script, s->down_script_arg, s->fd);
+    }
 
     tap_read_poll(s, 0);
     tap_write_poll(s, 0);
@@ -299,13 +300,14 @@ static NetClientInfo net_tap_info = {
 static TAPState *net_tap_fd_init(VLANState *vlan,
                                  const char *model,
                                  const char *name,
-                                 int fd,
+                                 int fd, int numtxqs,
                                  int vnet_hdr)
 {
     VLANClientState *nc;
     TAPState *s;
 
     nc = qemu_new_net_client(&net_tap_info, vlan, NULL, model, name);
+    nc->numtxqs = numtxqs;
 
     s = DO_UPCAST(TAPState, nc, nc);
 
@@ -368,6 +370,7 @@ static int net_tap_init(QemuOpts *opts, 
     int fd, vnet_hdr_required;
     char ifname[128] = {0,};
     const char *setup_script;
+    int launch = 0;
 
     if (qemu_opt_get(opts, "ifname")) {
         pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, "ifname"));
@@ -380,29 +383,57 @@ static int net_tap_init(QemuOpts *opts, 
         vnet_hdr_required = 0;
     }
 
-    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required));
-    if (fd < 0) {
-        return -1;
-    }
-
     setup_script = qemu_opt_get(opts, "script");
     if (setup_script &&
         setup_script[0] != '\0' &&
-        strcmp(setup_script, "no") != 0 &&
-        launch_script(setup_script, ifname, fd)) {
-        close(fd);
+        strcmp(setup_script, "no") != 0) {
+         launch = 1;
+    }
+
+    TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr,
+                          vnet_hdr_required));
+    if (fd < 0) {
         return -1;
     }
 
+    if (launch && launch_script(setup_script, ifname, fd))
+        goto err;
+
     qemu_opt_set(opts, "ifname", ifname);
 
     return fd;
+
+err:
+    close(fd);
+
+    return -1;
 }
 
 int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan)
 {
     TAPState *s;
     int fd, vnet_hdr = 0;
+    int vhost;
+    int numtxqs = 1;
+
+    vhost = qemu_opt_get_bool(opts, "vhost", 0);
+
+    /*
+     * We support multiple tx queues if:
+     *      1. smp > 1
+     *      2. vhost=on
+     *      3. mq=on
+     * In this case, #txqueues = #cpus. This value can be changed by
+     * using the "numtxqs" option.
+     */
+    if (vhost && smp_cpus > 1) {
+        if (qemu_opt_get_bool(opts, "mq", 0)) {
+#define VIRTIO_MAX_TXQS		16
+            int dflt = MIN(smp_cpus, VIRTIO_MAX_TXQS);
+
+            numtxqs = qemu_opt_get_number(opts, "numtxqs", dflt);
+        }
+    }
 
     if (qemu_opt_get(opts, "fd")) {
         if (qemu_opt_get(opts, "ifname") ||
@@ -436,14 +467,14 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
+    s = net_tap_fd_init(vlan, "tap", name, fd, numtxqs, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
     }
 
     if (tap_set_sndbuf(s->fd, opts) < 0) {
-        return -1;
+            return -1;
     }
 
     if (qemu_opt_get(opts, "fd")) {
@@ -465,7 +496,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+    if (vhost) {
         int vhostfd, r;
         if (qemu_opt_get(opts, "vhostfd")) {
             r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
@@ -476,7 +507,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         } else {
             vhostfd = -1;
         }
-        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd, numtxqs);
         if (!s->vhost_net) {
             error_report("vhost-net requested but could not be initialized");
             return -1;
diff -ruNp org/net.c new/net.c
--- org/net.c	2010-09-08 12:46:36.000000000 +0530
+++ new/net.c	2010-09-08 12:54:50.000000000 +0530
@@ -814,6 +814,15 @@ static int net_init_nic(QemuOpts *opts,
         return -1;
     }
 
+    if (nd->netdev->numtxqs > 1 && nd->nvectors == DEV_NVECTORS_UNSPECIFIED) {
+        /*
+         * User specified mq for guest, but no "vectors=", tune
+         * it automatically to 'numtxqs' TX + 1 RX + 1 controlq.
+         */
+        nd->nvectors = nd->netdev->numtxqs + 1 + 1;
+        monitor_printf(mon, "nvectors tuned to %d\n", nd->nvectors);
+    }
+
     nd->used = 1;
     nb_nics++;
 
@@ -957,6 +966,14 @@ static const struct {
             },
 #ifndef _WIN32
             {
+                .name = "mq",
+                .type = QEMU_OPT_BOOL,
+                .help = "enable multiqueue on network i/f",
+            }, {
+                .name = "numtxqs",
+                .type = QEMU_OPT_NUMBER,
+                .help = "optional number of TX queues, if mq is enabled",
+            }, {
                 .name = "fd",
                 .type = QEMU_OPT_STRING,
                 .help = "file descriptor of an already opened tap",
diff -ruNp org/net.h new/net.h
--- org/net.h	2010-07-01 11:42:09.000000000 +0530
+++ new/net.h	2010-09-08 12:54:50.000000000 +0530
@@ -62,6 +62,7 @@ struct VLANClientState {
     struct VLANState *vlan;
     VLANClientState *peer;
     NetQueue *send_queue;
+    int numtxqs;
     char *model;
     char *name;
     char info_str[256];


* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (3 preceding siblings ...)
  2010-09-08  7:29 ` [RFC PATCH 4/4] qemu changes Krishna Kumar
@ 2010-09-08  7:47 ` Avi Kivity
  2010-09-08  9:22   ` Krishna Kumar2
  2010-09-08  8:10 ` Michael S. Tsirkin
  2010-09-08  8:13 ` Michael S. Tsirkin
  6 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-09-08  7:47 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony, mst

  On 09/08/2010 10:28 AM, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net.  Also
> included is the user qemu changes.
>
> 1. This feature was first implemented with a single vhost.
>     Testing showed 3-8% performance gain for upto 8 netperf
>     sessions (and sometimes 16), but BW dropped with more
>     sessions.  However, implementing per-txq vhost improved
>     BW significantly all the way to 128 sessions.

Why were vhost kernel changes required?  Can't you just instantiate more 
vhost queues?

> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
>     daemons for the 'n' TXQ's, for a total of (n+1) daemons.
>     The (subsequent) RX mq patch changes that to a total of
>     'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
>     improves for UDP.
> 4. Interoperability: Many combinations, but not all, of
>     qemu, host, guest tested together.

Please update the virtio-pci spec @ http://ozlabs.org/~rusty/virtio-spec/.

>
>                    Enabling mq on virtio:
>                    -----------------------
>
> When following options are passed to qemu:
>          - smp>  1
>          - vhost=on
>          - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using
> an optional 'numtxqs' option. e.g.  for a smp=4 guest:
>          vhost=on,mq=on             ->    #txqueues = 4
>          vhost=on,mq=on,numtxqs=8   ->    #txqueues = 8
>          vhost=on,mq=on,numtxqs=2   ->    #txqueues = 2
>
>
>                     Performance (guest ->  local host):
>                     -----------------------------------
>
> System configuration:
>          Host:  8 Intel Xeon, 8 GB memory
>          Guest: 4 cpus, 2 GB memory
> All testing without any tuning, and TCP netperf with 64K I/O
> _______________________________________________________________________________
>                             TCP (#numtxqs=2)
> N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> _______________________________________________________________________________
> 4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
> 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
> _______________________________________________________________________________
>                         UDP (#numtxqs=8)
> N#      BW1     BW2   (%)      SD1     SD2   (%)
> __________________________________________________________
> 4       29836   56761 (90.24)   67      63    (-5.97)
> 8       27666   63767 (130.48)  326     265   (-18.71)
> 16      25452   60665 (138.35)  1396    1269  (-9.09)
> 32      26172   63491 (142.59)  5617    4202  (-25.19)
> 48      26146   64629 (147.18)  12813   9316  (-27.29)
> 64      25575   65448 (155.90)  23063   16346 (-29.12)
> 128     26454   63772 (141.06)  91054   85051 (-6.59)

Impressive results.

> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
>                SD for original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
>                SD for new code. e.g. BW2=40716 means average BW2 was
>                20358 mbps.
>
>
>                         Next steps:
>                         -----------
>
> 1. mq RX patch is also complete - plan to submit once TX is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>     after making the sq's (and similarly for vhost) cache-aligned
>     statically:
>          struct virtnet_info {
>                  ...
>                  struct send_queue sq[16] ____cacheline_aligned_in_smp;
>                  ...
>          };
>
> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts
>        CPU0     CPU1     CPU2    CPU3
> 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

How are vhost threads and host interrupts distributed?  We need to move 
vhost queue threads to be colocated with the related vcpu threads (if no 
extra cores are available) or on the same socket (if extra cores are 
available).  Similarly, move device interrupts to the same core as the 
vhost thread.
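
Concretely, a minimal sketch of doing that pinning from the host (for
illustration only; the thread id, irq number and cpu are placeholders,
and "taskset -p" plus writing /proc/irq/<n>/smp_affinity do the same
thing from a shell):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin a vhost (or vcpu) thread to a single host CPU. */
int pin_task(pid_t tid, int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(tid, sizeof(set), &set);
}

/* Route a device interrupt to the same CPU. */
int pin_irq(int irq, int cpu)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%x\n", 1 << cpu);
        fclose(f);
        return 0;
}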



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (4 preceding siblings ...)
  2010-09-08  7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
@ 2010-09-08  8:10 ` Michael S. Tsirkin
  2010-09-08  9:23   ` Krishna Kumar2
                     ` (2 more replies)
  2010-09-08  8:13 ` Michael S. Tsirkin
  6 siblings, 3 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08  8:10 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony

On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net.  Also
> included is the user qemu changes.
> 
> 1. This feature was first implemented with a single vhost.
>    Testing showed 3-8% performance gain for upto 8 netperf
>    sessions (and sometimes 16), but BW dropped with more
>    sessions.  However, implementing per-txq vhost improved
>    BW significantly all the way to 128 sessions.
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
>    daemons for the 'n' TXQ's, for a total of (n+1) daemons.
>    The (subsequent) RX mq patch changes that to a total of
>    'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
>    improves for UDP.
> 4. Interoperability: Many combinations, but not all, of
>    qemu, host, guest tested together.
> 
> 
>                   Enabling mq on virtio:
>                   -----------------------
> 
> When following options are passed to qemu:
>         - smp > 1
>         - vhost=on
>         - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using
> an optional 'numtxqs' option. e.g.  for a smp=4 guest:
>         vhost=on,mq=on             ->   #txqueues = 4
>         vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
>         vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> 
> 
>                    Performance (guest -> local host):
>                    -----------------------------------
> 
> System configuration:
>         Host:  8 Intel Xeon, 8 GB memory
>         Guest: 4 cpus, 2 GB memory
> All testing without any tuning, and TCP netperf with 64K I/O
> _______________________________________________________________________________
>                            TCP (#numtxqs=2)
> N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> _______________________________________________________________________________
> 4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
> 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)

That's a significant hit in TCP SD. Is it caused by the imbalance between
number of queues for TX and RX? Since you mention RX is complete,
maybe measure with a balanced TX/RX?


> _______________________________________________________________________________
>                        UDP (#numtxqs=8)
> N#      BW1     BW2   (%)      SD1     SD2   (%)
> __________________________________________________________
> 4       29836   56761 (90.24)   67      63    (-5.97)
> 8       27666   63767 (130.48)  326     265   (-18.71)
> 16      25452   60665 (138.35)  1396    1269  (-9.09)
> 32      26172   63491 (142.59)  5617    4202  (-25.19)
> 48      26146   64629 (147.18)  12813   9316  (-27.29)
> 64      25575   65448 (155.90)  23063   16346 (-29.12)
> 128     26454   63772 (141.06)  91054   85051 (-6.59)
> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
>               SD for original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
>               SD for new code. e.g. BW2=40716 means average BW2 was
>               20358 mbps.
> 

What happens with a single netperf?
host -> guest performance with TCP and small packet speed
are also worth measuring.


>                        Next steps:
>                        -----------
> 
> 1. mq RX patch is also complete - plan to submit once TX is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>    after making the sq's (and similarly for vhost) cache-aligned
>    statically:
>         struct virtnet_info {
>                 ...
>                 struct send_queue sq[16] ____cacheline_aligned_in_smp;
>                 ...
>         };
> 

At some level, host/guest communication is easy in that we don't really
care which queue is used.  I would like to give some thought (and
testing) to how is this going to work with a real NIC card and packet
steering at the backend.
Any idea?

> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts 
>       CPU0     CPU1     CPU2    CPU3       
> 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

Does this mean each interrupt is constantly bouncing between CPUs?

> Review/feedback appreciated.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---


* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (5 preceding siblings ...)
  2010-09-08  8:10 ` Michael S. Tsirkin
@ 2010-09-08  8:13 ` Michael S. Tsirkin
  2010-09-08  9:28   ` Krishna Kumar2
  6 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08  8:13 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony

On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> 1. mq RX patch is also complete - plan to submit once TX is OK.

It's good that you split patches, I think it would be interesting to see
the RX patches at least once to complete the picture.
You could make it a separate patchset, tag them as RFC.

-- 
MST


* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
@ 2010-09-08  9:22   ` Krishna Kumar2
  2010-09-08  9:28     ` Avi Kivity
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08  9:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: anthony, davem, kvm, mst, netdev, rusty

Avi Kivity <avi@redhat.com> wrote on 09/08/2010 01:17:34 PM:

>   On 09/08/2010 10:28 AM, Krishna Kumar wrote:
> > Following patches implement Transmit mq in virtio-net.  Also
> > included is the user qemu changes.
> >
> > 1. This feature was first implemented with a single vhost.
> >     Testing showed 3-8% performance gain for upto 8 netperf
> >     sessions (and sometimes 16), but BW dropped with more
> >     sessions.  However, implementing per-txq vhost improved
> >     BW significantly all the way to 128 sessions.
>
> Why were vhost kernel changes required?  Can't you just instantiate more
> vhost queues?

I did try using a single thread processing packets from multiple
vq's on host, but the BW dropped beyond a certain number of
sessions. I don't have the code and performance numbers for that
right now since it is a bit ancient; I can try to resuscitate
that if you want.

> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> >        CPU0     CPU1     CPU2    CPU3
> > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
>
> How are vhost threads and host interrupts distributed?  We need to move
> vhost queue threads to be colocated with the related vcpu threads (if no
> extra cores are available) or on the same socket (if extra cores are
> available).  Similarly, move device interrupts to the same core as the
> vhost thread.

All my testing was without any tuning, including binding netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
the above might give better results? Are you suggesting this
combination:
	IRQ on guest:
		40: CPU0
		41: CPU1
		42: CPU2
		43: CPU3 (all CPUs are on socket #0)
	vhost:
		thread #0:  CPU0
		thread #1:  CPU1
		thread #2:  CPU2
		thread #3:  CPU3
	qemu:
		thread #0:  CPU4
		thread #1:  CPU5
		thread #2:  CPU6
		thread #3:  CPU7 (all CPUs are on socket#1)
	netperf/netserver:
		Run on CPUs 0-4 on both sides

The reason I did not optimize anything from user space is that
I felt it was important to show that the defaults work reasonably well.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  8:10 ` Michael S. Tsirkin
@ 2010-09-08  9:23   ` Krishna Kumar2
  2010-09-08 10:48     ` Michael S. Tsirkin
  2010-09-08 16:47   ` Krishna Kumar2
       [not found]   ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
  2 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08  9:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty, rick.jones2

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:

> > _______________________________________________________________________________
> >                            TCP (#numtxqs=2)
> > N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> > _______________________________________________________________________________
> > 4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
> > 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> > 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> > 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> > 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> > 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> > 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
>
> That's a significant hit in TCP SD. Is it caused by the imbalance between
> number of queues for TX and RX? Since you mention RX is complete,
> maybe measure with a balanced TX/RX?

Yes, I am not sure why it is so high. I found the same with #RX=#TX
too. As a hack, I tried ixgbe without MQ (set "indices=1" before
calling alloc_etherdev_mq, not sure if that is entirely correct) -
here too SD worsened by around 40%. I can't explain it, since the
virtio-net driver runs lock free once sch_direct_xmit gets
HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not
strictly correct, since more threads are now running in parallel and the
load is higher? E.g., if you compare SD between #netperfs = 8 vs 16 for
the original code (cut-n-paste of the relevant columns only) ...

N#         BW        SD
8           24356   88
16         23587   375

... SD has increased more than 4 times for the same BW.

> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.

OK, I will do this and send the results later today.

> At some level, host/guest communication is easy in that we don't really
> care which queue is used.  I would like to give some thought (and
> testing) to how is this going to work with a real NIC card and packet
> steering at the backend.
> Any idea?

I have done a little testing with guest -> remote server both
using a bridge and with macvtap (mq is required only for rx).
I didn't understand what you mean by packet steering though,
is it whether packets go out of the NIC on different queues?
If so, I verified that is the case by putting a counter and
displaying through /debug interface on the host. dev_queue_xmit
on the host handles it by calling dev_pick_tx().
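
For reference, that selection path in the host's net core at the time
looks roughly like this (a simplified sketch from memory, not part of
these patches; details such as the socket's cached queue mapping are
omitted): if the driver supplies ndo_select_queue it decides, otherwise
a flow hash spreads packets across the real tx queues.

/* Simplified sketch of dev_pick_tx() in net/core/dev.c. */
static struct netdev_queue *dev_pick_tx(struct net_device *dev,
                                        struct sk_buff *skb)
{
        const struct net_device_ops *ops = dev->netdev_ops;
        u16 index = 0;

        if (dev->real_num_tx_queues > 1) {
                if (ops->ndo_select_queue)
                        index = ops->ndo_select_queue(dev, skb);
                else
                        index = skb_tx_hash(dev, skb);  /* flow hash */
        }
        skb_set_queue_mapping(skb, index);
        return netdev_get_tx_queue(dev, index);
}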

> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> >       CPU0     CPU1     CPU2    CPU3
> > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
>
> Does this mean each interrupt is constantly bouncing between CPUs?

Yes. I didn't do *any* tuning for the tests. The only "tuning"
was to use a 64K IO size with netperf. When I ran the default netperf
(16K), I got a somewhat smaller improvement in BW and worse(!) SD
than with 64K.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  9:22   ` Krishna Kumar2
@ 2010-09-08  9:28     ` Avi Kivity
  2010-09-08 10:17       ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-09-08  9:28 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev, rusty

  On 09/08/2010 12:22 PM, Krishna Kumar2 wrote:
> Avi Kivity<avi@redhat.com>  wrote on 09/08/2010 01:17:34 PM:
>
>>    On 09/08/2010 10:28 AM, Krishna Kumar wrote:
>>> Following patches implement Transmit mq in virtio-net.  Also
>>> included is the user qemu changes.
>>>
>>> 1. This feature was first implemented with a single vhost.
>>>      Testing showed 3-8% performance gain for upto 8 netperf
>>>      sessions (and sometimes 16), but BW dropped with more
>>>      sessions.  However, implementing per-txq vhost improved
>>>      BW significantly all the way to 128 sessions.
>> Why were vhost kernel changes required?  Can't you just instantiate more
>> vhost queues?
> I did try using a single thread processing packets from multiple
> vq's on host, but the BW dropped beyond a certain number of
> sessions.

Oh - so the interface has not changed (which can be seen from the 
patch).  That was my concern, I remembered that we planned for vhost-net 
to be multiqueue-ready.

The new guest and qemu code work with old vhost-net, just with reduced 
performance, yes?

> I don't have the code and performance numbers for that
> right now since it is a bit ancient, I can try to resuscitate
> that if you want.

No need.

>>> Guest interrupts for a 4 TXQ device after a 5 min test:
>>> # egrep "virtio0|CPU" /proc/interrupts
>>>         CPU0     CPU1     CPU2    CPU3
>>> 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
>>> 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
>>> 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
>>> 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
>>> 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
>>> 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
>> How are vhost threads and host interrupts distributed?  We need to move
>> vhost queue threads to be colocated with the related vcpu threads (if no
>> extra cores are available) or on the same socket (if extra cores are
>> available).  Similarly, move device interrupts to the same core as the
>> vhost thread.
> All my testing was without any tuning, including binding netperf&
> netserver (irqbalance is also off). I assume (maybe wrongly) that
> the above might give better results?

I hope so!

> Are you suggesting this
> combination:
> 	IRQ on guest:
> 		40: CPU0
> 		41: CPU1
> 		42: CPU2
> 		43: CPU3 (all CPUs are on socket #0)
> 	vhost:
> 		thread #0:  CPU0
> 		thread #1:  CPU1
> 		thread #2:  CPU2
> 		thread #3:  CPU3
> 	qemu:
> 		thread #0:  CPU4
> 		thread #1:  CPU5
> 		thread #2:  CPU6
> 		thread #3:  CPU7 (all CPUs are on socket#1)

May be better to put vcpu threads and vhost threads on the same socket.

Also need to affine host interrupts.

> 	netperf/netserver:
> 		Run on CPUs 0-4 on both sides
>
> The reason I did not optimize anything from user space is because
> I felt showing the default works reasonably well is important.

Definitely.  Heavy tuning is not a useful path for general end users.  
We need to make sure the scheduler is able to arrive at the optimal
layout without pinning (but perhaps with hints).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  8:13 ` Michael S. Tsirkin
@ 2010-09-08  9:28   ` Krishna Kumar2
  0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08  9:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty

Hi Michael,

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:43:26 PM:

> On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> > 1. mq RX patch is also complete - plan to submit once TX is OK.
>
> It's good that you split patches, I think it would be interesting to see
> the RX patches at least once to complete the picture.
> You could make it a separate patchset, tag them as RFC.

OK, I need to re-do some parts of it, since I started the TX only
branch a couple of weeks earlier and the RX side is outdated. I
will try to send that out in the next couple of days, as you say
it will help to complete the picture. Reasons for sending only the TX
part now:

- Reduce size of patch and complexity
- I didn't get much improvement with the multiple-RX patch (netperf from
  host -> guest), so I need some time to figure out the reason and
  fix it.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  9:28     ` Avi Kivity
@ 2010-09-08 10:17       ` Krishna Kumar2
  2010-09-08 14:12         ` Arnd Bergmann
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 10:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: anthony, davem, kvm, mst, netdev, rusty

Avi Kivity <avi@redhat.com> wrote on 09/08/2010 02:58:21 PM:

> >>> 1. This feature was first implemented with a single vhost.
> >>>      Testing showed 3-8% performance gain for upto 8 netperf
> >>>      sessions (and sometimes 16), but BW dropped with more
> >>>      sessions.  However, implementing per-txq vhost improved
> >>>      BW significantly all the way to 128 sessions.
> >> Why were vhost kernel changes required?  Can't you just instantiate
more
> >> vhost queues?
> > I did try using a single thread processing packets from multiple
> > vq's on host, but the BW dropped beyond a certain number of
> > sessions.
>
> Oh - so the interface has not changed (which can be seen from the
> patch).  That was my concern, I remembered that we planned for vhost-net
> to be multiqueue-ready.
>
> The new guest and qemu code work with old vhost-net, just with reduced
> performance, yes?

Yes, I have tested new guest/qemu with old vhost but using
#numtxqs=1 (or not passing any arguments at all to qemu to
enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
since vhost_net_set_backend in the unmodified vhost checks
for boundary overflow.
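
For context, that boundary check has roughly this shape (an illustrative
sketch, not the actual vhost source; the value 2 for rx + tx is an
assumption about the unmodified driver):

#include <errno.h>

#define VHOST_NET_VQ_MAX 2              /* rx + tx in the old driver */

int old_vhost_net_set_backend(unsigned index, int fd)
{
        if (index >= VHOST_NET_VQ_MAX)
                return -ENOBUFS;        /* what qemu hits for index >= 2 */

        (void)fd;                       /* would be attached to vq 'index' */
        return 0;
}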

I have also tested running an unmodified guest with new
vhost/qemu, but qemu should not specify numtxqs>1.

> > Are you suggesting this
> > combination:
> >    IRQ on guest:
> >       40: CPU0
> >       41: CPU1
> >       42: CPU2
> >       43: CPU3 (all CPUs are on socket #0)
> >    vhost:
> >       thread #0:  CPU0
> >       thread #1:  CPU1
> >       thread #2:  CPU2
> >       thread #3:  CPU3
> >    qemu:
> >       thread #0:  CPU4
> >       thread #1:  CPU5
> >       thread #2:  CPU6
> >       thread #3:  CPU7 (all CPUs are on socket#1)
>
> May be better to put vcpu threads and vhost threads on the same socket.
>
> Also need to affine host interrupts.
>
> >    netperf/netserver:
> >       Run on CPUs 0-4 on both sides
> >
> > The reason I did not optimize anything from user space is because
> > I felt showing the default works reasonably well is important.
>
> Definitely.  Heavy tuning is not a useful path for general end users.
> We need to make sure the the scheduler is able to arrive at the optimal
> layout without pinning (but perhaps with hints).

OK, I will see if I can get results with this.

Thanks for your suggestions,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  9:23   ` Krishna Kumar2
@ 2010-09-08 10:48     ` Michael S. Tsirkin
  2010-09-08 12:19       ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08 10:48 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty, rick.jones2

On Wed, Sep 08, 2010 at 02:53:03PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:
> 
> > > _______________________________________________________________________________
> > >                            TCP (#numtxqs=2)
> > > N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> > > _______________________________________________________________________________
> > > 4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
> > > 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> > > 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> > > 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> > > 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> > > 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> > > 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
> >
> > That's a significant hit in TCP SD. Is it caused by the imbalance between
> > number of queues for TX and RX? Since you mention RX is complete,
> > maybe measure with a balanced TX/RX?
> 
> Yes, I am not sure why it is so high.

Any errors at higher levels? Are any packets reordered?

> I found the same with #RX=#TX
> too. As a hack, I tried ixgbe without MQ (set "indices=1" before
> calling alloc_etherdev_mq, not sure if that is entirely correct) -
> here too SD worsened by around 40%. I can't explain it, since the
> virtio-net driver runs lock free once sch_direct_xmit gets
> HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not strictly
> correct since
> more threads are now running parallel and load is higher? Eg, if you
> compare SD between
> #netperfs = 8 vs 16 for original code (cut-n-paste relevant columns
> only) ...
> 
> N#         BW        SD
> 8           24356   88
> 16         23587   375
> 
> ... SD has increased more than 4 times for the same BW.
> 
> > What happens with a single netperf?
> > host -> guest performance with TCP and small packet speed
> > are also worth measuring.
> 
> OK, I will do this and send the results later today.
> 
> > At some level, host/guest communication is easy in that we don't really
> > care which queue is used.  I would like to give some thought (and
> > testing) to how is this going to work with a real NIC card and packet
> > steering at the backend.
> > Any idea?
> 
> I have done a little testing with guest -> remote server both
> using a bridge and with macvtap (mq is required only for rx).
> I didn't understand what you mean by packet steering though,
> is it whether packets go out of the NIC on different queues?
> If so, I verified that is the case by putting a counter and
> displaying through /debug interface on the host. dev_queue_xmit
> on the host handles it by calling dev_pick_tx().
> 
> > > Guest interrupts for a 4 TXQ device after a 5 min test:
> > > # egrep "virtio0|CPU" /proc/interrupts
> > >       CPU0     CPU1     CPU2    CPU3
> > > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
> >
> > Does this mean each interrupt is constantly bouncing between CPUs?
> 
> Yes. I didn't do *any* tuning for the tests. The only "tuning"
> was to use 64K IO size with netperf. When I ran default netperf
> (16K), I got a little lesser improvement in BW and worse(!) SD
> than with 64K.
> 
> Thanks,
> 
> - KK


* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08 10:48     ` Michael S. Tsirkin
@ 2010-09-08 12:19       ` Krishna Kumar2
  0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 12:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rick.jones2, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 04:18:33 PM:

> > > > _______________________________________________________________________________
> > > >                            TCP (#numtxqs=2)
> > > > N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> > > > _______________________________________________________________________________
> > > > 4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
> > > > 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> > > > 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> > > > 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> > > > 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> > > > 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> > > > 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
> > >
> > > That's a significant hit in TCP SD. Is it caused by the imbalance between
> > > number of queues for TX and RX? Since you mention RX is complete,
> > > maybe measure with a balanced TX/RX?
> >
> > Yes, I am not sure why it is so high.
>
> Any errors at higher levels? Are any packets reordered?

I haven't seen any messages logged, and the retransmission rate is
similar to the non-mq case. The device also shows no errors or dropped
packets. Anything else I should look for?

On the host:

# ifconfig vnet0
vnet0     Link encap:Ethernet  HWaddr 9A:9D:99:E1:CA:CE
          inet6 addr: fe80::989d:99ff:fee1:cace/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5090371 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5054616 errors:0 dropped:0 overruns:65 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:237793761392 (221.4 GiB)  TX bytes:333630070 (318.1 MiB)
# netstat -s  |grep -i retrans
    1310 segments retransmited
    35 times recovered from packet loss due to fast retransmit
    1 timeouts after reno fast retransmit
    41 fast retransmits
    1236 retransmits in slow start

So retransmissions are 0.025% of the total packets received from the guest.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08 10:17       ` Krishna Kumar2
@ 2010-09-08 14:12         ` Arnd Bergmann
  2010-09-08 16:47           ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Arnd Bergmann @ 2010-09-08 14:12 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Avi Kivity, anthony, davem, kvm, mst, netdev, rusty

On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > The new guest and qemu code work with old vhost-net, just with reduced
> > performance, yes?
> 
> Yes, I have tested new guest/qemu with old vhost but using
> #numtxqs=1 (or not passing any arguments at all to qemu to
> enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> since vhost_net_set_backend in the unmodified vhost checks
> for boundary overflow.
> 
> I have also tested running an unmodified guest with new
> vhost/qemu, but qemu should not specify numtxqs>1.

Can you live migrate a new guest from new-qemu/new-kernel
to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
If not, do we need to support all those cases?

	Arnd

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08 14:12         ` Arnd Bergmann
@ 2010-09-08 16:47           ` Krishna Kumar2
  2010-09-09 10:40             ` Arnd Bergmann
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 16:47 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty

> On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > The new guest and qemu code work with old vhost-net, just with
reduced
> > > performance, yes?
> >
> > Yes, I have tested new guest/qemu with old vhost but using
> > #numtxqs=1 (or not passing any arguments at all to qemu to
> > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > since vhost_net_set_backend in the unmodified vhost checks
> > for boundary overflow.
> >
> > I have also tested running an unmodified guest with new
> > vhost/qemu, but qemu should not specify numtxqs>1.
>
> Can you live migrate a new guest from new-qemu/new-kernel
> to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> If not, do we need to support all those cases?

I have not tried this, though I added some minimal code in
virtio_net_load and virtio_net_save. I don't know what needs
to be done exactly at this time. I forgot to put this in the
"Next steps" list of things to do.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08  8:10 ` Michael S. Tsirkin
  2010-09-08  9:23   ` Krishna Kumar2
@ 2010-09-08 16:47   ` Krishna Kumar2
       [not found]   ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
  2 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 16:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:

> > _______________________________________________________________________________
> >                        UDP (#numtxqs=8)
> > N#      BW1     BW2   (%)      SD1     SD2   (%)
> > __________________________________________________________
> > 4       29836   56761 (90.24)   67      63    (-5.97)
> > 8       27666   63767 (130.48)  326     265   (-18.71)
> > 16      25452   60665 (138.35)  1396    1269  (-9.09)
> > 32      26172   63491 (142.59)  5617    4202  (-25.19)
> > 48      26146   64629 (147.18)  12813   9316  (-27.29)
> > 64      25575   65448 (155.90)  23063   16346 (-29.12)
> > 128     26454   63772 (141.06)  91054   85051 (-6.59)
> > __________________________________________________________
> > N#: Number of netperf sessions, 90 sec runs
> > BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
> >               SD for original code
> > BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
> >               SD for new code. e.g. BW2=40716 means average BW2 was
> >               20358 mbps.
> >
>
> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.

Guest -> Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.

Host -> guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.

Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
for runs upto 128 sessions. With fewer netperf (under 8), there
was a drop of 3-7% in #packets, but beyond that, the #packets
improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions have a negative effect on the tx
side for some reason. The code path in virtio-net has not
changed much, so the drop in some cases is quite unexpected.

Thanks,

- KK



* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
@ 2010-09-09  3:49   ` Rusty Russell
  2010-09-09  5:23     ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Rusty Russell @ 2010-09-09  3:49 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, anthony, kvm, mst

On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> Add virtio_get_queue_index() to get the queue index of a
> vq.  This is needed by the cb handler to locate the queue
> that should be processed.

This seems a bit weird.  I mean, the driver used vdev->config->find_vqs
to find the queues, which returns them (in order).  So, can't you put this
into your struct send_queue?

Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of them,
it should simply not use them...

Thanks!
Rusty.


* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-09  3:49   ` Rusty Russell
@ 2010-09-09  5:23     ` Krishna Kumar2
  2010-09-09 12:14       ` Rusty Russell
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09  5:23 UTC (permalink / raw)
  To: Rusty Russell; +Cc: anthony, davem, kvm, mst, netdev

Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 09:19:39 AM:

> On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> > Add virtio_get_queue_index() to get the queue index of a
> > vq.  This is needed by the cb handler to locate the queue
> > that should be processed.
>
> This seems a bit weird.  I mean, the driver used vdev->config->find_vqs
> to find the queues, which returns them (in order).  So, can't you put
this
> into your struct send_queue?

I am saving the vqs in the send_queue, but the cb needs
to locate the device txq from the svq. The only other way
I could think of is to iterate through the send_queues
and compare svq against sq[i]->svq, but cbs happen quite
often. Is there a better way?

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;
	int qnum = virtio_get_queue_index(svq) - 1;     /* 0 is RX vq */

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	netif_wake_subqueue(vi->dev, qnum);
}

> Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of
them,
> it should simply not use them...

The main reason was vhost :) Since vhost_net_release
should not fail (__fput can't handle f_op->release()
failure), I needed a maximum number of socks to
clean up:

#define MAX_VQS	(1 + VIRTIO_MAX_TXQS)
static int vhost_net_release(struct inode *inode, struct file *f)
{
	struct vhost_net *n = f->private_data;
	struct vhost_dev *dev = &n->dev;
	struct socket *socks[MAX_VQS];
	int i;

	vhost_net_stop(n, socks);
	vhost_net_flush(n);
	vhost_dev_cleanup(dev);

	for (i = n->dev.nvqs - 1; i >= 0; i--)
		if (socks[i])
			fput(socks[i]->file);
	...
}

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]   ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
@ 2010-09-09  9:45     ` Krishna Kumar2
  2010-09-09 23:00       ` Sridhar Samudrala
  2010-09-12 11:40       ` Michael S. Tsirkin
       [not found]     ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
  1 sibling, 2 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09  9:45 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty

> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results and likely cause for single netperf
degradation below.


> Guest -> Host (single netperf):
> I am getting a drop of almost 20%. I am trying to figure out
> why.
>
> Host -> guest (single netperf):
> I am getting an improvement of almost 15%. Again - unexpected.
>
> Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> for runs upto 128 sessions. With fewer netperf (under 8), there
> was a drop of 3-7% in #packets, but beyond that, the #packets
> improved significantly to give an average improvement of 7.4%.
>
> So it seems that fewer sessions is having negative effect for
> some reason on the tx side. The code path in virtio-net has not
> changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:

Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest -> Host (1 netperf)     : Latency: -3%, SD: 3.5%

Single vhost performs well but hits the barrier at 16 netperf
sessions:

SINGLE vhost (Guest -> Host):
        1 netperf:    BW: 10.7%     SD: -1.4%
        4 netperfs:   BW: 3%        SD: 1.4%
        8 netperfs:   BW: 17.7%     SD: -10%
        16 netperfs:  BW: 4.7%      SD: -7.0%
        32 netperfs:  BW: -6.1%     SD: -5.7%

BW and SD both improve (the guest's multiple txqs help). For 32
netperfs, SD still improves even though BW drops.

But with multiple vhosts, the guest is able to send more packets
and BW increases much more (SD also increases, but I think
that is expected). From the earlier results:

N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
_______________________________________________________________________________
4       26387   40716 (54.30)   20      28   (40.00)    86      85     (-1.16)
8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
_______________________________________________________________________________
(All tests were done without any tuning)

From my testing:

1. Single vhost improves mq guest performance up to 16
   netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
   performance, but significantly improves performance
   for any number of netperf sessions.

Likely cause of the 1-stream degradation with the multiple-vhost
patch:

1. Two vhosts run, handling the RX and TX respectively.
   I think the issue is related to cache ping-pong,
   especially since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
   performance drop is the same, but the reason is that the
   guest is not using txq[0] in most cases (dev_pick_tx),
   so vhost's rx and tx are running on different threads.
   But whenever the guest uses txq[0], only one vhost
   runs and the performance is similar to the original.

I went back to my *submitted* patch and started a guest
with numtxqs=16 and pinned every vhost thread to cpus #0 and #1.
Now, whether the guest used txq[0] or txq[n], the performance is
similar to or better than with the original code (between 10-27%
across 10 runs). Also, a -6% to -24% improvement in SD.

I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.

Thanks,

- KK



* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-08 16:47           ` Krishna Kumar2
@ 2010-09-09 10:40             ` Arnd Bergmann
  2010-09-09 13:19               ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Arnd Bergmann @ 2010-09-09 10:40 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty

On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > > The new guest and qemu code work with old vhost-net, just with
> reduced
> > > > performance, yes?
> > >
> > > Yes, I have tested new guest/qemu with old vhost but using
> > > #numtxqs=1 (or not passing any arguments at all to qemu to
> > > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > > since vhost_net_set_backend in the unmodified vhost checks
> > > for boundary overflow.
> > >
> > > I have also tested running an unmodified guest with new
> > > vhost/qemu, but qemu should not specify numtxqs>1.
> >
> > Can you live migrate a new guest from new-qemu/new-kernel
> > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > If not, do we need to support all those cases?
> 
> I have not tried this, though I added some minimal code in
> virtio_net_load and virtio_net_save. I don't know what needs
> to be done exactly at this time. I forgot to put this in the
> "Next steps" list of things to do.

I was mostly trying to find out if you think it should work
or if there are specific reasons why it would not.
E.g. when migrating to a machine that has an old qemu, the guest
gets reduced to a single queue, but it's not clear to me how
it can learn about this, or if it can get hidden by the outbound
qemu.

	Arnd


* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-09  5:23     ` Krishna Kumar2
@ 2010-09-09 12:14       ` Rusty Russell
  2010-09-09 13:49         ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Rusty Russell @ 2010-09-09 12:14 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev

On Thu, 9 Sep 2010 02:53:52 pm Krishna Kumar2 wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 09:19:39 AM:
> 
> > On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> > > Add virtio_get_queue_index() to get the queue index of a
> > > vq.  This is needed by the cb handler to locate the queue
> > > that should be processed.
> >
> > This seems a bit weird.  I mean, the driver used vdev->config->find_vqs
> > to find the queues, which returns them (in order).  So, can't you put
> this
> > into your struct send_queue?
> 
> I am saving the vqs in the send_queue, but the cb needs
> to locate the device txq from the svq. The only other way
> I could think of is to iterate through the send_queue's
> and compare svq against sq[i]->svq, but cb's happen quite
> a bit. Is there a better way?

Ah, good point.  Move the queue index into the struct virtqueue?
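
Something along these lines, presumably (a sketch of the idea only; the
field and helper names are made up, not from any posted patch):

struct virtqueue {
        /* ... existing fields: callback, name, vdev, priv ... */
        unsigned int queue_index;       /* position in find_vqs() order */
};

static inline unsigned int virtqueue_get_index(const struct virtqueue *vq)
{
        return vq->queue_index;
}

/* skb_xmit_done() could then do:
 *      int qnum = virtqueue_get_index(svq) - 1;        (0 is the RX vq)
 * without any new virtio-pci API.
 */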

> > Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of
> them,
> > it should simply not use them...
> 
> The main reason was vhost :) Since vhost_net_release
> should not fail (__fput can't handle f_op->release()
> failure), I needed a maximum number of socks to
> clean up:

Ah, then it belongs in the vhost headers.  The guest shouldn't see such
a restriction if it doesn't apply; it's a host thing.

Oh, and I think you could profitably use virtio_config_val(), too.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]     ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
@ 2010-09-09 13:18       ` Krishna Kumar2
  0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty

Krishna Kumar2/India/IBM wrote on 09/09/2010 03:15:53 PM:

> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.

Same patch; the only change is that I ran "taskset -p 03
<all vhost threads>", with no other tuning on host or guest.
Default netperf without any options. The BW is the sum
across two iterations of 60 secs each. The guest is started
with 2 txqs.

BW1/BW2: BW for org & new in mbps
SD1/SD2: SD for org & new
RSD1/RSD2: Remote SD for org & new
_______________________________________________________________________________
#    BW1       BW2    (%)        SD1    SD2   (%)      RSD1    RSD2  (%)
_______________________________________________________________________________
1    20903     19422  (-7.08)    1      1     (0)      6       7      (16.66)
2    21963     24330  (10.77)    6      6     (0)      25      25     (0)
4    22042     31841  (44.45)    23     28    (21.73)  102     110    (7.84)
8    21674     32045  (47.84)    97     111   (14.43)  419     421    (.47)
16   22281     31361  (40.75)    379    551   (45.38)  1663    2110   (26.87)
24   22521     31945  (41.84)    857    981   (14.46)  3748    3742   (-.16)
32   22976     32473  (41.33)    1528   1806  (18.19)  6594    6885   (4.41)
40   23197     32594  (40.50)    2390   2755  (15.27)  10239   10450  (2.06)
48   22973     32757  (42.58)    3542   3786  (6.88)   15074   14395  (-4.50)
64   23809     32814  (37.82)    6486   6981  (7.63)   27244   26381  (-3.16)
80   23564     32682  (38.69)    10169  11133 (9.47)   43118   41397  (-3.99)
96   22977     33069  (43.92)    14954  15881 (6.19)   62948   59071  (-6.15)
128  23649     33032  (39.67)    27067  28832 (6.52)   113892  106096 (-6.84)
_______________________________________________________________________________
     294534    400371 (35.9)     67504  72858 (7.9)    285077  271096 (-4.9)
_______________________________________________________________________________

I will try more tuning later as Avi suggested; I wanted to test
the minimal-tuning case for now.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-09 10:40             ` Arnd Bergmann
@ 2010-09-09 13:19               ` Krishna Kumar2
  0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:19 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty

Arnd Bergmann <arnd@arndb.de> wrote on 09/09/2010 04:10:27 PM:

> > > Can you live migrate a new guest from new-qemu/new-kernel
> > > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > > If not, do we need to support all those cases?
> >
> > I have not tried this, though I added some minimal code in
> > virtio_net_load and virtio_net_save. I don't know what needs
> > to be done exactly at this time. I forgot to put this in the
> > "Next steps" list of things to do.
>
> I was mostly trying to find out if you think it should work
> or if there are specific reasons why it would not.
> E.g. when migrating to a machine that has an old qemu, the guest
> gets reduced to a single queue, but it's not clear to me how
> it can learn about this, or if it can get hidden by the outbound
> qemu.

I agree, I am also not sure how the old guest will handle this.
Sorry about my ignorance on migration :(

Regards,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-09 12:14       ` Rusty Russell
@ 2010-09-09 13:49         ` Krishna Kumar2
  2010-09-10  3:33           ` Rusty Russell
  2010-09-12 11:46           ` Michael S. Tsirkin
  0 siblings, 2 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:49 UTC (permalink / raw)
  To: Rusty Russell; +Cc: anthony, davem, kvm, mst, netdev

Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 05:44:25 PM:
>
> > > This seems a bit weird.  I mean, the driver used vdev->config->
find_vqs
> > > to find the queues, which returns them (in order).  So, can't you put
> > this
> > > into your struct send_queue?
> >
> > I am saving the vqs in the send_queue, but the cb needs
> > to locate the device txq from the svq. The only other way
> > I could think of is to iterate through the send_queue's
> > and compare svq against sq[i]->svq, but cb's happen quite
> > a bit. Is there a better way?
>
> Ah, good point.  Move the queue index into the struct virtqueue?

Is it OK to move the queue_index from virtio_pci_vq_info
to virtqueue? I didn't want to change any data structures
in virtio for this patch, but I can do it either way.

> > > Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of
> > them,
> > > it should simply not use them...
> >
> > The main reason was vhost :) Since vhost_net_release
> > should not fail (__fput can't handle f_op->release()
> > failure), I needed a maximum number of socks to
> > clean up:
>
> Ah, then it belongs in the vhost headers.  The guest shouldn't see such
> a restriction if it doesn't apply; it's a host thing.
>
> Oh, and I think you could profitably use virtio_config_val(), too.

OK, I will make those changes. Thanks for the reference to
virtio_config_val(), I will use it in guest probe instead of
the cumbersome way I am doing now. Unfortunately I need a
constant in vhost for now.
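
(For the guest probe I am thinking of something along these lines --
sketch only; the feature bit and config field names are the ones from
my RFC patch and may well change:)

	u16 numtxqs;

	/* read numtxqs from config space if the host offered the
	 * feature, else fall back to a single TX queue */
	if (virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
			      offsetof(struct virtio_net_config, numtxqs),
			      &numtxqs) < 0)
		numtxqs = 1;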

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-09  9:45     ` Krishna Kumar2
@ 2010-09-09 23:00       ` Sridhar Samudrala
  2010-09-10  5:19         ` Krishna Kumar2
  2010-09-12 11:40       ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Sridhar Samudrala @ 2010-09-09 23:00 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty

  On 9/9/2010 2:45 AM, Krishna Kumar2 wrote:
>> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
> Some more results and likely cause for single netperf
> degradation below.
>
>
>> Guest ->  Host (single netperf):
>> I am getting a drop of almost 20%. I am trying to figure out
>> why.
>>
>> Host ->  guest (single netperf):
>> I am getting an improvement of almost 15%. Again - unexpected.
>>
>> Guest ->  Host TCP_RR: I get an average 7.4% increase in #packets
>> for runs upto 128 sessions. With fewer netperf (under 8), there
>> was a drop of 3-7% in #packets, but beyond that, the #packets
>> improved significantly to give an average improvement of 7.4%.
>>
>> So it seems that fewer sessions is having negative effect for
>> some reason on the tx side. The code path in virtio-net has not
>> changed much, so the drop in some cases is quite unexpected.
> The drop for the single netperf seems to be due to multiple vhost.
> I changed the patch to start *single* vhost:
>
> Guest ->  Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest ->  Host (1 netperf)     : Latency: -3%, SD: 3.5%
I remember seeing a similar issue when using a separate vhost thread
for TX and RX queues.  Basically, we should have the same vhost thread
process a TCP flow in both directions. I guess this allows the data
and ACKs to be processed in sync.


Thanks
Sridhar
> Single vhost performs well but hits the barrier at 16 netperf
> sessions:
>
> SINGLE vhost (Guest ->  Host):
> 	1 netperf:    BW: 10.7%     SD: -1.4%
> 	4 netperfs:   BW: 3%        SD: 1.4%
> 	8 netperfs:   BW: 17.7%     SD: -10%
>        16 netperfs:  BW: 4.7%      SD: -7.0%
>        32 netperfs:  BW: -6.1%     SD: -5.7%
> BW and SD both improves (guest multiple txqs help). For 32
> netperfs, SD improves.
>
> But with multiple vhosts, guest is able to send more packets
> and BW increases much more (SD too increases, but I think
> that is expected). From the earlier results:
>
> N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> _______________________________________________________________________________
> 4       26387   40716 (54.30)   20      28   (40.00)    86      85     (-1.16)
> 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
> _______________________________________________________________________________
> (All tests were done without any tuning)
>
>  From my testing:
>
> 1. Single vhost improves mq guest performance upto 16
>     netperfs but degrades after that.
> 2. Multiple vhost degrades single netperf guest
>     performance, but significantly improves performance
>     for any number of netperf sessions.
>
> Likely cause for the 1 stream degradation with multiple
> vhost patch:
>
> 1. Two vhosts run handling the RX and TX respectively.
>     I think the issue is related to cache ping-pong esp
>     since these run on different cpus/sockets.
> 2. I (re-)modified the patch to share RX with TX[0]. The
>     performance drop is the same, but the reason is the
>     guest is not using txq[0] in most cases (dev_pick_tx),
>     so vhost's rx and tx are running on different threads.
>     But whenever the guest uses txq[0], only one vhost
>     runs and the performance is similar to original.
>
> I went back to my *submitted* patch and started a guest
> with numtxq=16 and pinned every vhost to cpus #0&1. Now
> whether guest used txq[0] or txq[n], the performance is
> similar or better (between 10-27% across 10 runs) than
> original code. Also, -6% to -24% improvement in SD.
>
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
>
> Thanks,
>
> - KK
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-09 13:49         ` Krishna Kumar2
@ 2010-09-10  3:33           ` Rusty Russell
  2010-09-12 11:46           ` Michael S. Tsirkin
  1 sibling, 0 replies; 43+ messages in thread
From: Rusty Russell @ 2010-09-10  3:33 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev

On Thu, 9 Sep 2010 11:19:33 pm Krishna Kumar2 wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 05:44:25 PM:
> > Ah, good point.  Move the queue index into the struct virtqueue?
> 
> Is it OK to move the queue_index from virtio_pci_vq_info
> to virtqueue? I didn't want to change any data structures
> in virtio for this patch, but I can do it either way.

Yep, it's logical to me.

Thanks!
Rusty.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-09 23:00       ` Sridhar Samudrala
@ 2010-09-10  5:19         ` Krishna Kumar2
  0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-10  5:19 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty

Sridhar Samudrala <sri@us.ibm.com> wrote on 09/10/2010 04:30:24 AM:

> I remember seeing a similar issue when using a separate vhost thread
> for TX and RX queues.  Basically, we should have the same vhost thread
> process a TCP flow in both directions. I guess this allows the data
> and ACKs to be processed in sync.

I was trying that by sharing threads between rx and tx[0], but
that didn't work either since the guest rarely picks txq=0. I was
able to get reasonable single-stream performance by pinning the
vhosts to the same cpu.
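
(FWIW the pinning itself is nothing special -- just the taskset
equivalent done programmatically. Illustrative sketch only, error
handling omitted:)

	/* pin a vhost worker thread (by tid) to cpus 0 and 1,
	 * i.e. what "taskset -p 03 <tid>" does */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <sys/types.h>

	static int pin_thread(pid_t tid)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(0, &set);
		CPU_SET(1, &set);
		return sched_setaffinity(tid, sizeof(set), &set);
	}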

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-09  9:45     ` Krishna Kumar2
  2010-09-09 23:00       ` Sridhar Samudrala
@ 2010-09-12 11:40       ` Michael S. Tsirkin
  2010-09-13  4:12         ` Krishna Kumar2
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-12 11:40 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty

On Thu, Sep 09, 2010 at 03:15:53PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
> 
> Some more results and likely cause for single netperf
> degradation below.
> 
> 
> > Guest -> Host (single netperf):
> > I am getting a drop of almost 20%. I am trying to figure out
> > why.
> >
> > Host -> guest (single netperf):
> > I am getting an improvement of almost 15%. Again - unexpected.
> >
> > Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> > for runs upto 128 sessions. With fewer netperf (under 8), there
> > was a drop of 3-7% in #packets, but beyond that, the #packets
> > improved significantly to give an average improvement of 7.4%.
> >
> > So it seems that fewer sessions is having negative effect for
> > some reason on the tx side. The code path in virtio-net has not
> > changed much, so the drop in some cases is quite unexpected.
> 
> The drop for the single netperf seems to be due to multiple vhost.
> I changed the patch to start *single* vhost:
> 
> Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest -> Host (1 netperf)     : Latency: -3%, SD: 3.5%
> 
> Single vhost performs well but hits the barrier at 16 netperf
> sessions:
> 
> SINGLE vhost (Guest -> Host):
> 	1 netperf:    BW: 10.7%     SD: -1.4%
> 	4 netperfs:   BW: 3%        SD: 1.4%
> 	8 netperfs:   BW: 17.7%     SD: -10%
>       16 netperfs:  BW: 4.7%      SD: -7.0%
>       32 netperfs:  BW: -6.1%     SD: -5.7%
> BW and SD both improves (guest multiple txqs help). For 32
> netperfs, SD improves.
> 
> But with multiple vhosts, guest is able to send more packets
> and BW increases much more (SD too increases, but I think
> that is expected).

Why is this expected?

> From the earlier results:
> 
> N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
> _______________________________________________________________________________
> 4       26387   40716 (54.30)   20      28   (40.00)    86      85     (-1.16)
> 8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
> 16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
> 32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
> 48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
> 64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
> 96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
> _______________________________________________________________________________
> (All tests were done without any tuning)
> 
> From my testing:
> 
> 1. Single vhost improves mq guest performance upto 16
>    netperfs but degrades after that.
> 2. Multiple vhost degrades single netperf guest
>    performance, but significantly improves performance
>    for any number of netperf sessions.
> 
> Likely cause for the 1 stream degradation with multiple
> vhost patch:
> 
> 1. Two vhosts run handling the RX and TX respectively.
>    I think the issue is related to cache ping-pong esp
>    since these run on different cpus/sockets.

Right. With TCP I think we are better off handling
TX and RX for a socket by the same vhost, so that
packet and its ack are handled by the same thread.
Is this what happens with RX multiqueue patch?
How do we select an RX queue to put the packet on?


> 2. I (re-)modified the patch to share RX with TX[0]. The
>    performance drop is the same, but the reason is the
>    guest is not using txq[0] in most cases (dev_pick_tx),
>    so vhost's rx and tx are running on different threads.
>    But whenever the guest uses txq[0], only one vhost
>    runs and the performance is similar to original.
> 
> I went back to my *submitted* patch and started a guest
> with numtxq=16 and pinned every vhost to cpus #0&1. Now
> whether guest used txq[0] or txq[n], the performance is
> similar or better (between 10-27% across 10 runs) than
> original code. Also, -6% to -24% improvement in SD.
> 
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
> 
> Thanks,
> 
> - KK

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-09 13:49         ` Krishna Kumar2
  2010-09-10  3:33           ` Rusty Russell
@ 2010-09-12 11:46           ` Michael S. Tsirkin
  2010-09-13  4:20             ` Krishna Kumar2
  1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-12 11:46 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Rusty Russell, anthony, davem, kvm, netdev

On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> Unfortunately I need a
> constant in vhost for now.

Maybe not even that: you create multiple vhost-net
devices so vhost-net in kernel does not care about these
either, right? So this can be just part of vhost_net.h
in qemu.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-12 11:40       ` Michael S. Tsirkin
@ 2010-09-13  4:12         ` Krishna Kumar2
  2010-09-13 11:50           ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13  4:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:10:25 PM:

> > SINGLE vhost (Guest -> Host):
> >    1 netperf:    BW: 10.7%     SD: -1.4%
> >    4 netperfs:   BW: 3%        SD: 1.4%
> >    8 netperfs:   BW: 17.7%     SD: -10%
> >       16 netperfs:  BW: 4.7%      SD: -7.0%
> >       32 netperfs:  BW: -6.1%     SD: -5.7%
> > BW and SD both improves (guest multiple txqs help). For 32
> > netperfs, SD improves.
> >
> > But with multiple vhosts, guest is able to send more packets
> > and BW increases much more (SD too increases, but I think
> > that is expected).
>
> Why is this expected?

Results with the original kernel:
_____________________________
#       BW      SD      RSD
______________________________
1       20903   1       6
2       21963   6       25
4       22042   23      102
8       21674   97      419
16      22281   379     1663
24      22521   857     3748
32      22976   1528    6594
40      23197   2390    10239
48      22973   3542    15074
64      23809   6486    27244
80      23564   10169   43118
96      22977   14954   62948
128     23649   27067   113892
________________________________

With a higher number of threads running in parallel, SD
increased. In this case most threads run in parallel only
till __dev_xmit_skb (#numtxqs=1). With the mq TX patch, a
higher number of threads run in parallel through
ndo_start_xmit. I *think* the increase in SD has to do
with more threads running through a larger code path.
From the numbers I posted with the patch (cut-n-paste of
only the % parts), BW increased much more than SD,
sometimes more than twice the increase in SD.

N#      BW%     SD%      RSD%
4       54.30   40.00    -1.16
8       71.79   46.59    -2.68
16      71.89   50.40    -2.50
32      72.24   34.26    -14.52
48      70.10   31.51    -14.35
64      69.01   38.81    -9.66
96      70.68   71.26    10.74

I also think SD calculation gets skewed for guest->local
host testing. For this test, I ran a guest with numtxqs=16.
The first result below is with my patch, which creates 16
vhosts. The second result is with a modified patch which
creates only 2 vhosts (testing with #netperfs = 64):

#vhosts  BW%     SD%        RSD%
16       20.79   186.01     149.74
2        30.89   34.55      18.44

The remote SD increases with the number of vhost threads,
but that number seems to correlate with guest SD. So though
BW% increased slightly from 20% to 30%, SD fell drastically
from 186% to 34%. I think it could be a calculation skew
with host SD, which also fell from 150% to 18%.

I am planning to submit 2nd patch rev with restricted
number of vhosts.

> > Likely cause for the 1 stream degradation with multiple
> > vhost patch:
> >
> > 1. Two vhosts run handling the RX and TX respectively.
> >    I think the issue is related to cache ping-pong esp
> >    since these run on different cpus/sockets.
>
> Right. With TCP I think we are better off handling
> TX and RX for a socket by the same vhost, so that
> packet and its ack are handled by the same thread.
> Is this what happens with RX multiqueue patch?
> How do we select an RX queue to put the packet on?

My (unsubmitted) RX patch doesn't do this yet, that is
something I will check.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-12 11:46           ` Michael S. Tsirkin
@ 2010-09-13  4:20             ` Krishna Kumar2
  2010-09-13  9:04               ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13  4:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, Rusty Russell

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:

> "Michael S. Tsirkin" <mst@redhat.com>
> 09/12/2010 05:16 PM
>
> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> > Unfortunately I need a
> > constant in vhost for now.
>
> Maybe not even that: you create multiple vhost-net
> devices so vhost-net in kernel does not care about these
> either, right? So this can be just part of vhost_net.h
> in qemu.

Sorry, I didn't understand what you meant.

I can remove all socks[] arrays/constants by pre-allocating
sockets in vhost_setup_vqs. Then I can remove all "socks"
parameters in vhost_net_stop, vhost_net_release and
vhost_net_reset_owner.

Does this make sense?

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13  4:20             ` Krishna Kumar2
@ 2010-09-13  9:04               ` Michael S. Tsirkin
  2010-09-13 15:59                 ` Anthony Liguori
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13  9:04 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, Rusty Russell

On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
> 
> > "Michael S. Tsirkin" <mst@redhat.com>
> > 09/12/2010 05:16 PM
> >
> > On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> > > Unfortunately I need a
> > > constant in vhost for now.
> >
> > Maybe not even that: you create multiple vhost-net
> > devices so vhost-net in kernel does not care about these
> > either, right? So this can be just part of vhost_net.h
> > in qemu.
> 
> Sorry, I didn't understand what you meant.
> 
> I can remove all socks[] arrays/constants by pre-allocating
> sockets in vhost_setup_vqs. Then I can remove all "socks"
> parameters in vhost_net_stop, vhost_net_release and
> vhost_net_reset_owner.
> 
> Does this make sense?
> 
> Thanks,
> 
> - KK

Here's what I mean: each vhost device includes 1 TX
and 1 RX VQ. Instead of teaching vhost about multiqueue,
we could simply open /dev/vhost-net multiple times.
How many times would be up to qemu.
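
E.g. something like this on the qemu side -- sketch only, error
handling and cleanup omitted:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* one vhost-net instance per queue; each open()+SET_OWNER
	 * gets its own in-kernel context */
	static int open_vhost_fds(int n, int *fds)
	{
		int i;

		for (i = 0; i < n; i++) {
			fds[i] = open("/dev/vhost-net", O_RDWR);
			if (fds[i] < 0)
				return -1;
			if (ioctl(fds[i], VHOST_SET_OWNER) < 0)
				return -1;
		}
		return 0;
	}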

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-13  4:12         ` Krishna Kumar2
@ 2010-09-13 11:50           ` Michael S. Tsirkin
  2010-09-13 16:23             ` Krishna Kumar2
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13 11:50 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty, avi

On Mon, Sep 13, 2010 at 09:42:22AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:10:25 PM:
> 
> > > SINGLE vhost (Guest -> Host):
> > >    1 netperf:    BW: 10.7%     SD: -1.4%
> > >    4 netperfs:   BW: 3%        SD: 1.4%
> > >    8 netperfs:   BW: 17.7%     SD: -10%
> > >       16 netperfs:  BW: 4.7%      SD: -7.0%
> > >       32 netperfs:  BW: -6.1%     SD: -5.7%
> > > BW and SD both improves (guest multiple txqs help). For 32
> > > netperfs, SD improves.
> > >
> > > But with multiple vhosts, guest is able to send more packets
> > > and BW increases much more (SD too increases, but I think
> > > that is expected).
> >
> > Why is this expected?
> 
> Results with the original kernel:
> _____________________________
> #       BW      SD      RSD
> ______________________________
> 1       20903   1       6
> 2       21963   6       25
> 4       22042   23      102
> 8       21674   97      419
> 16      22281   379     1663
> 24      22521   857     3748
> 32      22976   1528    6594
> 40      23197   2390    10239
> 48      22973   3542    15074
> 64      23809   6486    27244
> 80      23564   10169   43118
> 96      22977   14954   62948
> 128     23649   27067   113892
> ________________________________
> 
> With higher number of threads running in parallel, SD
> increased. In this case most threads run in parallel
> only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> higher number of threads run in parallel through
> ndo_start_xmit. I *think* the increase in SD is to do
> with higher # of threads running for larger code path
> From the numbers I posted with the patch (cut-n-paste
> only the % parts), BW increased much more than the SD,
> sometimes more than twice the increase in SD.

Service demand is BW/CPU, right? So if BW goes up by 50%
and SD by 40%, this means that CPU more than doubled.

> N#      BW%     SD%      RSD%
> 4       54.30   40.00    -1.16
> 8       71.79   46.59    -2.68
> 16      71.89   50.40    -2.50
> 32      72.24   34.26    -14.52
> 48      70.10   31.51    -14.35
> 64      69.01   38.81    -9.66
> 96      70.68   71.26    10.74
> 
> I also think SD calculation gets skewed for guest->local
> host testing.

If it's broken, let's fix it?

> For this test, I ran a guest with numtxqs=16.
> The first result below is with my patch, which creates 16
> vhosts. The second result is with a modified patch which
> creates only 2 vhosts (testing with #netperfs = 64):

My guess is it's not a good idea to have more TX VQs than guest CPUs.

I realize for management it's easier to pass in a single vhost fd, but
just for testing it's probably easier to add code in userspace to open
/dev/vhost multiple times.

> 
> #vhosts  BW%     SD%        RSD%
> 16       20.79   186.01     149.74
> 2        30.89   34.55      18.44
> 
> The remote SD increases with the number of vhost threads,
> but that number seems to correlate with guest SD. So though
> BW% increased slightly from 20% to 30%, SD fell drastically
> from 186% to 34%. I think it could be a calculation skew
> with host SD, which also fell from 150% to 18%.

I think by default netperf looks in /proc/stat for CPU utilization data:
so host CPU utilization will include the guest CPU, I think?

I would go further and claim that for host/guest TCP,
CPU utilization and SD should always be identical.
Makes sense?

> 
> I am planning to submit 2nd patch rev with restricted
> number of vhosts.
> 
> > > Likely cause for the 1 stream degradation with multiple
> > > vhost patch:
> > >
> > > 1. Two vhosts run handling the RX and TX respectively.
> > >    I think the issue is related to cache ping-pong esp
> > >    since these run on different cpus/sockets.
> >
> > Right. With TCP I think we are better off handling
> > TX and RX for a socket by the same vhost, so that
> > packet and its ack are handled by the same thread.
> > Is this what happens with RX multiqueue patch?
> > How do we select an RX queue to put the packet on?
> 
> My (unsubmitted) RX patch doesn't do this yet, that is
> something I will check.
> 
> Thanks,
> 
> - KK

You'll want to work on top of net-next, I think there's
RX flow filtering work going on there.

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13  9:04               ` Michael S. Tsirkin
@ 2010-09-13 15:59                 ` Anthony Liguori
  2010-09-13 16:30                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Anthony Liguori @ 2010-09-13 15:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell

On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
>    
>> "Michael S. Tsirkin"<mst@redhat.com>  wrote on 09/12/2010 05:16:37 PM:
>>
>>      
>>> "Michael S. Tsirkin"<mst@redhat.com>
>>> 09/12/2010 05:16 PM
>>>
>>> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
>>>        
>>>> Unfortunately I need a
>>>> constant in vhost for now.
>>>>          
>>> Maybe not even that: you create multiple vhost-net
>>> devices so vhost-net in kernel does not care about these
>>> either, right? So this can be just part of vhost_net.h
>>> in qemu.
>>>        
>> Sorry, I didn't understand what you meant.
>>
>> I can remove all socks[] arrays/constants by pre-allocating
>> sockets in vhost_setup_vqs. Then I can remove all "socks"
>> parameters in vhost_net_stop, vhost_net_release and
>> vhost_net_reset_owner.
>>
>> Does this make sense?
>>
>> Thanks,
>>
>> - KK
>>      
> Here's what I mean: each vhost device includes 1 TX
> and 1 RX VQ. Instead of teaching vhost about multiqueue,
> we could simply open /dev/vhost-net multiple times.
> How many times would be up to qemu.
>    

Trouble is, each vhost-net device is associated with 1 tun/tap device,
which means that each vhost-net device is associated with one transmit
and one receive queue.

I don't know if you'll always have an equal number of transmit and
receive queues, but there's certainly a challenge in terms of
flexibility with this model.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-13 11:50           ` Michael S. Tsirkin
@ 2010-09-13 16:23             ` Krishna Kumar2
  2010-09-15  5:33               ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13 16:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, avi, davem, kvm, netdev, rusty, rick.jones2

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/13/2010 05:20:55 PM:

> > Results with the original kernel:
> > _____________________________
> > #       BW      SD      RSD
> > ______________________________
> > 1       20903   1       6
> > 2       21963   6       25
> > 4       22042   23      102
> > 8       21674   97      419
> > 16      22281   379     1663
> > 24      22521   857     3748
> > 32      22976   1528    6594
> > 40      23197   2390    10239
> > 48      22973   3542    15074
> > 64      23809   6486    27244
> > 80      23564   10169   43118
> > 96      22977   14954   62948
> > 128     23649   27067   113892
> > ________________________________
> >
> > With higher number of threads running in parallel, SD
> > increased. In this case most threads run in parallel
> > only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> > higher number of threads run in parallel through
> > ndo_start_xmit. I *think* the increase in SD is to do
> > with higher # of threads running for larger code path
> > > From the numbers I posted with the patch (cut-n-paste
> > only the % parts), BW increased much more than the SD,
> > sometimes more than twice the increase in SD.
>
> Service demand is BW/CPU, right? So if BW goes up by 50%
> and SD by 40%, this means that CPU more than doubled.

I think the SD calculation might be more complicated;
it seems to be based on adding up averages sampled
and stored during the run. But I still don't see how CPU
can double, e.g.:
	BW:  1000 -> 1500  (50%)
	SD:  100  -> 140   (40%)
	CPU: 10   -> 10.71 (7.1%)
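
(Sanity-checking the arithmetic under the other possible reading: if SD
is CPU cost per unit of data, i.e. SD ~ CPU/BW -- which is what netperf's
usec/KB output suggests -- then CPU ~ SD x BW, and the same numbers give:

	BW:  1000   -> 1500   (50%)
	SD:  100    -> 140    (40%)
	CPU: 100000 -> 210000 (110%)

i.e. "more than doubled". So we may simply be assuming different
definitions of SD here; worth confirming against the netperf source.)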

> > N#      BW%     SD%      RSD%
> > 4       54.30   40.00    -1.16
> > 8       71.79   46.59    -2.68
> > 16      71.89   50.40    -2.50
> > 32      72.24   34.26    -14.52
> > 48      70.10   31.51    -14.35
> > 64      69.01   38.81    -9.66
> > 96      70.68   71.26    10.74
> >
> > I also think SD calculation gets skewed for guest->local
> > host testing.
>
> If it's broken, let's fix it?
>
> > For this test, I ran a guest with numtxqs=16.
> > The first result below is with my patch, which creates 16
> > vhosts. The second result is with a modified patch which
> > creates only 2 vhosts (testing with #netperfs = 64):
>
> My guess is it's not a good idea to have more TX VQs than guest CPUs.

Definitely. I will try to run tomorrow with more reasonable
values, will also test with my second version of the patch
that creates a restricted number of vhosts, and post results.

> I realize for management it's easier to pass in a single vhost fd, but
> just for testing it's probably easier to add code in userspace to open
> /dev/vhost multiple times.
>
> >
> > #vhosts  BW%     SD%        RSD%
> > 16       20.79   186.01     149.74
> > 2        30.89   34.55      18.44
> >
> > The remote SD increases with the number of vhost threads,
> > but that number seems to correlate with guest SD. So though
> > BW% increased slightly from 20% to 30%, SD fell drastically
> > from 186% to 34%. I think it could be a calculation skew
> > with host SD, which also fell from 150% to 18%.
>
> I think by default netperf looks in /proc/stat for CPU utilization data:
> so host CPU utilization will include the guest CPU, I think?

It appears that way to me too, but the data above seems to
suggest the opposite...

> I would go further and claim that for host/guest TCP
> CPU utilization and SD should always be identical.
> Makes sense?

It makes sense to me, but once again I am not sure how SD
is really done, or whether it is linear to CPU. Cc'ing Rick
in case he can comment....

>
> >
> > I am planning to submit 2nd patch rev with restricted
> > number of vhosts.
> >
> > > > Likely cause for the 1 stream degradation with multiple
> > > > vhost patch:
> > > >
> > > > 1. Two vhosts run handling the RX and TX respectively.
> > > >    I think the issue is related to cache ping-pong esp
> > > >    since these run on different cpus/sockets.
> > >
> > > Right. With TCP I think we are better off handling
> > > TX and RX for a socket by the same vhost, so that
> > > packet and its ack are handled by the same thread.
> > > Is this what happens with RX multiqueue patch?
> > > How do we select an RX queue to put the packet on?
> >
> > My (unsubmitted) RX patch doesn't do this yet, that is
> > something I will check.
> >
> > Thanks,
> >
> > - KK
>
> You'll want to work on top of net-next, I think there's
> RX flow filtering work going on there.

Thanks Michael, I will follow up on that for the RX patch,
plus your suggestion on tying RX with TX.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13 15:59                 ` Anthony Liguori
@ 2010-09-13 16:30                   ` Michael S. Tsirkin
  2010-09-13 17:00                     ` Avi Kivity
  2010-09-13 17:40                     ` Anthony Liguori
  0 siblings, 2 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13 16:30 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell

On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
> On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> >>"Michael S. Tsirkin"<mst@redhat.com>  wrote on 09/12/2010 05:16:37 PM:
> >>
> >>>"Michael S. Tsirkin"<mst@redhat.com>
> >>>09/12/2010 05:16 PM
> >>>
> >>>On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> >>>>Unfortunately I need a
> >>>>constant in vhost for now.
> >>>Maybe not even that: you create multiple vhost-net
> >>>devices so vhost-net in kernel does not care about these
> >>>either, right? So this can be just part of vhost_net.h
> >>>in qemu.
> >>Sorry, I didn't understand what you meant.
> >>
> >>I can remove all socks[] arrays/constants by pre-allocating
> >>sockets in vhost_setup_vqs. Then I can remove all "socks"
> >>parameters in vhost_net_stop, vhost_net_release and
> >>vhost_net_reset_owner.
> >>
> >>Does this make sense?
> >>
> >>Thanks,
> >>
> >>- KK
> >Here's what I mean: each vhost device includes 1 TX
> >and 1 RX VQ. Instead of teaching vhost about multiqueue,
> >we could simply open /dev/vhost-net multiple times.
> >How many times would be up to qemu.
> 
> Trouble is, each vhost-net device is associated with 1 tun/tap
> device which means that each vhost-net device is associated with a
> transmit and receive queue.
> 
> I don't know if you'll always have an equal number of transmit and
> receive queues but there's certainly  challenge in terms of
> flexibility with this model.
> 
> Regards,
> 
> Anthony Liguori

Not really, TX and RX can be mapped to different devices,
or you can only map one of these. What is the trouble?
What other features would you desire in terms of flexibility?

-- 
MST

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13 16:30                   ` Michael S. Tsirkin
@ 2010-09-13 17:00                     ` Avi Kivity
  2010-09-15  5:35                       ` Michael S. Tsirkin
  2010-09-13 17:40                     ` Anthony Liguori
  1 sibling, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-09-13 17:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anthony Liguori, Krishna Kumar2, davem, kvm, netdev, Rusty Russell

  On 09/13/2010 06:30 PM, Michael S. Tsirkin wrote:
> Trouble is, each vhost-net device is associated with 1 tun/tap
> device which means that each vhost-net device is associated with a
> transmit and receive queue.
>
> I don't know if you'll always have an equal number of transmit and
> receive queues but there's certainly  challenge in terms of
> flexibility with this model.
>
> Regards,
>
> Anthony Liguori
> Not really, TX and RX can be mapped to different devices,
> or you can only map one of these. What is the trouble?

Suppose you have one multiqueue-capable ethernet card.  How can you 
connect it to multiple rx/tx queues?

tx is in principle doable, but what about rx?

What does "only map one of these" mean?  Connect the device with one 
queue (presumably rx), and terminate the others?


Will packet classification work (does the current multiqueue proposal 
support it)?


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13 16:30                   ` Michael S. Tsirkin
  2010-09-13 17:00                     ` Avi Kivity
@ 2010-09-13 17:40                     ` Anthony Liguori
  2010-09-15  5:40                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 43+ messages in thread
From: Anthony Liguori @ 2010-09-13 17:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell

On 09/13/2010 11:30 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
>    
>> On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
>>      
>>> On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
>>>        
>>>> "Michael S. Tsirkin"<mst@redhat.com>   wrote on 09/12/2010 05:16:37 PM:
>>>>
>>>>          
>>>>> "Michael S. Tsirkin"<mst@redhat.com>
>>>>> 09/12/2010 05:16 PM
>>>>>
>>>>> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
>>>>>            
>>>>>> Unfortunately I need a
>>>>>> constant in vhost for now.
>>>>>>              
>>>>> Maybe not even that: you create multiple vhost-net
>>>>> devices so vhost-net in kernel does not care about these
>>>>> either, right? So this can be just part of vhost_net.h
>>>>> in qemu.
>>>>>            
>>>> Sorry, I didn't understand what you meant.
>>>>
>>>> I can remove all socks[] arrays/constants by pre-allocating
>>>> sockets in vhost_setup_vqs. Then I can remove all "socks"
>>>> parameters in vhost_net_stop, vhost_net_release and
>>>> vhost_net_reset_owner.
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>>
>>>> - KK
>>>>          
>>> Here's what I mean: each vhost device includes 1 TX
>>> and 1 RX VQ. Instead of teaching vhost about multiqueue,
>>> we could simply open /dev/vhost-net multiple times.
>>> How many times would be up to qemu.
>>>        
>> Trouble is, each vhost-net device is associated with 1 tun/tap
>> device which means that each vhost-net device is associated with a
>> transmit and receive queue.
>>
>> I don't know if you'll always have an equal number of transmit and
>> receive queues but there's certainly  challenge in terms of
>> flexibility with this model.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> Not really, TX and RX can be mapped to different devices,
>    

It's just a little odd.  Would you bond multiple tun/tap devices to
achieve multi-queue TX?  For RX, do you somehow limit RX to only one of
those devices?

If we were doing this in QEMU (and btw, there need to be userspace
patches before we implement this on the kernel side), I think it would
make more sense to just rely on doing a multithreaded write to a single
tun/tap device and then hope that it can be made smarter at the
macvtap layer.

Regards,

Anthony Liguori


> or you can only map one of these. What is the trouble?
> What other features would you desire in terms of flexibility?
>
>    


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-13 16:23             ` Krishna Kumar2
@ 2010-09-15  5:33               ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15  5:33 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, avi, davem, kvm, netdev, rusty, rick.jones2

On Mon, Sep 13, 2010 at 09:53:40PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/13/2010 05:20:55 PM:
> 
> > > Results with the original kernel:
> > > _____________________________
> > > #       BW      SD      RSD
> > > ______________________________
> > > 1       20903   1       6
> > > 2       21963   6       25
> > > 4       22042   23      102
> > > 8       21674   97      419
> > > 16      22281   379     1663
> > > 24      22521   857     3748
> > > 32      22976   1528    6594
> > > 40      23197   2390    10239
> > > 48      22973   3542    15074
> > > 64      23809   6486    27244
> > > 80      23564   10169   43118
> > > 96      22977   14954   62948
> > > 128     23649   27067   113892
> > > ________________________________
> > >
> > > With higher number of threads running in parallel, SD
> > > increased. In this case most threads run in parallel
> > > only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> > > higher number of threads run in parallel through
> > > ndo_start_xmit. I *think* the increase in SD is to do
> > > with higher # of threads running for larger code path
> > > From the numbers I posted with the patch (cut-n-paste
> > > only the % parts), BW increased much more than the SD,
> > > sometimes more than twice the increase in SD.
> >
> > Service demand is BW/CPU, right? So if BW goes up by 50%
> > and SD by 40%, this means that CPU more than doubled.
> 
> I think the SD calculation might be more complicated,
> I think it does it based on adding up averages sampled
> and stored during the run. But, I still don't see how CPU
> can double?? e.g.
> 	BW: 1000 -> 1500 (50%)
> 	SD: 100 -> 140 (40%)
> 	CPU: 10 -> 10.71 (7.1%)

Hmm. Time to look at the source. Which netperf version did you use?

> > > N#      BW%     SD%      RSD%
> > > 4       54.30   40.00    -1.16
> > > 8       71.79   46.59    -2.68
> > > 16      71.89   50.40    -2.50
> > > 32      72.24   34.26    -14.52
> > > 48      70.10   31.51    -14.35
> > > 64      69.01   38.81    -9.66
> > > 96      70.68   71.26    10.74
> > >
> > > I also think SD calculation gets skewed for guest->local
> > > host testing.
> >
> > If it's broken, let's fix it?
> >
> > > For this test, I ran a guest with numtxqs=16.
> > > The first result below is with my patch, which creates 16
> > > vhosts. The second result is with a modified patch which
> > > creates only 2 vhosts (testing with #netperfs = 64):
> >
> > My guess is it's not a good idea to have more TX VQs than guest CPUs.
> 
> Definitely, I will try to run tomorrow with more reasonable
> values, also will test with my second version of the patch
> that creates restricted number of vhosts and post results.
> 
> > I realize for management it's easier to pass in a single vhost fd, but
> > just for testing it's probably easier to add code in userspace to open
> > /dev/vhost multiple times.
> >
> > >
> > > #vhosts  BW%     SD%        RSD%
> > > 16       20.79   186.01     149.74
> > > 2        30.89   34.55      18.44
> > >
> > > The remote SD increases with the number of vhost threads,
> > > but that number seems to correlate with guest SD. So though
> > > BW% increased slightly from 20% to 30%, SD fell drastically
> > > from 186% to 34%. I think it could be a calculation skew
> > > with host SD, which also fell from 150% to 18%.
> >
> > I think by default netperf looks in /proc/stat for CPU utilization data:
> > so host CPU utilization will include the guest CPU, I think?
> 
> It appears that way to me too, but the data above seems to
> suggest the opposite...
> 
> > I would go further and claim that for host/guest TCP
> > CPU utilization and SD should always be identical.
> > Makes sense?
> 
> It makes sense to me, but once again I am not sure how SD
> is really done, or whether it is linear to CPU. Cc'ing Rick
> in case he can comment....

Me neither. I should rephrase: I think we should always
use the host CPU utilization.

> >
> > >
> > > I am planning to submit 2nd patch rev with restricted
> > > number of vhosts.
> > >
> > > > > Likely cause for the 1 stream degradation with multiple
> > > > > vhost patch:
> > > > >
> > > > > 1. Two vhosts run handling the RX and TX respectively.
> > > > >    I think the issue is related to cache ping-pong esp
> > > > >    since these run on different cpus/sockets.
> > > >
> > > > Right. With TCP I think we are better off handling
> > > > TX and RX for a socket by the same vhost, so that
> > > > packet and its ack are handled by the same thread.
> > > > Is this what happens with RX multiqueue patch?
> > > > How do we select an RX queue to put the packet on?
> > >
> > > My (unsubmitted) RX patch doesn't do this yet, that is
> > > something I will check.
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > You'll want to work on top of net-next, I think there's
> > RX flow filtering work going on there.
> 
> Thanks Michael, I will follow up on that for the RX patch,
> plus your suggestion on tying RX with TX.
> 
> Thanks,
> 
> - KK
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13 17:00                     ` Avi Kivity
@ 2010-09-15  5:35                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15  5:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Krishna Kumar2, davem, kvm, netdev, Rusty Russell

On Mon, Sep 13, 2010 at 07:00:51PM +0200, Avi Kivity wrote:
>  On 09/13/2010 06:30 PM, Michael S. Tsirkin wrote:
> >Trouble is, each vhost-net device is associated with 1 tun/tap
> >device which means that each vhost-net device is associated with a
> >transmit and receive queue.
> >
> >I don't know if you'll always have an equal number of transmit and
> >receive queues but there's certainly  challenge in terms of
> >flexibility with this model.
> >
> >Regards,
> >
> >Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
> >or you can only map one of these. What is the trouble?
> 
> Suppose you have one multiqueue-capable ethernet card.  How can you
> connect it to multiple rx/tx queues?
> tx is in principle doable, but what about rx?
> 
> What does "only map one of these" mean?  Connect the device with one
> queue (presumably rx), and terminate the others?
> 
> 
> Will packet classification work (does the current multiqueue
> proposal support it)?
> 

This is a non-trivial problem, but it needs to be handled
in tap, not in vhost-net.
If tap gives you multiple queues, vhost-net will happily
let you connect vqs to these.
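
On the vhost side it stays the existing per-queue operation, e.g.
(sketch only; tap handing out one fd per queue is the assumption here):

	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* attach one per-queue tap fd as the backend of one virtqueue */
	static int attach_queue(int vhost_fd, unsigned int vq_index, int tap_fd)
	{
		struct vhost_vring_file backend = {
			.index = vq_index,
			.fd = tap_fd,
		};

		return ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);
	}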

> 
> -- 
> error compiling committee.c: too many arguments to function
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
  2010-09-13 17:40                     ` Anthony Liguori
@ 2010-09-15  5:40                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15  5:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell

On Mon, Sep 13, 2010 at 12:40:11PM -0500, Anthony Liguori wrote:
> On 09/13/2010 11:30 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
> >>On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> >>>>"Michael S. Tsirkin"<mst@redhat.com>   wrote on 09/12/2010 05:16:37 PM:
> >>>>
> >>>>>"Michael S. Tsirkin"<mst@redhat.com>
> >>>>>09/12/2010 05:16 PM
> >>>>>
> >>>>>On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> >>>>>>Unfortunately I need a
> >>>>>>constant in vhost for now.
> >>>>>Maybe not even that: you create multiple vhost-net
> >>>>>devices so vhost-net in kernel does not care about these
> >>>>>either, right? So this can be just part of vhost_net.h
> >>>>>in qemu.
> >>>>Sorry, I didn't understand what you meant.
> >>>>
> >>>>I can remove all socks[] arrays/constants by pre-allocating
> >>>>sockets in vhost_setup_vqs. Then I can remove all "socks"
> >>>>parameters in vhost_net_stop, vhost_net_release and
> >>>>vhost_net_reset_owner.
> >>>>
> >>>>Does this make sense?
> >>>>
> >>>>Thanks,
> >>>>
> >>>>- KK
> >>>Here's what I mean: each vhost device includes 1 TX
> >>>and 1 RX VQ. Instead of teaching vhost about multiqueue,
> >>>we could simply open /dev/vhost-net multiple times.
> >>>How many times would be up to qemu.
> >>Trouble is, each vhost-net device is associated with 1 tun/tap
> >>device which means that each vhost-net device is associated with a
> >>transmit and receive queue.
> >>
> >>I don't know if you'll always have an equal number of transmit and
> >>receive queues but there's certainly  challenge in terms of
> >>flexibility with this model.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
> 
> It's just a little odd.  Would you bond multiple tun tap devices to
> achieve multi-queue TX?  For RX, do you somehow limit RX to only one
> of those devices?

Exactly in the way the patches we discuss here do this:
we already have a per-queue fd.

> If we were doing this in QEMU (and btw, there needs to be userspace
> patches before we implement this in the kernel side),

I agree that feature parity is nice to have, but
I don't see a huge problem with (hopefully temporarily) only
supporting feature X with kernel acceleration, BTW.
This is already the case with checksum offloading features.

> I think it
> would make more sense to just rely on doing a multithreaded write to
> a single tun/tap device and then to hope that in can be made smarter
> at the macvtap layer.

No, an fd serializes access, so you need separate fds for multithreaded
writes to work.  Think about how e.g. select will work.

> Regards,
> 
> Anthony Liguori
> 
> 
> >or you can only map one of these. What is the trouble?
> >What other features would you desire in terms of flexibility?
> >
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2010-09-15  5:46 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-08  7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
2010-09-09  3:49   ` Rusty Russell
2010-09-09  5:23     ` Krishna Kumar2
2010-09-09 12:14       ` Rusty Russell
2010-09-09 13:49         ` Krishna Kumar2
2010-09-10  3:33           ` Rusty Russell
2010-09-12 11:46           ` Michael S. Tsirkin
2010-09-13  4:20             ` Krishna Kumar2
2010-09-13  9:04               ` Michael S. Tsirkin
2010-09-13 15:59                 ` Anthony Liguori
2010-09-13 16:30                   ` Michael S. Tsirkin
2010-09-13 17:00                     ` Avi Kivity
2010-09-15  5:35                       ` Michael S. Tsirkin
2010-09-13 17:40                     ` Anthony Liguori
2010-09-15  5:40                       ` Michael S. Tsirkin
2010-09-08  7:29 ` [RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
2010-09-08  7:29 ` [RFC PATCH 3/4] Changes for vhost Krishna Kumar
2010-09-08  7:29 ` [RFC PATCH 4/4] qemu changes Krishna Kumar
2010-09-08  7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
2010-09-08  9:22   ` Krishna Kumar2
2010-09-08  9:28     ` Avi Kivity
2010-09-08 10:17       ` Krishna Kumar2
2010-09-08 14:12         ` Arnd Bergmann
2010-09-08 16:47           ` Krishna Kumar2
2010-09-09 10:40             ` Arnd Bergmann
2010-09-09 13:19               ` Krishna Kumar2
2010-09-08  8:10 ` Michael S. Tsirkin
2010-09-08  9:23   ` Krishna Kumar2
2010-09-08 10:48     ` Michael S. Tsirkin
2010-09-08 12:19       ` Krishna Kumar2
2010-09-08 16:47   ` Krishna Kumar2
     [not found]   ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
2010-09-09  9:45     ` Krishna Kumar2
2010-09-09 23:00       ` Sridhar Samudrala
2010-09-10  5:19         ` Krishna Kumar2
2010-09-12 11:40       ` Michael S. Tsirkin
2010-09-13  4:12         ` Krishna Kumar2
2010-09-13 11:50           ` Michael S. Tsirkin
2010-09-13 16:23             ` Krishna Kumar2
2010-09-15  5:33               ` Michael S. Tsirkin
     [not found]     ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
2010-09-09 13:18       ` Krishna Kumar2
2010-09-08  8:13 ` Michael S. Tsirkin
2010-09-08  9:28   ` Krishna Kumar2
