* [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar @ 2010-09-17 10:03 UTC
  To: rusty, davem, mst; +Cc: kvm, arnd, netdev, avi, anthony, Krishna Kumar

The following patches implement transmit MQ in virtio-net.  The
corresponding userspace qemu changes are also included.  MQ is
disabled by default unless qemu enables it.

1. This feature was first implemented with a single vhost.
   Testing showed a 3-8% performance gain for up to 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, adding more vhosts improved BW
   significantly all the way to 128 sessions.  Multiple
   vhosts are implemented in-kernel by passing an argument
   to SET_OWNER (retaining backward compatibility).  The
   vhost patch adds 173 source lines (incl comments).
2. BW vs CPU/SD trade-off: Average TCP performance increased
   23%, compared to almost 70% for the earlier patch (which
   used an unrestricted number of vhosts).  SD improved by
   4.2%, whereas it had increased 55% with the earlier patch.
   Increasing the number of vhosts has its pros and cons, but
   this patch lays emphasis on reducing CPU utilization.
   Another option could be a tunable to select the number of
   vhost threads.
3. Interoperability: Many combinations of qemu, host and guest
   (though not all) were tested together.  Tested with multiple
   interfaces in the guest, with mq=on/off, vhost=on/off, etc.

                  Changes from rev1:
                  ------------------
1. Moved queue_index from virtio_pci_vq_info to virtqueue,
   with the resulting changes to existing code and to the patch.
2. virtio-net probe now uses virtio_config_val.
3. Removed constants: VIRTIO_MAX_TXQS, MAX_VQS, all arrays
   allocated on the stack, etc.
4. Restricted the number of vhost threads to 2 - I get much
   better CPU/SD results (without any tuning) with a low number
   of vhost threads.  More vhosts give better average BW
   (about 45% on average), but SD increases significantly
   (about 90%).
5. The assignment of work to vhost threads changes; e.g. for numtxqs=4:
       vhost-0: handles RX
       vhost-1: handles TX[0]
       vhost-0: handles TX[1]
       vhost-1: handles TX[2]
       vhost-0: handles TX[3]

                  Enabling MQ on virtio:
                  -----------------------
When the following options are passed to qemu:
        - smp > 1
        - vhost=on
        - mq=on (new option, default:off)
then #txqueues = #cpus.  The number of txqueues can be changed
with the optional 'numtxqs' option, e.g. for an smp=4 guest:
        vhost=on                   ->   #txqueues = 1
        vhost=on,mq=on             ->   #txqueues = 4
        vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
        vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
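
For illustration, a full qemu command line for an smp=4 guest might
look like the following (hypothetical - the exact tap option syntax
depends on the qemu version this series is applied to):

        qemu-system-x86_64 -smp 4 -m 2048 -drive file=guest.img \
                -net nic,model=virtio \
                -net tap,vhost=on,mq=on,numtxqs=4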


                   Performance (guest -> local host):
                   -----------------------------------
System configuration:
        Host:  8 Intel Xeon, 8 GB memory
        Guest: 4 cpus, 2 GB memory, numtxqs=4
All testing was done without any system tuning, using default netperf.
Results are split across two tables to show SD and CPU usage:
________________________________________________________________________
                    TCP: BW vs CPU/Remote CPU utilization:
#    BW1    BW2 (%)        CPU1    CPU2 (%)     RCPU1  RCPU2 (%)
________________________________________________________________________
1    69971  65376 (-6.56)  134   170  (26.86)   322    376   (16.77)
2    20911  24839 (18.78)  107   139  (29.90)   217    264   (21.65)
4    21431  28912 (34.90)  213   318  (49.29)   444    541   (21.84)
8    21857  34592 (58.26)  444   859  (93.46)   901    1247  (38.40)
16   22368  33083 (47.90)  899   1523 (69.41)   1813   2410  (32.92)
24   22556  32578 (44.43)  1347  2249 (66.96)   2712   3606  (32.96)
32   22727  30923 (36.06)  1806  2506 (38.75)   3622   3952  (9.11)
40   23054  29334 (27.24)  2319  2872 (23.84)   4544   4551  (.15)
48   23006  28800 (25.18)  2827  2990 (5.76)    5465   4718  (-13.66)
64   23411  27661 (18.15)  3708  3306 (-10.84)  7231   5218  (-27.83)
80   23175  27141 (17.11)  4796  4509 (-5.98)   9152   7182  (-21.52)
96   23337  26759 (14.66)  5603  4543 (-18.91)  10890  7162  (-34.23)
128  22726  28339 (24.69)  7559  6395 (-15.39)  14600  10169 (-30.34)
________________________________________________________________________
Summary:    BW: 22.8%    CPU: 1.9%    RCPU: -17.0%
________________________________________________________________________
                    TCP: BW vs SD/Remote SD:
#    BW1    BW2 (%)        SD1      SD2  (%)        RSD1    RSD2   (%)
________________________________________________________________________
1    69971  65376 (-6.56)  4       6     (50.00)    21      26     (23.80)
2    20911  24839 (18.78)  6       7     (16.66)    27      28     (3.70)
4    21431  28912 (34.90)  26      31    (19.23)    108     111    (2.77)
8    21857  34592 (58.26)  106     135   (27.35)    432     393    (-9.02)
16   22368  33083 (47.90)  431     577   (33.87)    1742    1828   (4.93)
24   22556  32578 (44.43)  972     1393  (43.31)    3915    4479   (14.40)
32   22727  30923 (36.06)  1723    2165  (25.65)    6908    6842   (-.95)
40   23054  29334 (27.24)  2774    2761  (-.46)     10874   8764   (-19.40)
48   23006  28800 (25.18)  4126    3847  (-6.76)    15953   12172  (-23.70)
64   23411  27661 (18.15)  7216    6035  (-16.36)   28146   19078  (-32.21)
80   23175  27141 (17.11)  11729   12454 (6.18)     44765   39750  (-11.20)
96   23337  26759 (14.66)  16745   15905 (-5.01)    65099   50261  (-22.79)
128  22726  28339 (24.69)  30571   27893 (-8.76)    118089  89994  (-23.79)
________________________________________________________________________
Summary:    BW: 22.8%    SD: -4.21%    RSD: -21.06%
________________________________________________________________________
                       UDP: BW vs SD/CPU
#      BW1      BW2 (%)      CPU1      CPU2 (%)      SD1    SD2    (%)
_____________________________________________________________________________
1      36521    37415 (2.44)   61     61    (0)      2      2      (0)
4      28585    46903 (64.08)  397    546   (37.53)  72     68     (-5.55)
8      26649    44694 (67.71)  851    1243  (46.06)  334    339    (1.49)
16     25905    43385 (67.47)  1740   2631  (51.20)  1409   1572   (11.56)
32     24980    40448 (61.92)  3502   5360  (53.05)  5881   6401   (8.84)
48     27439    39451 (43.77)  5410   8324  (53.86)  12475  14855  (19.07)
64     25682    39915 (55.42)  7165   10825 (51.08)  23404  25982  (11.01)
96     26205    40190 (53.36)  10855  16283 (50.00)  52124  75014  (43.91)
128    25741    40252 (56.37)  14448  22186 (53.55)  133922 96843  (-27.68)
____________________________________________________________________________
Summary:       BW: 50.4%     CPU: 51.8%     SD: -27.68%
_____________________________________________________________________________
#: Number of netperf sessions; 60 sec runs
BW1, SD1, RSD1: Bandwidth (sum across 2 runs, in mbps), netperf
              service demand (SD) and remote SD for the original code
BW2, SD2, RSD2: Bandwidth (sum across 2 runs, in mbps), SD and remote
              SD for the new code
CPU1, CPU2, RCPU1, RCPU2: Local and remote CPU utilization, named
              analogously to SD

For the single-session TCP case, I ran 7 iterations and summed the
results.  Explanation of the degradation in the 1-stream case:
    1. Without any tuning, BW falls 6.5%.
    2. When the vhosts on the server were bound to CPU0, BW was as
       good as with the original code.
    3. When the new code was started with numtxqs=1 (or mq=off,
       which is the default), there was no degradation.


                       Next steps:
                       -----------
1. The MQ RX patch is also complete - I plan to submit it once TX
   is OK (and after identifying the bandwidth degradations in some
   test cases).
2. Cache-align data structures: I didn't see any BW/SD improvement
   after statically cache-aligning the sq's (and similarly for
   vhost):
        struct virtnet_info {
                ...
                struct send_queue sq[16] ____cacheline_aligned_in_smp;
                ...
        };
3. Migration has not been tested.

Review/feedback appreciated.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---


* [v2 RFC PATCH 1/4] Change virtqueue structure
From: Krishna Kumar @ 2010-09-17 10:03 UTC
  To: rusty, davem, mst; +Cc: kvm, arnd, netdev, avi, anthony, Krishna Kumar

Move queue_index from virtio_pci_vq_info to virtqueue.  This
allows callback handlers to determine the queue number of the
vq that needs attention.
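
For example (this is how the next patch uses it), the TX completion
callback can recover its queue number directly from the vq:

	static void skb_xmit_done(struct virtqueue *svq)
	{
		struct virtnet_info *vi = svq->vdev->priv;
		int qnum = svq->queue_index - 1;	/* vq 0 is RX */

		/* Suppress further interrupts, then wake this queue only */
		virtqueue_disable_cb(svq);
		netif_wake_subqueue(vi->dev, qnum);
	}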

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> 
---
 drivers/virtio/virtio_pci.c |   10 +++-------
 include/linux/virtio.h      |    1 +
 2 files changed, 4 insertions(+), 7 deletions(-)

diff -ruNp org2/include/linux/virtio.h tx_only2/include/linux/virtio.h
--- org2/include/linux/virtio.h	2010-06-02 18:46:43.000000000 +0530
+++ tx_only2/include/linux/virtio.h	2010-09-16 15:24:01.000000000 +0530
@@ -22,6 +22,7 @@ struct virtqueue {
 	void (*callback)(struct virtqueue *vq);
 	const char *name;
 	struct virtio_device *vdev;
+	int queue_index;	/* the index of the queue */
 	void *priv;
 };
 
diff -ruNp org2/drivers/virtio/virtio_pci.c tx_only2/drivers/virtio/virtio_pci.c
--- org2/drivers/virtio/virtio_pci.c	2010-08-05 14:48:06.000000000 +0530
+++ tx_only2/drivers/virtio/virtio_pci.c	2010-09-16 15:24:01.000000000 +0530
@@ -75,9 +75,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -185,11 +182,10 @@ static void vp_reset(struct virtio_devic
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -385,7 +381,6 @@ static struct virtqueue *setup_vq(struct
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -408,6 +403,7 @@ static struct virtqueue *setup_vq(struct
 		goto out_activate_queue;
 	}
 
+	vq->queue_index = index;
 	vq->priv = info;
 	info->vq = vq;
 
@@ -446,7 +442,7 @@ static void vp_del_vq(struct virtqueue *
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(vq->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,


* [v2 RFC PATCH 2/4] Changes for virtio-net
From: Krishna Kumar @ 2010-09-17 10:03 UTC
  To: rusty, davem, mst; +Cc: kvm, arnd, netdev, avi, anthony, Krishna Kumar

Implement the mq virtio-net driver.

Though struct virtio_net_config changes, it works with old qemus,
since the last element is not accessed unless qemu sets
VIRTIO_NET_F_NUMTXQS.
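
On the guest side, the new config field is read only when the
feature bit is offered; a minimal sketch of the probe-time idiom
used below (virtio_config_val() returns an error when the feature
was not negotiated, so an old device is never asked for the field):

	u16 numtxqs;

	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
				offsetof(struct virtio_net_config, numtxqs),
				&numtxqs);
	if (err || !numtxqs)
		numtxqs = 1;	/* feature absent (old qemu): 1 txq */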

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/net/virtio_net.c   |  213 ++++++++++++++++++++++++++---------
 include/linux/virtio_net.h |    3 
 2 files changed, 163 insertions(+), 53 deletions(-)

diff -ruNp org2/include/linux/virtio_net.h tx_only2/include/linux/virtio_net.h
--- org2/include/linux/virtio_net.h	2010-02-10 13:20:27.000000000 +0530
+++ tx_only2/include/linux/virtio_net.h	2010-09-16 15:24:01.000000000 +0530
@@ -26,6 +26,7 @@
 #define VIRTIO_NET_F_CTRL_RX	18	/* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 
@@ -34,6 +35,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* number of transmit queues */
+	__u16 numtxqs;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org2/drivers/net/virtio_net.c tx_only2/drivers/net/virtio_net.c
--- org2/drivers/net/virtio_net.c	2010-07-08 12:54:32.000000000 +0530
+++ tx_only2/drivers/net/virtio_net.c	2010-09-16 15:24:01.000000000 +0530
@@ -40,9 +40,20 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
+/* Our representation of a send virtqueue */
+struct send_queue {
+	struct virtqueue *svq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	int numtxqs;			/* Number of tx queues */
+	struct send_queue *sq;
+	struct virtqueue *rvq;
+	struct virtqueue *cvq;
 	struct net_device *dev;
 	struct napi_struct napi;
 	unsigned int status;
@@ -62,9 +73,8 @@ struct virtnet_info {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* fragments + linear part + virtio header */
+	/* RX: fragments + linear part + virtio header */
 	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -120,12 +130,13 @@ static struct page *get_a_page(struct vi
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
+	int qnum = svq->queue_index - 1;	/* 0 is RX vq */
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, qnum);
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -495,12 +506,13 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *svq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
@@ -510,7 +522,8 @@ static unsigned int free_old_xmit_skbs(s
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *svq, struct scatterlist *tx_sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -548,12 +561,12 @@ static int xmit_skb(struct virtnet_info 
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
 					0, skb);
 }
 
@@ -561,31 +574,34 @@ static netdev_tx_t start_xmit(struct sk_
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *svq = vi->sq[qnum].svq;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, svq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, svq, vi->sq[qnum].tx_sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (net_ratelimit()) {
 			if (likely(capacity == -ENOMEM)) {
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					 "TXQ (%d) failure: out of memory\n",
+					 qnum);
 			} else {
 				dev->stats.tx_fifo_errors++;
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					 "Unexpected TXQ (%d) failure: %d\n",
+					 qnum, capacity);
 			}
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(svq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -594,13 +610,13 @@ static netdev_tx_t start_xmit(struct sk_
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb(svq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, svq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(svq);
 			}
 		}
 	}
@@ -871,10 +887,10 @@ static void virtnet_update_status(struct
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 }
 
@@ -885,18 +901,112 @@ static void virtnet_config_changed(struc
 	virtnet_update_status(vi);
 }
 
+#define MAX_DEVICE_NAME		16
+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int i, err = -ENOMEM;
+	int totalvqs;
+	char **names;
+
+	/* Allocate send queues */
+	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
+	if (!vi->sq)
+		goto out;
+
+	/* setup initial send queue parameters */
+	for (i = 0; i < numtxqs; i++)
+		sg_init_table(vi->sq[i].tx_sg, ARRAY_SIZE(vi->sq[i].tx_sg));
+
+	/*
+	 * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
+	 * optionally one control virtqueue.
+	 */
+	totalvqs = 1 + numtxqs +
+		   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Setup parameters for find_vqs */
+	vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto free_mem;
+
+	/* Parameters for recv virtqueue */
+	callbacks[0] = skb_recv_done;
+	names[0] = "input";
+
+	/* Parameters for send virtqueues */
+	for (i = 1; i <= numtxqs; i++) {
+		callbacks[i] = skb_xmit_done;
+		names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
+				   GFP_KERNEL);
+		if (!names[i])
+			goto free_mem;
+		sprintf(names[i], "output.%d", i - 1);
+	}
+
+	/* Parameters for control virtqueue, if any */
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		callbacks[i] = NULL;
+		names[i] = "control";
+	}
+
+	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
+					 (const char **)names);
+	if (err)
+		goto free_mem;
+
+	vi->rvq = vqs[0];
+	for (i = 0; i < numtxqs; i++)
+		vi->sq[i].svq = vqs[i + 1];
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+		vi->cvq = vqs[i + 1];
+
+		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	}
+
+free_mem:
+	if (names) {
+		for (i = 1; i <= numtxqs; i++)
+			kfree(names[i]);
+		kfree(names);
+	}
+
+	kfree(callbacks);
+	kfree(vqs);
+
+	if (err)
+		kfree(vi->sq);
+
+out:
+	return err;
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int err;
+	u16 numtxqs;
 	struct net_device *dev;
 	struct virtnet_info *vi;
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs;
+
+	/*
+	 * Find if host passed the number of transmit queues supported
+	 * by the device
+	 */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
+				offsetof(struct virtio_net_config, numtxqs),
+				&numtxqs);
+
+	/* We need at least one txq */
+	if (err || !numtxqs)
+		numtxqs = 1;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -940,9 +1050,9 @@ static int virtnet_probe(struct virtio_d
 	vi->vdev = vdev;
 	vdev->priv = vi;
 	vi->pages = NULL;
+	vi->numtxqs = numtxqs;
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -953,23 +1063,10 @@ static int virtnet_probe(struct virtio_d
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
-	err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+	/* Initialize our rx/tx queue parameters, and invoke find_vqs */
+	err = initialize_vqs(vi, numtxqs);
 	if (err)
-		goto free;
-
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
-
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
-		vi->cvq = vqs[2];
-
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			dev->features |= NETIF_F_HW_VLAN_FILTER;
-	}
+		goto free_netdev;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -986,6 +1083,9 @@ static int virtnet_probe(struct virtio_d
 		goto unregister;
 	}
 
+	dev_info(&dev->dev, "(virtio-net) Allocated 1 RX and %d TX vq's\n",
+		 numtxqs);
+
 	vi->status = VIRTIO_NET_S_LINK_UP;
 	virtnet_update_status(vi);
 	netif_carrier_on(dev);
@@ -998,7 +1098,8 @@ unregister:
 	cancel_delayed_work_sync(&vi->refill);
 free_vqs:
 	vdev->config->del_vqs(vdev);
-free:
+	kfree(vi->sq);
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
@@ -1006,11 +1107,17 @@ free:
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
+	int i;
+
+	for (i = 0; i < vi->numtxqs; i++) {
+		struct virtqueue *svq = vi->sq[i].svq;
+
+		while (1) {
+			buf = virtqueue_detach_unused_buf(svq);
+			if (!buf)
+				break;
+			dev_kfree_skb(buf);
+		}
 	}
 	while (1) {
 		buf = virtqueue_detach_unused_buf(vi->rvq);
@@ -1059,7 +1166,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
-	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
 };
 
 static struct virtio_driver virtio_net_driver = {


* [v2 RFC PATCH 3/4] Changes for vhost
From: Krishna Kumar @ 2010-09-17 10:03 UTC
  To: rusty, davem, mst; +Cc: kvm, arnd, netdev, avi, anthony, Krishna Kumar

Changes for mq vhost.

vhost_net_open is changed to allocate a vhost_net and
return.  The remaining initialization is delayed until
SET_OWNER.  SET_OWNER is changed so that its argument
is used to figure out how many txqs to use.  Unmodified
qemus will pass NULL, which is recognized and handled
as numtxqs=1.

Besides changing handle_tx to use 'vq', this patch also
changes handle_rx to take the vq as a parameter.  The mq RX
patch requires this change, but until then it is consistent
(and less confusing) to make the interfaces for handling
rx and tx similar.
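
The virtqueues share the (at most two) vhost worker threads in
round-robin order.  A sketch of the assignment done in
vhost_dev_init() below, for MAX_VHOST_THREADS=2 and numtxqs=4
(i.e. nvqs=5):

	num_threads = vhost_get_num_threads(nvqs); /* min(nvqs - 1, 2) */
	/* vq i (0 = RX, 1..numtxqs = TX) uses worker i % num_threads:
	 * RX -> worker 0, TX[0] -> worker 1, TX[1] -> worker 0,
	 * TX[2] -> worker 1, TX[3] -> worker 0
	 */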

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 drivers/vhost/net.c   |  273 ++++++++++++++++++++++++++--------------
 drivers/vhost/vhost.c |  186 +++++++++++++++++++--------
 drivers/vhost/vhost.h |   16 +-
 3 files changed, 324 insertions(+), 151 deletions(-)

diff -ruNp org2/drivers/vhost/vhost.h tx_only2/drivers/vhost/vhost.h
--- org2/drivers/vhost/vhost.h	2010-08-03 08:49:31.000000000 +0530
+++ tx_only2/drivers/vhost/vhost.h	2010-09-16 15:24:01.000000000 +0530
@@ -40,11 +40,11 @@ struct vhost_poll {
 	wait_queue_t              wait;
 	struct vhost_work	  work;
 	unsigned long		  mask;
-	struct vhost_dev	 *dev;
+	struct vhost_virtqueue	  *vq;  /* points back to vq */
 };
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev);
+		     unsigned long mask, struct vhost_virtqueue *vq);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -110,6 +110,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	struct task_struct *worker; /* vhost for this vq, shared between RX/TX */
+	spinlock_t *work_lock;
+	struct list_head *work_list;
+	int qnum;	/* 0 for RX, 1 -> n-1 for TX */
 };
 
 struct vhost_dev {
@@ -124,11 +128,13 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
-	spinlock_t work_lock;
-	struct list_head work_list;
-	struct task_struct *worker;
+	spinlock_t *work_lock;
+	struct list_head *work_list;
 };
 
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
+void vhost_free_vqs(struct vhost_dev *dev);
+int vhost_get_num_threads(int nvqs);
 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
 long vhost_dev_check_owner(struct vhost_dev *);
 long vhost_dev_reset_owner(struct vhost_dev *);
diff -ruNp org2/drivers/vhost/net.c tx_only2/drivers/vhost/net.c
--- org2/drivers/vhost/net.c	2010-08-05 14:48:06.000000000 +0530
+++ tx_only2/drivers/vhost/net.c	2010-09-16 15:24:01.000000000 +0530
@@ -33,12 +33,6 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
-enum {
-	VHOST_NET_VQ_RX = 0,
-	VHOST_NET_VQ_TX = 1,
-	VHOST_NET_VQ_MAX = 2,
-};
-
 enum vhost_net_poll_state {
 	VHOST_NET_POLL_DISABLED = 0,
 	VHOST_NET_POLL_STARTED = 1,
@@ -47,12 +41,13 @@ enum vhost_net_poll_state {
 
 struct vhost_net {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
-	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct vhost_virtqueue *vqs;
+	struct vhost_poll *poll;
+	struct socket **socks;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
-	enum vhost_net_poll_state tx_poll_state;
+	enum vhost_net_poll_state *tx_poll_state;
 };
 
 /* Pop first len bytes from iovec. Return number of segments used. */
@@ -92,28 +87,28 @@ static void copy_iovec_hdr(const struct 
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
 {
-	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+	if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
 		return;
-	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
-	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+	vhost_poll_stop(&net->poll[qnum]);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
 }
 
 /* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
 {
-	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+	if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
 		return;
-	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
-	net->tx_poll_state = VHOST_NET_POLL_STARTED;
+	vhost_poll_start(&net->poll[qnum], sock->file);
+	net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
 }
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -134,7 +129,7 @@ static void handle_tx(struct vhost_net *
 	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 	if (wmem >= sock->sk->sk_sndbuf) {
 		mutex_lock(&vq->mutex);
-		tx_poll_start(net, sock);
+		tx_poll_start(net, sock, vq->qnum);
 		mutex_unlock(&vq->mutex);
 		return;
 	}
@@ -144,7 +139,7 @@ static void handle_tx(struct vhost_net *
 	vhost_disable_notify(vq);
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
-		tx_poll_stop(net);
+		tx_poll_stop(net, vq->qnum);
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
@@ -159,7 +154,7 @@ static void handle_tx(struct vhost_net *
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
+				tx_poll_start(net, sock, vq->qnum);
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
@@ -189,7 +184,7 @@ static void handle_tx(struct vhost_net *
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			vhost_discard_vq_desc(vq, 1);
-			tx_poll_start(net, sock);
+			tx_poll_start(net, sock, vq->qnum);
 			break;
 		}
 		if (err != len)
@@ -282,9 +277,9 @@ err:
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
+static void handle_rx_big(struct vhost_virtqueue *vq,
+			  struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned out, in, log, s;
 	int head;
 	struct vhost_log *vq_log;
@@ -393,9 +388,9 @@ static void handle_rx_big(struct vhost_n
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
-static void handle_rx_mergeable(struct vhost_net *net)
+static void handle_rx_mergeable(struct vhost_virtqueue *vq,
+				struct vhost_net *net)
 {
-	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 	unsigned uninitialized_var(in), log;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -500,96 +495,181 @@ static void handle_rx_mergeable(struct v
 	unuse_mm(net->dev.mm);
 }
 
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_virtqueue *vq)
 {
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
-		handle_rx_mergeable(net);
+		handle_rx_mergeable(vq, net);
 	else
-		handle_rx_big(net);
+		handle_rx_big(vq, net);
 }
 
 static void handle_tx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_tx(net);
+	handle_tx(vq);
 }
 
 static void handle_rx_kick(struct vhost_work *work)
 {
 	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
 						  poll.work);
-	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
 
-	handle_rx(net);
+	handle_rx(vq);
 }
 
 static void handle_tx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_TX].work);
-	handle_tx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_tx(vq);
 }
 
 static void handle_rx_net(struct vhost_work *work)
 {
-	struct vhost_net *net = container_of(work, struct vhost_net,
-					     poll[VHOST_NET_VQ_RX].work);
-	handle_rx(net);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+						  work)->vq;
+
+	handle_rx(vq);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
+void vhost_free_vqs(struct vhost_dev *dev)
 {
-	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
-	struct vhost_dev *dev;
-	int r;
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
 
-	if (!n)
-		return -ENOMEM;
+	kfree(dev->work_list);
+	kfree(dev->work_lock);
+	kfree(n->socks);
+	kfree(n->tx_poll_state);
+	kfree(n->poll);
+	kfree(n->vqs);
 
-	dev = &n->dev;
-	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
-	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
-	if (r < 0) {
-		kfree(n);
-		return r;
+	/*
+	 * Reset so that vhost_net_release (after vhost_dev_set_owner call)
+	 * will notice.
+	 */
+	n->vqs = NULL;
+	n->poll = NULL;
+	n->socks = NULL;
+	n->tx_poll_state = NULL;
+	dev->work_lock = NULL;
+	dev->work_list = NULL;
+}
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
+{
+	struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+	int num_threads;
+	int i, nvqs;
+	int ret;
+
+	if (numtxqs < 0)
+		return -EINVAL;
+
+	if (numtxqs == 0) {
+		/* Old qemu doesn't pass arguments to set_owner, use 1 txq */
+		numtxqs = 1;
+	}
+
+	/* Total number of virtqueues is 1 + numtxqs */
+	nvqs = numtxqs + 1;
+
+	/* Total number of vhost threads */
+	num_threads = vhost_get_num_threads(nvqs);
+
+	n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
+	n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
+	n->socks = kmalloc(nvqs * sizeof(*n->socks), GFP_KERNEL);
+	n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
+				   GFP_KERNEL);
+	dev->work_lock = kmalloc(num_threads * sizeof(*dev->work_lock),
+				 GFP_KERNEL);
+	dev->work_list = kmalloc(num_threads * sizeof(*dev->work_list),
+				 GFP_KERNEL);
+
+	if (!n->vqs || !n->poll || !n->socks || !n->tx_poll_state ||
+	    !dev->work_lock || !dev->work_list) {
+		ret = -ENOMEM;
+		goto err;
 	}
 
-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
-	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	/* 1 RX, followed by 'numtxqs' TX queues */
+	n->vqs[0].handle_kick = handle_rx_kick;
 
-	f->private_data = n;
+	for (i = 1; i < nvqs; i++)
+		n->vqs[i].handle_kick = handle_tx_kick;
+
+	ret = vhost_dev_init(dev, n->vqs, nvqs);
+	if (ret < 0)
+		goto err;
+
+	vhost_poll_init(&n->poll[0], handle_rx_net, POLLIN, &n->vqs[0]);
+
+	for (i = 1; i < nvqs; i++) {
+		vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
+				&n->vqs[i]);
+		n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
+	}
 
 	return 0;
+
+err:
+	/* Free all pointers that may have been allocated */
+	vhost_free_vqs(dev);
+
+	return ret;
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+	struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+	int ret = -ENOMEM;
+
+	if (n) {
+		struct vhost_dev *dev = &n->dev;
+
+		f->private_data = n;
+		mutex_init(&dev->mutex);
+
+		/* Defer all other initialization till user does SET_OWNER */
+		ret = 0;
+	}
+
+	return ret;
 }
 
 static void vhost_net_disable_vq(struct vhost_net *n,
 				 struct vhost_virtqueue *vq)
 {
+	int qnum = vq->qnum;
+
 	if (!vq->private_data)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		tx_poll_stop(n);
-		n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	if (qnum) {	/* TX */
+		tx_poll_stop(n, qnum);
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
 	} else
-		vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+		vhost_poll_stop(&n->poll[qnum]);
 }
 
 static void vhost_net_enable_vq(struct vhost_net *n,
 				struct vhost_virtqueue *vq)
 {
 	struct socket *sock = vq->private_data;
+	int qnum = vq->qnum;
+
 	if (!sock)
 		return;
-	if (vq == n->vqs + VHOST_NET_VQ_TX) {
-		n->tx_poll_state = VHOST_NET_POLL_STOPPED;
-		tx_poll_start(n, sock);
+
+	if (qnum) {	/* TX */
+		n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
+		tx_poll_start(n, sock, qnum);
 	} else
-		vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+		vhost_poll_start(&n->poll[qnum], sock->file);
 }
 
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
@@ -605,11 +685,12 @@ static struct socket *vhost_net_stop_vq(
 	return sock;
 }
 
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
-			   struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n)
 {
-	*tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
-	*rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		n->socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
 }
 
 static void vhost_net_flush_vq(struct vhost_net *n, int index)
@@ -620,26 +701,33 @@ static void vhost_net_flush_vq(struct vh
 
 static void vhost_net_flush(struct vhost_net *n)
 {
-	vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
-	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
+	int i;
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		vhost_net_flush_vq(n, i);
 }
 
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
-	struct socket *tx_sock;
-	struct socket *rx_sock;
+	struct vhost_dev *dev = &n->dev;
+	int i;
 
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	vhost_net_stop(n);
 	vhost_net_flush(n);
-	vhost_dev_cleanup(&n->dev);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+	vhost_dev_cleanup(dev);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+
+	/* Free all old pointers */
+	vhost_free_vqs(dev);
+
 	kfree(n);
 	return 0;
 }
@@ -717,7 +805,7 @@ static long vhost_net_set_backend(struct
 	if (r)
 		goto err;
 
-	if (index >= VHOST_NET_VQ_MAX) {
+	if (index >= n->dev.nvqs) {
 		r = -ENOBUFS;
 		goto err;
 	}
@@ -762,22 +850,25 @@ err:
 
 static long vhost_net_reset_owner(struct vhost_net *n)
 {
-	struct socket *tx_sock = NULL;
-	struct socket *rx_sock = NULL;
 	long err;
+	int i;
+
 	mutex_lock(&n->dev.mutex);
 	err = vhost_dev_check_owner(&n->dev);
-	if (err)
-		goto done;
-	vhost_net_stop(n, &tx_sock, &rx_sock);
+	if (err) {
+		mutex_unlock(&n->dev.mutex);
+		return err;
+	}
+
+	vhost_net_stop(n);
 	vhost_net_flush(n);
 	err = vhost_dev_reset_owner(&n->dev);
-done:
 	mutex_unlock(&n->dev.mutex);
-	if (tx_sock)
-		fput(tx_sock->file);
-	if (rx_sock)
-		fput(rx_sock->file);
+
+	for (i = n->dev.nvqs - 1; i >= 0; i--)
+		if (n->socks[i])
+			fput(n->socks[i]->file);
+
 	return err;
 }
 
@@ -806,7 +897,7 @@ static int vhost_net_set_features(struct
 	}
 	n->dev.acked_features = features;
 	smp_wmb();
-	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+	for (i = 0; i < n->dev.nvqs; ++i) {
 		mutex_lock(&n->vqs[i].mutex);
 		n->vqs[i].vhost_hlen = vhost_hlen;
 		n->vqs[i].sock_hlen = sock_hlen;
diff -ruNp org2/drivers/vhost/vhost.c tx_only2/drivers/vhost/vhost.c
--- org2/drivers/vhost/vhost.c	2010-09-10 16:34:07.000000000 +0530
+++ tx_only2/drivers/vhost/vhost.c	2010-09-16 16:35:29.000000000 +0530
@@ -71,12 +71,12 @@ static void vhost_work_init(struct vhost
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-		     unsigned long mask, struct vhost_dev *dev)
+		     unsigned long mask, struct vhost_virtqueue *vq)
 {
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
-	poll->dev = dev;
+	poll->vq = vq;
 
 	vhost_work_init(&poll->work, fn);
 }
@@ -98,25 +98,25 @@ void vhost_poll_stop(struct vhost_poll *
 	remove_wait_queue(poll->wqh, &poll->wait);
 }
 
-static void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work)
+static void vhost_work_flush(struct vhost_poll *poll, struct vhost_work *work)
 {
 	unsigned seq;
 	int left;
 	int flushing;
 
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	seq = work->queue_seq;
 	work->flushing++;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	wait_event(work->done, ({
-		   spin_lock_irq(&dev->work_lock);
+		   spin_lock_irq(poll->vq->work_lock);
 		   left = seq - work->done_seq <= 0;
-		   spin_unlock_irq(&dev->work_lock);
+		   spin_unlock_irq(poll->vq->work_lock);
 		   left;
 	}));
-	spin_lock_irq(&dev->work_lock);
+	spin_lock_irq(poll->vq->work_lock);
 	flushing = --work->flushing;
-	spin_unlock_irq(&dev->work_lock);
+	spin_unlock_irq(poll->vq->work_lock);
 	BUG_ON(flushing < 0);
 }
 
@@ -124,26 +124,26 @@ static void vhost_work_flush(struct vhos
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	vhost_work_flush(poll->dev, &poll->work);
+	vhost_work_flush(poll, &poll->work);
 }
 
-static inline void vhost_work_queue(struct vhost_dev *dev,
+static inline void vhost_work_queue(struct vhost_virtqueue *vq,
 				    struct vhost_work *work)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&dev->work_lock, flags);
+	spin_lock_irqsave(vq->work_lock, flags);
 	if (list_empty(&work->node)) {
-		list_add_tail(&work->node, &dev->work_list);
+		list_add_tail(&work->node, vq->work_list);
 		work->queue_seq++;
-		wake_up_process(dev->worker);
+		wake_up_process(vq->worker);
 	}
-	spin_unlock_irqrestore(&dev->work_lock, flags);
+	spin_unlock_irqrestore(vq->work_lock, flags);
 }
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	vhost_work_queue(poll->dev, &poll->work);
+	vhost_work_queue(poll->vq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -174,7 +174,7 @@ static void vhost_vq_reset(struct vhost_
 
 static int vhost_worker(void *data)
 {
-	struct vhost_dev *dev = data;
+	struct vhost_virtqueue *vq = data;
 	struct vhost_work *work = NULL;
 	unsigned uninitialized_var(seq);
 
@@ -182,7 +182,7 @@ static int vhost_worker(void *data)
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		spin_lock_irq(&dev->work_lock);
+		spin_lock_irq(vq->work_lock);
 		if (work) {
 			work->done_seq = seq;
 			if (work->flushing)
@@ -190,18 +190,18 @@ static int vhost_worker(void *data)
 		}
 
 		if (kthread_should_stop()) {
-			spin_unlock_irq(&dev->work_lock);
+			spin_unlock_irq(vq->work_lock);
 			__set_current_state(TASK_RUNNING);
 			return 0;
 		}
-		if (!list_empty(&dev->work_list)) {
-			work = list_first_entry(&dev->work_list,
+		if (!list_empty(vq->work_list)) {
+			work = list_first_entry(vq->work_list,
 						struct vhost_work, node);
 			list_del_init(&work->node);
 			seq = work->queue_seq;
 		} else
 			work = NULL;
-		spin_unlock_irq(&dev->work_lock);
+		spin_unlock_irq(vq->work_lock);
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
@@ -212,9 +212,22 @@ static int vhost_worker(void *data)
 	}
 }
 
+/*
+ * Maximum number of vhost threads to handle RX/TX. First thread handles
+ * RX, next threads handle TX[0-n], where the number of threads is bound
+ * by MAX_VHOST_THREADS.
+ */
+#define MAX_VHOST_THREADS	2
+
+int vhost_get_num_threads(int nvqs)
+{
+	return min_t(int, nvqs - 1, MAX_VHOST_THREADS);
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
+	int num_threads = vhost_get_num_threads(nvqs);
 	int i;
 
 	dev->vqs = vqs;
@@ -224,17 +237,32 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
-	spin_lock_init(&dev->work_lock);
-	INIT_LIST_HEAD(&dev->work_list);
-	dev->worker = NULL;
 
 	for (i = 0; i < dev->nvqs; ++i) {
-		dev->vqs[i].dev = dev;
-		mutex_init(&dev->vqs[i].mutex);
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		if (i < num_threads) {
+			spin_lock_init(&dev->work_lock[i]);
+			INIT_LIST_HEAD(&dev->work_list[i]);
+
+			vq->work_lock = &dev->work_lock[i];
+			vq->work_list = &dev->work_list[i];
+		} else {
+			int j = i % num_threads;
+
+			/* Share work with another RX/TX thread */
+			vq->work_lock = &dev->work_lock[j];
+			vq->work_list = &dev->work_list[j];
+		}
+
+		vq->worker = NULL;
+		vq->dev = dev;
+		vq->qnum = i;
+		mutex_init(&vq->mutex);
 		vhost_vq_reset(dev, dev->vqs + i);
-		if (dev->vqs[i].handle_kick)
-			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick, POLLIN, dev);
+		if (vq->handle_kick)
+			vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN,
+					vq);
 	}
 
 	return 0;
@@ -260,46 +288,98 @@ static void vhost_attach_cgroups_work(st
         s->ret = cgroup_attach_task_all(s->owner, current);
 }
 
-static int vhost_attach_cgroups(struct vhost_dev *dev)
+static int vhost_attach_cgroups(struct vhost_virtqueue *vq)
 {
         struct vhost_attach_cgroups_struct attach;
         attach.owner = current;
         vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-        vhost_work_queue(dev, &attach.work);
-        vhost_work_flush(dev, &attach.work);
+        vhost_work_queue(vq, &attach.work);
+        vhost_work_flush(&vq->poll, &attach.work);
         return attach.ret;
 }
 
+static void __vhost_stop_workers(struct vhost_dev *dev, int num_threads)
+{
+	int i;
+
+	for (i = 0; i < num_threads; i++) {
+		WARN_ON(!list_empty(dev->vqs[i].work_list));
+		if (dev->vqs[i].worker) {
+			kthread_stop(dev->vqs[i].worker);
+			dev->vqs[i].worker = NULL;
+		}
+	}
+}
+
+static void vhost_stop_workers(struct vhost_dev *dev)
+{
+	__vhost_stop_workers(dev, vhost_get_num_threads(dev->nvqs));
+}
+
+static int vhost_start_workers(struct vhost_dev *dev)
+{
+	int num_threads = vhost_get_num_threads(dev->nvqs);
+	int i, err;
+
+	for (i = 0; i < dev->nvqs; ++i) {
+		struct vhost_virtqueue *vq = &dev->vqs[i];
+
+		if (i < num_threads) {
+			vq->worker = kthread_create(vhost_worker, vq,
+						    "vhost-%d-%d",
+						    current->pid, i);
+			if (IS_ERR(vq->worker)) {
+				i--;	/* no thread to clean at this index */
+				err = PTR_ERR(vq->worker);
+				goto err;
+			}
+
+			wake_up_process(vq->worker);
+
+			/* avoid contributing to loadavg */
+			err = vhost_attach_cgroups(vq);
+			if (err)
+				goto err;
+		} else {
+			int j = i % num_threads;
+			struct vhost_virtqueue *share_vq = &dev->vqs[j];
+
+			vq->worker = share_vq->worker;
+		}
+	}
+	return 0;
+
+err:
+	__vhost_stop_workers(dev, i);
+	return err;
+}
+
 /* Caller should have device mutex */
-static long vhost_dev_set_owner(struct vhost_dev *dev)
+static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
 {
-	struct task_struct *worker;
 	int err;
 	/* Is there an owner already? */
 	if (dev->mm) {
 		err = -EBUSY;
 		goto err_mm;
 	}
+
+	err = vhost_setup_vqs(dev, numtxqs);
+	if (err)
+		goto err_mm;
+
 	/* No owner, become one */
 	dev->mm = get_task_mm(current);
-	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker)) {
-		err = PTR_ERR(worker);
-		goto err_worker;
-	}
-
-	dev->worker = worker;
-	wake_up_process(worker);	/* avoid contributing to loadavg */
 
-	err = vhost_attach_cgroups(dev);
+	/* Start threads */
+	err =  vhost_start_workers(dev);
 	if (err)
-		goto err_cgroup;
+		goto free_vqs;
 
 	return 0;
-err_cgroup:
-	kthread_stop(worker);
-	dev->worker = NULL;
-err_worker:
+
+free_vqs:
+	vhost_free_vqs(dev);
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
@@ -358,11 +438,7 @@ void vhost_dev_cleanup(struct vhost_dev 
 		mmput(dev->mm);
 	dev->mm = NULL;
 
-	WARN_ON(!list_empty(&dev->work_list));
-	if (dev->worker) {
-		kthread_stop(dev->worker);
-		dev->worker = NULL;
-	}
+	vhost_stop_workers(dev);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -713,7 +789,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
 
 	/* If you are not the owner, you can become one */
 	if (ioctl == VHOST_SET_OWNER) {
-		r = vhost_dev_set_owner(d);
+		r = vhost_dev_set_owner(d, arg);
 		goto done;
 	}
 


* [v2 RFC PATCH 4/4] qemu changes
From: Krishna Kumar @ 2010-09-17 10:03 UTC
  To: rusty, davem, mst; +Cc: kvm, arnd, netdev, avi, anthony, Krishna Kumar

Changes in qemu to support mq TX.
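
At its core (a sketch, not the literal diff below), the device model
turns the single TX virtqueue and its timer state into per-queue
arrays sized by numtxqs, roughly:

    n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
    for (i = 0; i < n->numtxqs; i++)
        n->tx_vq[i] = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);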

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 hw/vhost.c      |    8 ++-
 hw/vhost.h      |    2 
 hw/vhost_net.c  |   16 +++++--
 hw/vhost_net.h  |    2 
 hw/virtio-net.c |   97 ++++++++++++++++++++++++++++++----------------
 hw/virtio-net.h |    2 
 hw/virtio-pci.c |    2 
 net.c           |   17 ++++++++
 net.h           |    1 
 net/tap.c       |   27 ++++++++++--
 10 files changed, 129 insertions(+), 45 deletions(-)

diff -ruNp org2/hw/vhost.c tx_only.rev2/hw/vhost.c
--- org2/hw/vhost.c	2010-08-09 09:51:58.000000000 +0530
+++ tx_only.rev2/hw/vhost.c	2010-09-16 16:23:56.000000000 +0530
@@ -599,23 +599,27 @@ static void vhost_virtqueue_cleanup(stru
                               0, virtio_queue_get_desc_size(vdev, idx));
 }
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs)
 {
     uint64_t features;
     int r;
     if (devfd >= 0) {
         hdev->control = devfd;
+        hdev->nvqs = 2;
     } else {
         hdev->control = open("/dev/vhost-net", O_RDWR);
         if (hdev->control < 0) {
             return -errno;
         }
     }
-    r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+
+    r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
     if (r < 0) {
         goto fail;
     }
 
+    hdev->nvqs = numtxqs + 1;
+
     r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
     if (r < 0) {
         goto fail;
diff -ruNp org2/hw/vhost.h tx_only.rev2/hw/vhost.h
--- org2/hw/vhost.h	2010-07-01 11:42:09.000000000 +0530
+++ tx_only.rev2/hw/vhost.h	2010-09-16 16:23:56.000000000 +0530
@@ -40,7 +40,7 @@ struct vhost_dev {
     unsigned long long log_size;
 };
 
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int nvqs);
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff -ruNp org2/hw/vhost_net.c tx_only.rev2/hw/vhost_net.c
--- org2/hw/vhost_net.c	2010-08-09 09:51:58.000000000 +0530
+++ tx_only.rev2/hw/vhost_net.c	2010-09-16 16:23:56.000000000 +0530
@@ -36,7 +36,8 @@
 
 struct vhost_net {
     struct vhost_dev dev;
-    struct vhost_virtqueue vqs[2];
+    struct vhost_virtqueue *vqs;
+    int nvqs;
     int backend;
     VLANClientState *vc;
 };
@@ -76,7 +77,8 @@ static int vhost_net_get_fd(VLANClientSt
     }
 }
 
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int numtxqs)
 {
     int r;
     struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -93,10 +95,14 @@ struct vhost_net *vhost_net_init(VLANCli
         (1 << VHOST_NET_F_VIRTIO_NET_HDR);
     net->backend = r;
 
-    r = vhost_dev_init(&net->dev, devfd);
+    r = vhost_dev_init(&net->dev, devfd, numtxqs);
     if (r < 0) {
         goto fail;
     }
+
+    net->nvqs = numtxqs + 1;
+    net->vqs = qemu_malloc(net->nvqs * (sizeof *net->vqs));
+
     if (~net->dev.features & net->dev.backend_features) {
         fprintf(stderr, "vhost lacks feature mask %" PRIu64 " for backend\n",
                 (uint64_t)(~net->dev.features & net->dev.backend_features));
@@ -118,7 +124,6 @@ int vhost_net_start(struct vhost_net *ne
     struct vhost_vring_file file = { };
     int r;
 
-    net->dev.nvqs = 2;
     net->dev.vqs = net->vqs;
     r = vhost_dev_start(&net->dev, dev);
     if (r < 0) {
@@ -166,7 +171,8 @@ void vhost_net_cleanup(struct vhost_net 
     qemu_free(net);
 }
 #else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+				 int nvqs)
 {
 	return NULL;
 }
diff -ruNp org2/hw/vhost_net.h tx_only.rev2/hw/vhost_net.h
--- org2/hw/vhost_net.h	2010-07-01 11:42:09.000000000 +0530
+++ tx_only.rev2/hw/vhost_net.h	2010-09-16 16:23:56.000000000 +0530
@@ -6,7 +6,7 @@
 struct vhost_net;
 typedef struct vhost_net VHostNetState;
 
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, int nvqs);
 
 int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
 void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
diff -ruNp org2/hw/virtio-net.c tx_only.rev2/hw/virtio-net.c
--- org2/hw/virtio-net.c	2010-07-19 12:41:28.000000000 +0530
+++ tx_only.rev2/hw/virtio-net.c	2010-09-16 16:23:56.000000000 +0530
@@ -32,17 +32,17 @@ typedef struct VirtIONet
     uint8_t mac[ETH_ALEN];
     uint16_t status;
     VirtQueue *rx_vq;
-    VirtQueue *tx_vq;
+    VirtQueue **tx_vq;
     VirtQueue *ctrl_vq;
     NICState *nic;
-    QEMUTimer *tx_timer;
-    int tx_timer_active;
+    QEMUTimer **tx_timer;
+    int *tx_timer_active;
     uint32_t has_vnet_hdr;
     uint8_t has_ufo;
     struct {
         VirtQueueElement elem;
         ssize_t len;
-    } async_tx;
+    } *async_tx;
     int mergeable_rx_bufs;
     uint8_t promisc;
     uint8_t allmulti;
@@ -61,6 +61,7 @@ typedef struct VirtIONet
     } mac_table;
     uint32_t *vlans;
     DeviceState *qdev;
+    uint16_t numtxqs;
 } VirtIONet;
 
 /* TODO
@@ -78,6 +79,7 @@ static void virtio_net_get_config(VirtIO
     struct virtio_net_config netcfg;
 
     netcfg.status = n->status;
+    netcfg.numtxqs = n->numtxqs;
     memcpy(netcfg.mac, n->mac, ETH_ALEN);
     memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -162,6 +164,8 @@ static uint32_t virtio_net_get_features(
     VirtIONet *n = to_virtio_net(vdev);
 
     features |= (1 << VIRTIO_NET_F_MAC);
+    if (n->numtxqs > 1)
+        features |= (1 << VIRTIO_NET_F_NUMTXQS);
 
     if (peer_has_vnet_hdr(n)) {
         tap_using_vnet_hdr(n->nic->nc.peer, 1);
@@ -625,13 +629,16 @@ static void virtio_net_tx_complete(VLANC
 {
     VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
 
-    virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
-    virtio_notify(&n->vdev, n->tx_vq);
+    /*
+     * If this function executes, we are single TX and hence use only txq[0]
+     */
+    virtqueue_push(n->tx_vq[0], &n->async_tx[0].elem, n->async_tx[0].len);
+    virtio_notify(&n->vdev, n->tx_vq[0]);
 
-    n->async_tx.elem.out_num = n->async_tx.len = 0;
+    n->async_tx[0].elem.out_num = n->async_tx[0].len = 0;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 /* TX */
@@ -642,8 +649,8 @@ static void virtio_net_flush_tx(VirtIONe
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    if (n->async_tx.elem.out_num) {
-        virtio_queue_set_notification(n->tx_vq, 0);
+    if (n->async_tx[0].elem.out_num) {
+        virtio_queue_set_notification(n->tx_vq[0], 0);
         return;
     }
 
@@ -678,9 +685,9 @@ static void virtio_net_flush_tx(VirtIONe
         ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
                                       virtio_net_tx_complete);
         if (ret == 0) {
-            virtio_queue_set_notification(n->tx_vq, 0);
-            n->async_tx.elem = elem;
-            n->async_tx.len  = len;
+            virtio_queue_set_notification(n->tx_vq[0], 0);
+            n->async_tx[0].elem = elem;
+            n->async_tx[0].len  = len;
             return;
         }
 
@@ -695,15 +702,15 @@ static void virtio_net_handle_tx(VirtIOD
 {
     VirtIONet *n = to_virtio_net(vdev);
 
-    if (n->tx_timer_active) {
+    if (n->tx_timer_active[0]) {
         virtio_queue_set_notification(vq, 1);
-        qemu_del_timer(n->tx_timer);
-        n->tx_timer_active = 0;
+        qemu_del_timer(n->tx_timer[0]);
+        n->tx_timer_active[0] = 0;
         virtio_net_flush_tx(n, vq);
     } else {
-        qemu_mod_timer(n->tx_timer,
+        qemu_mod_timer(n->tx_timer[0],
                        qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
-        n->tx_timer_active = 1;
+        n->tx_timer_active[0] = 1;
         virtio_queue_set_notification(vq, 0);
     }
 }
@@ -712,18 +719,19 @@ static void virtio_net_tx_timer(void *op
 {
     VirtIONet *n = opaque;
 
-    n->tx_timer_active = 0;
+    n->tx_timer_active[0] = 0;
 
     /* Just in case the driver is not ready on more */
     if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
         return;
 
-    virtio_queue_set_notification(n->tx_vq, 1);
-    virtio_net_flush_tx(n, n->tx_vq);
+    virtio_queue_set_notification(n->tx_vq[0], 1);
+    virtio_net_flush_tx(n, n->tx_vq[0]);
 }
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
+    int i;
     VirtIONet *n = opaque;
 
     if (n->vhost_started) {
@@ -735,7 +743,9 @@ static void virtio_net_save(QEMUFile *f,
     virtio_save(&n->vdev, f);
 
     qemu_put_buffer(f, n->mac, ETH_ALEN);
-    qemu_put_be32(f, n->tx_timer_active);
+    qemu_put_be16(f, n->numtxqs);
+    for (i = 0; i < n->numtxqs; i++)
+        qemu_put_be32(f, n->tx_timer_active[i]);
     qemu_put_be32(f, n->mergeable_rx_bufs);
     qemu_put_be16(f, n->status);
     qemu_put_byte(f, n->promisc);
@@ -764,7 +774,9 @@ static int virtio_net_load(QEMUFile *f, 
     virtio_load(&n->vdev, f);
 
     qemu_get_buffer(f, n->mac, ETH_ALEN);
-    n->tx_timer_active = qemu_get_be32(f);
+    n->numtxqs = qemu_get_be16(f);
+    for (i = 0; i < n->numtxqs; i++)
+        n->tx_timer_active[i] = qemu_get_be32(f);
     n->mergeable_rx_bufs = qemu_get_be32(f);
 
     if (version_id >= 3)
@@ -840,9 +852,10 @@ static int virtio_net_load(QEMUFile *f, 
     }
     n->mac_table.first_multi = i;
 
-    if (n->tx_timer_active) {
-        qemu_mod_timer(n->tx_timer,
-                       qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
+    for (i = 0; i < n->numtxqs; i++) {
+        if (n->tx_timer_active[i])
+            qemu_mod_timer(n->tx_timer[i],
+                           qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
     }
     return 0;
 }
@@ -905,12 +918,15 @@ static void virtio_net_vmstate_change(vo
 
 VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
 {
+    int i;
     VirtIONet *n;
 
     n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
                                         sizeof(struct virtio_net_config),
                                         sizeof(VirtIONet));
 
+    n->numtxqs = conf->peer->numtxqs;
+
     n->vdev.get_config = virtio_net_get_config;
     n->vdev.set_config = virtio_net_set_config;
     n->vdev.get_features = virtio_net_get_features;
@@ -918,8 +934,24 @@ VirtIODevice *virtio_net_init(DeviceStat
     n->vdev.bad_features = virtio_net_bad_features;
     n->vdev.reset = virtio_net_reset;
     n->vdev.set_status = virtio_net_set_status;
+
     n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
-    n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+    n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
+    n->tx_timer = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer));
+    n->tx_timer_active = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer_active));
+    n->async_tx = qemu_mallocz(n->numtxqs * sizeof(*n->async_tx));
+
+    /* Allocate per tx vq's */
+    for (i = 0; i < n->numtxqs; i++) {
+        n->tx_vq[i] = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+        /* setup timer per tx vq */
+        n->tx_timer[i] = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
+        n->tx_timer_active[i] = 0;
+    }
+
+    /* Allocate control vq */
     n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
     qemu_macaddr_default_if_unset(&conf->macaddr);
     memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
@@ -929,8 +961,6 @@ VirtIODevice *virtio_net_init(DeviceStat
 
     qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
 
-    n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
-    n->tx_timer_active = 0;
     n->mergeable_rx_bufs = 0;
     n->promisc = 1; /* for compatibility */
 
@@ -948,6 +978,7 @@ VirtIODevice *virtio_net_init(DeviceStat
 
 void virtio_net_exit(VirtIODevice *vdev)
 {
+    int i;
     VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
     qemu_del_vm_change_state_handler(n->vmstate);
 
@@ -962,8 +993,10 @@ void virtio_net_exit(VirtIODevice *vdev)
     qemu_free(n->mac_table.macs);
     qemu_free(n->vlans);
 
-    qemu_del_timer(n->tx_timer);
-    qemu_free_timer(n->tx_timer);
+    for (i = 0; i < n->numtxqs; i++) {
+        qemu_del_timer(n->tx_timer[i]);
+        qemu_free_timer(n->tx_timer[i]);
+    }
 
     virtio_cleanup(&n->vdev);
     qemu_del_vlan_client(&n->nic->nc);
diff -ruNp org2/hw/virtio-net.h tx_only.rev2/hw/virtio-net.h
--- org2/hw/virtio-net.h	2010-07-01 11:42:09.000000000 +0530
+++ tx_only.rev2/hw/virtio-net.h	2010-09-16 16:23:56.000000000 +0530
@@ -44,6 +44,7 @@
 #define VIRTIO_NET_F_CTRL_RX    18      /* Control channel RX mode support */
 #define VIRTIO_NET_F_CTRL_VLAN  19      /* Control channel VLAN filtering */
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20   /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS    21      /* Supports multiple TX queues */
 
 #define VIRTIO_NET_S_LINK_UP    1       /* Link is up */
 
@@ -58,6 +59,7 @@ struct virtio_net_config
     uint8_t mac[ETH_ALEN];
     /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
     uint16_t status;
+    uint16_t numtxqs;	/* number of transmit queues */
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
diff -ruNp org2/hw/virtio-pci.c tx_only.rev2/hw/virtio-pci.c
--- org2/hw/virtio-pci.c	2010-09-08 12:46:36.000000000 +0530
+++ tx_only.rev2/hw/virtio-pci.c	2010-09-16 16:23:56.000000000 +0530
@@ -99,6 +99,7 @@ typedef struct {
     uint32_t addr;
     uint32_t class_code;
     uint32_t nvectors;
+    uint32_t mq;
     BlockConf block;
     NICConf nic;
     uint32_t host_features;
@@ -722,6 +723,7 @@ static PCIDeviceInfo virtio_info[] = {
         .romfile    = "pxe-virtio.bin",
         .qdev.props = (Property[]) {
             DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+	    DEFINE_PROP_UINT32("mq", VirtIOPCIProxy, mq, 1),
             DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
             DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
             DEFINE_PROP_END_OF_LIST(),
diff -ruNp org2/net/tap.c tx_only.rev2/net/tap.c
--- org2/net/tap.c	2010-07-01 11:42:09.000000000 +0530
+++ tx_only.rev2/net/tap.c	2010-09-16 16:23:56.000000000 +0530
@@ -299,13 +299,14 @@ static NetClientInfo net_tap_info = {
 static TAPState *net_tap_fd_init(VLANState *vlan,
                                  const char *model,
                                  const char *name,
-                                 int fd,
+                                 int fd, int numtxqs,
                                  int vnet_hdr)
 {
     VLANClientState *nc;
     TAPState *s;
 
     nc = qemu_new_net_client(&net_tap_info, vlan, NULL, model, name);
+    nc->numtxqs = numtxqs;
 
     s = DO_UPCAST(TAPState, nc, nc);
 
@@ -403,6 +404,24 @@ int net_init_tap(QemuOpts *opts, Monitor
 {
     TAPState *s;
     int fd, vnet_hdr = 0;
+    int vhost;
+    int numtxqs = 1;
+
+    vhost = qemu_opt_get_bool(opts, "vhost", 0);
+
+    /*
+     * We support multiple tx queues if:
+     *      1. smp > 1
+     *      2. vhost=on
+     *      3. mq=on
+     * In this case, #txqueues = #cpus. This value can be changed by
+     * using the "numtxqs" option.
+     */
+    if (vhost && smp_cpus > 1) {
+        if (qemu_opt_get_bool(opts, "mq", 0)) {
+            numtxqs = qemu_opt_get_number(opts, "numtxqs", smp_cpus);
+        }
+    }
 
     if (qemu_opt_get(opts, "fd")) {
         if (qemu_opt_get(opts, "ifname") ||
@@ -436,7 +455,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
+    s = net_tap_fd_init(vlan, "tap", name, fd, numtxqs, vnet_hdr);
     if (!s) {
         close(fd);
         return -1;
@@ -465,7 +484,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         }
     }
 
-    if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+    if (vhost) {
         int vhostfd, r;
         if (qemu_opt_get(opts, "vhostfd")) {
             r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
@@ -476,7 +495,7 @@ int net_init_tap(QemuOpts *opts, Monitor
         } else {
             vhostfd = -1;
         }
-        s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd, numtxqs);
         if (!s->vhost_net) {
             error_report("vhost-net requested but could not be initialized");
             return -1;
diff -ruNp org2/net.c tx_only.rev2/net.c
--- org2/net.c	2010-09-08 12:46:36.000000000 +0530
+++ tx_only.rev2/net.c	2010-09-16 16:23:56.000000000 +0530
@@ -814,6 +814,15 @@ static int net_init_nic(QemuOpts *opts,
         return -1;
     }
 
+    if (nd->netdev->numtxqs > 1 && nd->nvectors == DEV_NVECTORS_UNSPECIFIED) {
+        /*
+         * User specified mq for guest, but no "vectors=", tune
+         * it automatically to 'numtxqs' TX + 1 RX + 1 controlq.
+         */
+        nd->nvectors = nd->netdev->numtxqs + 1 + 1;
+        monitor_printf(mon, "nvectors tuned to %d\n", nd->nvectors);
+    }
+
     nd->used = 1;
     nb_nics++;
 
@@ -957,6 +966,14 @@ static const struct {
             },
 #ifndef _WIN32
             {
+                .name = "mq",
+                .type = QEMU_OPT_BOOL,
+                .help = "enable multiqueue on network i/f",
+            }, {
+                .name = "numtxqs",
+                .type = QEMU_OPT_NUMBER,
+                .help = "optional number of TX queues, if mq is enabled",
+            }, {
                 .name = "fd",
                 .type = QEMU_OPT_STRING,
                 .help = "file descriptor of an already opened tap",
diff -ruNp org2/net.h tx_only.rev2/net.h
--- org2/net.h	2010-07-01 11:42:09.000000000 +0530
+++ tx_only.rev2/net.h	2010-09-16 16:23:56.000000000 +0530
@@ -62,6 +62,7 @@ struct VLANClientState {
     struct VLANState *vlan;
     VLANClientState *peer;
     NetQueue *send_queue;
+    int numtxqs;
     char *model;
     char *name;
     char info_str[256];


* Re: [v2 RFC PATCH 2/4] Changes for virtio-net
  2010-09-17 10:03 ` [v2 RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
@ 2010-09-17 10:25   ` Eric Dumazet
  2010-09-17 12:27     ` Krishna Kumar2
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2010-09-17 10:25 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, mst, kvm, arnd, netdev, avi, anthony

On Friday 17 September 2010 at 15:33 +0530, Krishna Kumar wrote:
> Implement mq virtio-net driver. 
> 
> Though struct virtio_net_config changes, it works with old
> qemu's since the last element is not accessed, unless qemu
> sets VIRTIO_NET_F_NUMTXQS.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
>  drivers/net/virtio_net.c   |  213 ++++++++++++++++++++++++++---------
>  include/linux/virtio_net.h |    3 
>  2 files changed, 163 insertions(+), 53 deletions(-)
> 
> diff -ruNp org2/include/linux/virtio_net.h tx_only2/include/linux/virtio_net.h
> --- org2/include/linux/virtio_net.h	2010-02-10 13:20:27.000000000 +0530
> +++ tx_only2/include/linux/virtio_net.h	2010-09-16 15:24:01.000000000 +0530
> @@ -26,6 +26,7 @@
>  #define VIRTIO_NET_F_CTRL_RX	18	/* Control channel RX mode support */
>  #define VIRTIO_NET_F_CTRL_VLAN	19	/* Control channel VLAN filtering */
>  #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
> +#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queue */
>  
>  #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
>  
> @@ -34,6 +35,8 @@ struct virtio_net_config {
>  	__u8 mac[6];
>  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>  	__u16 status;
> +	/* number of transmit queues */
> +	__u16 numtxqs;
>  } __attribute__((packed));
>  
>  /* This is the first element of the scatter-gather list.  If you don't
> diff -ruNp org2/drivers/net/virtio_net.c tx_only2/drivers/net/virtio_net.c
> --- org2/drivers/net/virtio_net.c	2010-07-08 12:54:32.000000000 +0530
> +++ tx_only2/drivers/net/virtio_net.c	2010-09-16 15:24:01.000000000 +0530
> @@ -40,9 +40,20 @@ module_param(gso, bool, 0444);
>  
>  #define VIRTNET_SEND_COMMAND_SG_MAX    2
>  
> +/* Our representation of a send virtqueue */
> +struct send_queue {
> +	struct virtqueue *svq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> +};

You probably want ____cacheline_aligned_in_smp


> +
>  struct virtnet_info {
>  	struct virtio_device *vdev;
> -	struct virtqueue *rvq, *svq, *cvq;
> +	int numtxqs;			/* Number of tx queues */
> +	struct send_queue *sq;
> +	struct virtqueue *rvq;
> +	struct virtqueue *cvq;
>  	struct net_device *dev;

struct napi will probably be dirtied by RX processing.

You should make sure it doesn't dirty the cache line of the above
(read-mostly) fields.
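Roughly like this, as a sketch only (where to put the alignment is my
suggestion, not something from the patch):

    /* Keep each send queue on its own cache line(s). */
    struct send_queue {
            struct virtqueue *svq;

            /* TX: fragments + linear part + virtio header */
            struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
    } ____cacheline_aligned_in_smp;

    struct virtnet_info {
            /* read-mostly fields, set up at probe time */
            struct virtio_device *vdev;
            int numtxqs;
            struct send_queue *sq;
            struct virtqueue *rvq, *cvq;
            struct net_device *dev;

            /* dirtied by RX processing; start on a fresh cache line */
            struct napi_struct napi ____cacheline_aligned_in_smp;
            unsigned int status;
            /* ... remaining fields unchanged ... */
    };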


>  	struct napi_struct napi;
>  	unsigned int status;
> @@ -62,9 +73,8 @@ struct virtnet_info {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> -	/* fragments + linear part + virtio header */
> +	/* RX: fragments + linear part + virtio header */
>  	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>  };
>  
>  struct skb_vnet_hdr {
> @@ -120,12 +130,13 @@ static struct page *get_a_page(struct vi
>  static void skb_xmit_done(struct virtqueue *svq)
>  {
>  	struct virtnet_info *vi = svq->vdev->priv;
> +	int qnum = svq->queue_index - 1;	/* 0 is RX vq */
>  
>  	/* Suppress further interrupts. */
>  	virtqueue_disable_cb(svq);
>  
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	netif_wake_subqueue(vi->dev, qnum);
>  }
>  
>  static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -495,12 +506,13 @@ again:
>  	return received;
>  }
>  
> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
> +				       struct virtqueue *svq)
>  {
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
>  
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
> @@ -510,7 +522,8 @@ static unsigned int free_old_xmit_skbs(s
>  	return tot_sgs;
>  }
>  
> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
> +		    struct virtqueue *svq, struct scatterlist *tx_sg)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -548,12 +561,12 @@ static int xmit_skb(struct virtnet_info 
>  
>  	/* Encode metadata header at front. */
>  	if (vi->mergeable_rx_bufs)
> -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> +		sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
>  	else
> -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
> +		sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
>  
> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> +	hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
> +	return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
>  					0, skb);
>  }
>  
> @@ -561,31 +574,34 @@ static netdev_tx_t start_xmit(struct sk_
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
> +	int qnum = skb_get_queue_mapping(skb);
> +	struct virtqueue *svq = vi->sq[qnum].svq;
>  
>  	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> +	free_old_xmit_skbs(vi, svq);
>  
>  	/* Try to transmit */
> -	capacity = xmit_skb(vi, skb);
> +	capacity = xmit_skb(vi, skb, svq, vi->sq[qnum].tx_sg);
>  
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
>  		if (net_ratelimit()) {
>  			if (likely(capacity == -ENOMEM)) {
>  				dev_warn(&dev->dev,
> -					 "TX queue failure: out of memory\n");
> +					 "TXQ (%d) failure: out of memory\n",
> +					 qnum);
>  			} else {
>  				dev->stats.tx_fifo_errors++;
>  				dev_warn(&dev->dev,
> -					 "Unexpected TX queue failure: %d\n",
> -					 capacity);
> +					 "Unexpected TXQ (%d) failure: %d\n",
> +					 qnum, capacity);
>  			}
>  		}
>  		dev->stats.tx_dropped++;
>  		kfree_skb(skb);
>  		return NETDEV_TX_OK;
>  	}
> -	virtqueue_kick(vi->svq);
> +	virtqueue_kick(svq);
>  
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -594,13 +610,13 @@ static netdev_tx_t start_xmit(struct sk_
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
> +		netif_stop_subqueue(dev, qnum);
> +		if (unlikely(!virtqueue_enable_cb(svq))) {
>  			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> +			capacity += free_old_xmit_skbs(vi, svq);
>  			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				virtqueue_disable_cb(vi->svq);
> +				netif_start_subqueue(dev, qnum);
> +				virtqueue_disable_cb(svq);
>  			}
>  		}
>  	}
> @@ -871,10 +887,10 @@ static void virtnet_update_status(struct
>  
>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>  		netif_carrier_on(vi->dev);
> -		netif_wake_queue(vi->dev);
> +		netif_tx_wake_all_queues(vi->dev);
>  	} else {
>  		netif_carrier_off(vi->dev);
> -		netif_stop_queue(vi->dev);
> +		netif_tx_stop_all_queues(vi->dev);
>  	}
>  }
>  
> @@ -885,18 +901,112 @@ static void virtnet_config_changed(struc
>  	virtnet_update_status(vi);
>  }
>  
> +#define MAX_DEVICE_NAME		16
> +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
> +{
> +	vq_callback_t **callbacks;
> +	struct virtqueue **vqs;
> +	int i, err = -ENOMEM;
> +	int totalvqs;
> +	char **names;
> +
> +	/* Allocate send queues */

no check on numtxqs? Hmm...

Please then use kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL) so that
some check is done for you ;)

> +	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
> +	if (!vi->sq)
> +		goto out;
> +
> +	/* setup initial send queue parameters */
> +	for (i = 0; i < numtxqs; i++)
> +		sg_init_table(vi->sq[i].tx_sg, ARRAY_SIZE(vi->sq[i].tx_sg));
> +
> +	/*
> +	 * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
> +	 * optionally one control virtqueue.
> +	 */
> +	totalvqs = 1 + numtxqs +
> +		   virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> +	/* Setup parameters for find_vqs */
> +	vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
> +	callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
> +	names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
> +	if (!vqs || !callbacks || !names)
> +		goto free_mem;
> +
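For reference, the layout described in the comment above would be filled
in roughly as below; this is only a sketch assuming the driver's existing
skb_recv_done/skb_xmit_done callbacks, since the actual body is truncated
here:

    /* vq[0] = RX, vq[1..numtxqs] = TX, optional last vq = control */
    callbacks[0] = skb_recv_done;
    names[0] = "input";
    for (i = 1; i <= numtxqs; i++) {
            callbacks[i] = skb_xmit_done;
            names[i] = "output";    /* per-queue names need kasprintf() */
    }
    if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
            callbacks[totalvqs - 1] = NULL;
            names[totalvqs - 1] = "control";
    }
    err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs,
                                     callbacks, (const char **)names);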



* Re: [v2 RFC PATCH 2/4] Changes for virtio-net
  2010-09-17 10:25   ` Eric Dumazet
@ 2010-09-17 12:27     ` Krishna Kumar2
  2010-09-17 13:20       ` Krishna Kumar2
  0 siblings, 1 reply; 26+ messages in thread
From: Krishna Kumar2 @ 2010-09-17 12:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: anthony, arnd, avi, davem, kvm, mst, netdev, rusty

Eric Dumazet <eric.dumazet@gmail.com> wrote on 09/17/2010 03:55:54 PM:

> > +/* Our representation of a send virtqueue */
> > +struct send_queue {
> > +   struct virtqueue *svq;
> > +
> > +   /* TX: fragments + linear part + virtio header */
> > +   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> > +};
>
> You probably want ____cacheline_aligned_in_smp

I had tried this and mentioned it in Patch 0/4:
"2. Cache-align data structures: I didn't see any BW/SD improvement
   after making the sq's (and similarly for vhost) cache-aligned
   statically:
        struct virtnet_info {
                ...
                struct send_queue sq[16] ____cacheline_aligned_in_smp;
                ...
        };
"

I am not sure why this made no difference.

> > +
> >  struct virtnet_info {
> >     struct virtio_device *vdev;
> > -   struct virtqueue *rvq, *svq, *cvq;
> > +   int numtxqs;         /* Number of tx queues */
> > +   struct send_queue *sq;
> > +   struct virtqueue *rvq;
> > +   struct virtqueue *cvq;
> >     struct net_device *dev;
>
> struct napi will probably be dirtied by RX processing.
>
> You should make sure it doesn't dirty the cache line of the above
> (read-mostly) fields.

I am not changing the layout of napi w.r.t. the other pointers in
this patch, though the to-be-submitted RX patch does that.
Should I do something for this TX-only patch?

> > +#define MAX_DEVICE_NAME      16
> > +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
> > +{
> > +   vq_callback_t **callbacks;
> > +   struct virtqueue **vqs;
> > +   int i, err = -ENOMEM;
> > +   int totalvqs;
> > +   char **names;
> > +
> > +   /* Allocate send queues */
>
> no check on numtxqs? Hmm...
>
> Please then use kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL) so that
> some check is done for you ;)

Right! I need to re-introduce some limit. Rusty, should I simply
add a check for a constant (like 256) here?
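Something along these lines is what I have in mind, as a sketch
(VIRTNET_MAX_TXQS is a placeholder name and 256 a placeholder value):

    #define VIRTNET_MAX_TXQS    256    /* placeholder cap, value TBD */

            if (numtxqs < 1 || numtxqs > VIRTNET_MAX_TXQS)
                    return -EINVAL;

            /* kcalloc also checks numtxqs * sizeof(*vi->sq) for overflow */
            vi->sq = kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL);
            if (!vi->sq)
                    goto out;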

Thanks for your review, Eric!

- KK



* Re: [v2 RFC PATCH 2/4] Changes for virtio-net
  2010-09-17 12:27     ` Krishna Kumar2
@ 2010-09-17 13:20       ` Krishna Kumar2
  0 siblings, 0 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-09-17 13:20 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, Eric Dumazet, kvm, mst, netdev, rusty

> Krishna Kumar2/India/IBM@IBMIN
> Sent by: netdev-owner@vger.kernel.org

> > > +
> > >  struct virtnet_info {
> > >     struct virtio_device *vdev;
> > > -   struct virtqueue *rvq, *svq, *cvq;
> > > +   int numtxqs;         /* Number of tx queues */
> > > +   struct send_queue *sq;
> > > +   struct virtqueue *rvq;
> > > +   struct virtqueue *cvq;
> > >     struct net_device *dev;
> >
> > struct napi will probably be dirtied by RX processing.
> >
> > You should make sure it doesn't dirty the cache line of the above
> > (read-mostly) fields.
>
> I am changing the layout of napi wrt other pointers in
> this patch, though the to-be-submitted RX patch does that.
> Should I do something for this TX-only patch?

Sorry, I think my sentence is not clear! I will make this
change (and also cache-line align the send queues), test
and let you know the result.

Thanks,

- KK



* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-17 10:03 [v2 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (3 preceding siblings ...)
  2010-09-17 10:03 ` [v2 RFC PATCH 4/4] qemu changes Krishna Kumar
@ 2010-09-17 15:42 ` Sridhar Samudrala
  2010-09-19 12:44 ` Michael S. Tsirkin
  5 siblings, 0 replies; 26+ messages in thread
From: Sridhar Samudrala @ 2010-09-17 15:42 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, mst, kvm, arnd, netdev, avi, anthony

On Fri, 2010-09-17 at 15:33 +0530, Krishna Kumar wrote:
> Following patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes. MQ is disabled by default
> unless qemu specifies it.
> 
> 1. This feature was first implemented with a single vhost.
>    Testing showed 3-8% performance gain for upto 8 netperf
>    sessions (and sometimes 16), but BW dropped with more
>    sessions.  However, adding more vhosts improved BW
>    significantly all the way to 128 sessions. Multiple
>    vhost is implemented in-kernel by passing an argument
>    to SET_OWNER (retaining backward compatibility). The
>    vhost patch adds 173 source lines (incl comments).
> 2. BW -> CPU/SD equation: Average TCP performance increased
>    23% compared to almost 70% for earlier patch (with
>    unrestricted #vhosts).  SD improved -4.2% while it had
>    increased 55% for the earlier patch.  Increasing #vhosts
>    has it's pros and cons, but this patch lays emphasis on
>    reducing CPU utilization.  Another option could be a
>    tunable to select number of vhosts threads.
> 3. Interoperability: Many combinations, but not all, of qemu,
>    host, guest tested together.  Tested with multiple i/f's
>    on guest, with both mq=on/off, vhost=on/off, etc.
> 
>                   Changes from rev1:
>                   ------------------
> 1. Move queue_index from virtio_pci_vq_info to virtqueue,
>    and resulting changes to existing code and to the patch.
> 2. virtio-net probe uses virtio_config_val.
> 3. Remove constants: VIRTIO_MAX_TXQS, MAX_VQS, all arrays
>    allocated on stack, etc.
> 4. Restrict number of vhost threads to 2 - I get much better
>    cpu/sd results (without any tuning) with low number of vhost
>    threads.  Higher vhosts gives better average BW performance
>    (from average of 45%), but SD increases significantly (90%).
> 5. Working of vhost threads changes, eg for numtxqs=4:
>        vhost-0: handles RX
>        vhost-1: handles TX[0]
>        vhost-0: handles TX[1]
>        vhost-1: handles TX[2]
>        vhost-0: handles TX[3]

This doesn't look symmetrical.
TCP flows that go via TX(1,3) use the same vhost thread for RX packets,
whereas flows via TX(0,2) use a different vhost thread.

Thanks
Sridhar



* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-17 10:03 [v2 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
                   ` (4 preceding siblings ...)
  2010-09-17 15:42 ` [v2 RFC PATCH 0/4] Implement multiqueue virtio-net Sridhar Samudrala
@ 2010-09-19 12:44 ` Michael S. Tsirkin
  2010-10-05 10:40   ` Krishna Kumar2
  5 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-09-19 12:44 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, kvm, arnd, netdev, avi, anthony

On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> for degradation for 1 stream case:

Could you document how exactly you measure multistream bandwidth:
netperf flags, etc.?

>     1. Without any tuning, BW falls -6.5%.

Any idea where this comes from?
Do you see more TX interrupts? RX interrupts? Exits?
Do interrupts bounce more between guest CPUs?


>     2. When vhosts on server were bound to CPU0, BW was as good
>        as with original code.
>     3. When new code was started with numtxqs=1 (or mq=off, which
>        is the default), there was no degradation.
> 
>                        Next steps:
>                        -----------
> 1. MQ RX patch is also complete - plan to submit once TX is OK (as
>    well as after identifying bandwidth degradations for some test
>    cases).
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>    after making the sq's (and similarly for vhost) cache-aligned
>    statically:
>         struct virtnet_info {
>                 ...
>                 struct send_queue sq[16] ____cacheline_aligned_in_smp;
>                 ...
>         };
> 3. Migration is not tested.

4. Identify reasons for single netperf BW regression.

5. Test perf in more scenarios:
   small packets
   host -> guest
   guest <-> external
   in last case:
	 find some other way to measure host CPU utilization,
	 try multiqueue and single queue devices

6. Use above to figure out what is a sane default for numtxqs.

> 
> Review/feedback appreciated.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---


* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-09-19 12:44 ` Michael S. Tsirkin
@ 2010-10-05 10:40   ` Krishna Kumar2
  2010-10-05 18:23     ` Michael S. Tsirkin
  2010-10-06 12:19     ` Arnd Bergmann
  0 siblings, 2 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-05 10:40 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/19/2010 06:14:43 PM:

> Could you document how exactly you measure multistream bandwidth:
> netperf flags, etc.?

All results were without any netperf flags or system tuning:
    for i in $list
    do
        netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
    done
    wait
Another script processes the result files.  It also displays the
start time/end time of each iteration to make sure skew due to
parallel netperfs is minimal.

I changed the vhost functionality once more to try to get the
best model, the new model being:
1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
   TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
   queues are handled by vhost threads in round-robin fashion.
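In code terms, that mapping is roughly as follows (vhost_for_vq() and
MAX_TX_VHOSTS are illustrative names, not the actual patch):

    #define MAX_TX_VHOSTS    4

    /* qnum 0 is the RX vq, qnums 1..numtxqs are TX[0..numtxqs-1] */
    static int vhost_for_vq(int qnum, int numtxqs)
    {
            if (numtxqs == 1 || qnum == 0)
                    return 0;    /* single thread, or dedicated RX thread */
            /* TX round-robins over vhost-1..vhost-4 */
            return 1 + (qnum - 1) % MAX_TX_VHOSTS;
    }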

Results from here on are with these changes, and the only "tuning" is
to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").

> Any idea where this comes from?
> Do you see more TX interrupts? RX interrupts? Exits?
> Do interrupts bounce more between guest CPUs?
> 4. Identify reasons for single netperf BW regression.

After testing various combinations of #txqs, #vhosts, #netperf
sessions, I think the drop for 1 stream is due to TX and RX for
a flow being processed on different cpus.  I did two more tests:
    1. Pin vhosts to same CPU:
        - BW drop is much lower for 1 stream case (- 5 to -8% range)
        - But performance is not so high for more sessions.
    2. Changed vhost to be single threaded:
          - No degradation for 1 session, and improvement for upto
	      8, sometimes 16 streams (5-12%).
          - BW degrades after that, all the way till 128 netperf sessions.
          - But overall CPU utilization improves.
            Summary of the entire run (for 1-128 sessions):
                txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
                txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)

I don't see any reasons mentioned above.  However, for higher
number of netperf sessions, I see a big increase in retransmissions:
_______________________________________
#netperf      ORG           NEW
            BW (#retr)    BW (#retr)
_______________________________________
1          70244 (0)     64102 (0)
4          21421 (0)     36570 (416)
8          21746 (0)     38604 (148)
16         21783 (0)     40632 (464)
32         22677 (0)     37163 (1053)
64         23648 (4)     36449 (2197)
128        23251 (2)     31676 (3185)
_______________________________________

Single netperf case didn't have any retransmissions so that is not
the cause for drop.  I tested ixgbe (MQ):
___________________________________________________________
#netperf      ixgbe             ixgbe (pin intrs to cpu#0 on
                                       both server/client)
            BW (#retr)          BW (#retr)
___________________________________________________________
1           3567 (117)          6000 (251)
2           4406 (477)          6298 (725)
4           6119 (1085)         7208 (3387)
8           6595 (4276)         7381 (15296)
16          6651 (11651)        6856 (30394)
___________________________________________________________

> 5. Test perf in more scenarios:
>    small packets

512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
but increases with #sessions:
_______________________________________________________________________________
#       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
_______________________________________________________________________________
1       4043    3800 (-6.0)     50      50 (0)          86      98 (13.9)
2       8358    7485 (-10.4)    153     178 (16.3)      230     264 (14.7)
4       20664   13567 (-34.3)   448     490 (9.3)       530     624 (17.7)
8       25198   17590 (-30.1)   967     1021 (5.5)      1085    1257 (15.8)
16      23791   24057 (1.1)     1904    2220 (16.5)     2156    2578 (19.5)
24      23055   26378 (14.4)    2807    3378 (20.3)     3225    3901 (20.9)
32      22873   27116 (18.5)    3748    4525 (20.7)     4307    5239 (21.6)
40      22876   29106 (27.2)    4705    5717 (21.5)     5388    6591 (22.3)
48      23099   31352 (35.7)    5642    6986 (23.8)     6475    8085 (24.8)
64      22645   30563 (34.9)    7527    9027 (19.9)     8619    10656 (23.6)
80      22497   31922 (41.8)    9375    11390 (21.4)    10736   13485 (25.6)
96      22509   32718 (45.3)    11271   13710 (21.6)    12927   16269 (25.8)
128     22255   32397 (45.5)    15036   18093 (20.3)    17144   21608 (26.0)
_______________________________________________________________________________
SUM:    BW: (16.7)      CPU: (20.6)     RCPU: (24.3)
_______________________________________________________________________________

> host -> guest
_______________________________________________________________________________
#       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
_______________________________________________________________________________
*1      70706   90398 (27.8)    300     327 (9.0)       140     175 (25.0)
2       20951   21937 (4.7)     188     196 (4.2)       93      103 (10.7)
4       19952   25281 (26.7)    397     496 (24.9)      210     304 (44.7)
8       18559   24992 (34.6)    802     1010 (25.9)     439     659 (50.1)
16      18882   25608 (35.6)    1642    2082 (26.7)     953     1454 (52.5)
24      19012   26955 (41.7)    2465    3153 (27.9)     1452    2254 (55.2)
32      19846   26894 (35.5)    3278    4238 (29.2)     1914    3081 (60.9)
40      19704   27034 (37.2)    4104    5303 (29.2)     2409    3866 (60.4)
48      19721   26832 (36.0)    4924    6418 (30.3)     2898    4701 (62.2)
64      19650   26849 (36.6)    6595    8611 (30.5)     3975    6433 (61.8)
80      19432   26823 (38.0)    8244    10817 (31.2)    4985    8165 (63.7)
96      20347   27886 (37.0)    9913    13017 (31.3)    5982    9860 (64.8)
128     19108   27715 (45.0)    13254   17546 (32.3)    8153    13589 (66.6)
_______________________________________________________________________________
SUM:    BW: (32.4)      CPU: (30.4)     RCPU: (62.6)
_______________________________________________________________________________
*: Sum over 7 iterations, remaining test cases are sum over 2 iterations

> guest <-> external

I haven't done this right now since I don't have a setup.  I guess
it would be limited by wire speed and gains may not be there.  I
will try to do this later when I get the setup.

> in last case:
> find some other way to measure host CPU utilization,
> try multiqueue and single queue devices
> 6. Use above to figure out what is a sane default for numtxqs

A. Summary for default I/O (16K):
#txqs=2 (#vhost=3):       BW: (37.6)      CPU: (69.2)     RCPU: (40.8)
#txqs=4 (#vhost=5):       BW: (36.9)      CPU: (60.9)     RCPU: (25.2)
#txqs=8 (#vhost=5):       BW: (41.8)      CPU: (50.0)     RCPU: (15.2)
#txqs=16 (#vhost=5):      BW: (40.4)      CPU: (49.9)     RCPU: (10.0)

B. Summary for 512 byte I/O:
#txqs=2 (#vhost=3):       BW: (31.6)      CPU: (35.7)     RCPU: (28.6)
#txqs=4 (#vhost=5):       BW: (5.7)       CPU: (27.2)     RCPU: (22.7)
#txqs=8 (#vhost=5):       BW: (-.6)       CPU: (25.1)     RCPU: (22.5)
#txqs=16 (#vhost=5):      BW: (-6.6)      CPU: (24.7)     RCPU: (21.7)

Summary:

1. Average BW increase for regular I/O is best for #txq=16 with the
   least CPU utilization increase.
2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
   #txqs, BW increased only after a particular #netperf sessions - in
   my testing that limit was 32 netperf sessions.
3. Multiple txq for guest by itself doesn't seem to have any issues.
   Guest CPU% increase is slightly higher than BW improvement.  I
   think it is true for all mq drivers since more paths run in parallel
   upto the device instead of sleeping and allowing one thread to send
   all packets via qdisc_restart.
4. Having high number of txqs gives better gains and reduces cpu util
   on the guest and the host.
5. MQ is intended for server loads.  MQ should probably not be explicitly
   specified for client systems.
6. No regression with numtxqs=1 (or if mq option is not used) in any
   testing scenario.

I will send the v3 patch within a day after some more testing.

Thanks,

- KK



* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-05 10:40   ` Krishna Kumar2
@ 2010-10-05 18:23     ` Michael S. Tsirkin
  2010-10-06 17:43       ` Krishna Kumar2
  2010-10-06 12:19     ` Arnd Bergmann
  1 sibling, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-10-05 18:23 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/19/2010 06:14:43 PM:
> 
> > Could you document how exactly you measure multistream bandwidth:
> > netperf flags, etc.?
> 
> All results were without any netperf flags or system tuning:
>     for i in $list
>     do
>         netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
>     done
>     wait
> Another script processes the result files.  It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
> 
> I changed the vhost functionality once more to try to get the
> best model, the new model being:
> 1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
>    TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
>    queues are handled by vhost threads in round-robin fashion.
> 
> Results from here on are with these changes, and the only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").
> 
> > Any idea where this comes from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
> 
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.

Right. Can we fix it?

>  I did two more tests:
>     1. Pin vhosts to same CPU:
>         - BW drop is much lower for 1 stream case (- 5 to -8% range)
>         - But performance is not so high for more sessions.
>     2. Changed vhost to be single threaded:
>           - No degradation for 1 session, and improvement for upto
> 	      8, sometimes 16 streams (5-12%).
>           - BW degrades after that, all the way till 128 netperf sessions.
>           - But overall CPU utilization improves.
>             Summary of the entire run (for 1-128 sessions):
>                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
>                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> 
> I don't see any reasons mentioned above.  However, for higher
> number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

> _______________________________________
> #netperf      ORG           NEW
>             BW (#retr)    BW (#retr)
> _______________________________________
> 1          70244 (0)     64102 (0)
> 4          21421 (0)     36570 (416)
> 8          21746 (0)     38604 (148)
> 16         21783 (0)     40632 (464)
> 32         22677 (0)     37163 (1053)
> 64         23648 (4)     36449 (2197)
> 128        23251 (2)     31676 (3185)
> _______________________________________
> 
> Single netperf case didn't have any retransmissions so that is not
> the cause for drop.  I tested ixgbe (MQ):
> ___________________________________________________________
> #netperf      ixgbe             ixgbe (pin intrs to cpu#0 on
>                                        both server/client)
>             BW (#retr)          BW (#retr)
> ___________________________________________________________
> 1           3567 (117)          6000 (251)
> 2           4406 (477)          6298 (725)
> 4           6119 (1085)         7208 (3387)
> 8           6595 (4276)         7381 (15296)
> 16          6651 (11651)        6856 (30394)

Interesting.
You are saying we get much more retransmissions with physical nic as
well?

> ___________________________________________________________
> 
> > 5. Test perf in more scenarios:
> >    small packets
> 
> 512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
> but increases with #sessions:
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> 1       4043    3800 (-6.0)     50      50 (0)          86      98 (13.9)
> 2       8358    7485 (-10.4)    153     178 (16.3)      230     264 (14.7)
> 4       20664   13567 (-34.3)   448     490 (9.3)       530     624 (17.7)
> 8       25198   17590 (-30.1)   967     1021 (5.5)      1085    1257 (15.8)
> 16      23791   24057 (1.1)     1904    2220 (16.5)     2156    2578 (19.5)
> 24      23055   26378 (14.4)    2807    3378 (20.3)     3225    3901 (20.9)
> 32      22873   27116 (18.5)    3748    4525 (20.7)     4307    5239 (21.6)
> 40      22876   29106 (27.2)    4705    5717 (21.5)     5388    6591 (22.3)
> 48      23099   31352 (35.7)    5642    6986 (23.8)     6475    8085 (24.8)
> 64      22645   30563 (34.9)    7527    9027 (19.9)     8619    10656 (23.6)
> 80      22497   31922 (41.8)    9375    11390 (21.4)    10736   13485 (25.6)
> 96      22509   32718 (45.3)    11271   13710 (21.6)    12927   16269 (25.8)
> 128     22255   32397 (45.5)    15036   18093 (20.3)    17144   21608 (26.0)
> _______________________________________________________________________________
> SUM:    BW: (16.7)      CPU: (20.6)     RCPU: (24.3)
> _______________________________________________________________________________
> 
> > host -> guest
> _______________________________________________________________________________
> #       BW1     BW2 (%)         CPU1    CPU2 (%)        RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> *1      70706   90398 (27.8)    300     327 (9.0)       140     175 (25.0)
> 2       20951   21937 (4.7)     188     196 (4.2)       93      103 (10.7)
> 4       19952   25281 (26.7)    397     496 (24.9)      210     304 (44.7)
> 8       18559   24992 (34.6)    802     1010 (25.9)     439     659 (50.1)
> 16      18882   25608 (35.6)    1642    2082 (26.7)     953     1454 (52.5)
> 24      19012   26955 (41.7)    2465    3153 (27.9)     1452    2254 (55.2)
> 32      19846   26894 (35.5)    3278    4238 (29.2)     1914    3081 (60.9)
> 40      19704   27034 (37.2)    4104    5303 (29.2)     2409    3866 (60.4)
> 48      19721   26832 (36.0)    4924    6418 (30.3)     2898    4701 (62.2)
> 64      19650   26849 (36.6)    6595    8611 (30.5)     3975    6433 (61.8)
> 80      19432   26823 (38.0)    8244    10817 (31.2)    4985    8165 (63.7)
> 96      20347   27886 (37.0)    9913    13017 (31.3)    5982    9860 (64.8)
> 128     19108   27715 (45.0)    13254   17546 (32.3)    8153    13589 (66.6)
> _______________________________________________________________________________
> SUM:    BW: (32.4)      CPU: (30.4)     RCPU: (62.6)
> _______________________________________________________________________________
> *: Sum over 7 iterations, remaining test cases are sum over 2 iterations
> 
> > guest <-> external
> 
> I haven't done this right now since I don't have a setup.  I guess
> it would be limited by wire speed and gains may not be there.  I
> will try to do this later when I get the setup.

OK, but at least we need to check that it does not hurt things.

> > in last case:
> > find some other way to measure host CPU utilization,
> > try multiqueue and single queue devices
> > 6. Use above to figure out what is a sane default for numtxqs
> 
> A. Summary for default I/O (16K):
> #txqs=2 (#vhost=3):       BW: (37.6)      CPU: (69.2)     RCPU: (40.8)
> #txqs=4 (#vhost=5):       BW: (36.9)      CPU: (60.9)     RCPU: (25.2)
> #txqs=8 (#vhost=5):       BW: (41.8)      CPU: (50.0)     RCPU: (15.2)
> #txqs=16 (#vhost=5):      BW: (40.4)      CPU: (49.9)     RCPU: (10.0)
> 
> B. Summary for 512 byte I/O:
> #txqs=2 (#vhost=3):       BW: (31.6)      CPU: (35.7)     RCPU: (28.6)
> #txqs=4 (#vhost=5):       BW: (5.7)       CPU: (27.2)     RCPU: (22.7)
> #txqs=8 (#vhost=5):       BW: (-.6)       CPU: (25.1)     RCPU: (22.5)
> #txqs=16 (#vhost=5):      BW: (-6.6)      CPU: (24.7)     RCPU: (21.7)
> 
> Summary:
> 
> 1. Average BW increase for regular I/O is best for #txq=16 with the
>    least CPU utilization increase.
> 2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
>    #txqs, BW increased only after a particular #netperf sessions - in
>    my testing that limit was 32 netperf sessions.
> 3. Multiple txq for guest by itself doesn't seem to have any issues.
>    Guest CPU% increase is slightly higher than BW improvement.  I
>    think it is true for all mq drivers since more paths run in parallel
>    upto the device instead of sleeping and allowing one thread to send
>    all packets via qdisc_restart.
> 4. Having high number of txqs gives better gains and reduces cpu util
>    on the guest and the host.
> 5. MQ is intended for server loads.  MQ should probably not be explicitly
>    specified for client systems.
> 6. No regression with numtxqs=1 (or if mq option is not used) in any
>    testing scenario.

Of course txq=1 can be considered a kind of fix, but if we know the
issue is TX/RX flows getting bounced between CPUs, can we fix this?
Workload-specific optimizations can only get us this far.

> 
> I will send the v3 patch within a day after some more testing.
> 
> Thanks,
> 
> - KK


* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-05 10:40   ` Krishna Kumar2
  2010-10-05 18:23     ` Michael S. Tsirkin
@ 2010-10-06 12:19     ` Arnd Bergmann
  2010-10-06 17:14       ` Krishna Kumar2
  1 sibling, 1 reply; 26+ messages in thread
From: Arnd Bergmann @ 2010-10-06 12:19 UTC (permalink / raw)
  To: Krishna Kumar2, Ben Greear
  Cc: Michael S. Tsirkin, anthony, avi, davem, kvm, netdev, rusty

On Tuesday 05 October 2010, Krishna Kumar2 wrote:
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.  I did two more tests:
>     1. Pin vhosts to same CPU:
>         - BW drop is much lower for 1 stream case (- 5 to -8% range)
>         - But performance is not so high for more sessions.
>     2. Changed vhost to be single threaded:
>           - No degradation for 1 session, and improvement for upto
>               8, sometimes 16 streams (5-12%).
>           - BW degrades after that, all the way till 128 netperf sessions.
>           - But overall CPU utilization improves.
>             Summary of the entire run (for 1-128 sessions):
>                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
>                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> 
> I don't see any reasons mentioned above.  However, for higher
> number of netperf sessions, I see a big increase in retransmissions:
> _______________________________________
> #netperf      ORG           NEW
>             BW (#retr)    BW (#retr)
> _______________________________________
> 1          70244 (0)     64102 (0)
> 4          21421 (0)     36570 (416)
> 8          21746 (0)     38604 (148)
> 16         21783 (0)     40632 (464)
> 32         22677 (0)     37163 (1053)
> 64         23648 (4)     36449 (2197)
> 128        23251 (2)     31676 (3185)
> _______________________________________


This smells like it could be related to a problem that Ben Greear found
recently (see "macvlan:  Enable qdisc backoff logic"). When the hardware
is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN
to qemu (or vhost-net) to trigger a resend.

I suppose what we really should do is feed that condition back to the
guest network stack and implement the backoff in there.
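In virtio-net terms, the guest side of that backoff could look roughly
like this in start_xmit; a sketch only, since the real fix would also
need the host to keep the TX ring full while the backend is congested:

    /* Sketch: instead of dropping when no descriptors are left,
     * stop the subqueue and let the qdisc back off. */
    if (unlikely(capacity < 0)) {
            netif_stop_subqueue(dev, qnum);
            virtqueue_enable_cb(svq);    /* re-arm "space freed" callback */
            return NETDEV_TX_BUSY;       /* stack will requeue the skb */
    }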

	Arnd


* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-06 12:19     ` Arnd Bergmann
@ 2010-10-06 17:14       ` Krishna Kumar2
  2010-10-06 17:50         ` Arnd Bergmann
  0 siblings, 1 reply; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-06 17:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: anthony, avi, davem, Ben Greear, kvm, Michael S. Tsirkin, netdev, rusty

Arnd Bergmann <arnd@arndb.de> wrote on 10/06/2010 05:49:00 PM:

> > I don't see any reasons mentioned above.  However, for higher
> > number of netperf sessions, I see a big increase in retransmissions:
> > _______________________________________
> > #netperf      ORG           NEW
> >             BW (#retr)    BW (#retr)
> > _______________________________________
> > 1          70244 (0)     64102 (0)
> > 4          21421 (0)     36570 (416)
> > 8          21746 (0)     38604 (148)
> > 16         21783 (0)     40632 (464)
> > 32         22677 (0)     37163 (1053)
> > 64         23648 (4)     36449 (2197)
> > 128        23251 (2)     31676 (3185)
> > _______________________________________
>
>
> This smells like it could be related to a problem that Ben Greear found
> recently (see "macvlan:  Enable qdisc backoff logic"). When the hardware
> is busy, we used to just drop the packet. With Ben's patch, we return
> -EAGAIN to qemu (or vhost-net) to trigger a resend.
>
> I suppose what we really should do is feed that condition back to the
> guest network stack and implement the backoff in there.

Thanks for the pointer. I will take a look at this as I hadn't seen
this patch earlier. Is there any way to figure out if this is the
issue?

Thanks,

- KK



* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-05 18:23     ` Michael S. Tsirkin
@ 2010-10-06 17:43       ` Krishna Kumar2
  2010-10-06 19:03         ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-06 17:43 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/05/2010 11:53:23 PM:

> > > Any idea where this comes from?
> > > Do you see more TX interrupts? RX interrupts? Exits?
> > > Do interrupts bounce more between guest CPUs?
> > > 4. Identify reasons for single netperf BW regression.
> >
> > After testing various combinations of #txqs, #vhosts, #netperf
> > sessions, I think the drop for 1 stream is due to TX and RX for
> > a flow being processed on different cpus.
>
> Right. Can we fix it?

I am not sure how to. My initial patch had one thread but gave
small gains and ran into limitations once the number of sessions
became large.

> >  I did two more tests:
> >     1. Pin vhosts to same CPU:
> >         - BW drop is much lower for 1 stream case (- 5 to -8% range)
> >         - But performance is not so high for more sessions.
> >     2. Changed vhost to be single threaded:
> >           - No degradation for 1 session, and improvement for upto
> >          8, sometimes 16 streams (5-12%).
> >           - BW degrades after that, all the way till 128 netperf sessions.
> >           - But overall CPU utilization improves.
> >             Summary of the entire run (for 1-128 sessions):
> >                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
> >                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> >
> > I don't see any reasons mentioned above.  However, for higher
> > number of netperf sessions, I see a big increase in retransmissions:
>
> Hmm, ok, and do you see any errors?

I haven't seen any errors in statistics, messages, etc.  Also no
retransmissions for txq=1.

> > Single netperf case didn't have any retransmissions so that is not
> > the cause for drop.  I tested ixgbe (MQ):
> > ___________________________________________________________
> > #netperf      ixgbe             ixgbe (pin intrs to cpu#0 on
> >                                        both server/client)
> >             BW (#retr)          BW (#retr)
> > ___________________________________________________________
> > 1           3567 (117)          6000 (251)
> > 2           4406 (477)          6298 (725)
> > 4           6119 (1085)         7208 (3387)
> > 8           6595 (4276)         7381 (15296)
> > 16          6651 (11651)        6856 (30394)
>
> Interesting.
> You are saying we get much more retransmissions with physical nic as
> well?

Yes, with ixgbe. I re-ran with 16 netperfs running for 15 secs on
both ixgbe and cxgb3 just now to reconfirm:

ixgbe: BW: 6186.85  SD/Remote: 135.711, 339.376  CPU/Remote: 79.99, 200.00  Retrans: 545
cxgb3: BW: 8051.07  SD/Remote: 144.416, 260.487  CPU/Remote: 110.88, 200.00  Retrans: 0

However, 64 netperfs for 30 secs gave:

ixgbe: BW: 6691.12  SD/Remote: 8046.617, 5259.992  CPU/Remote: 1223.86, 799.97  Retrans: 1424
cxgb3: BW: 7799.16  SD/Remote: 2589.875, 4317.013  CPU/Remote: 480.39, 800.64  Retrans: 649

# ethtool -i eth4
driver: ixgbe
version: 2.0.84-k2
firmware-version: 0.9-3
bus-info: 0000:1f:00.1

# ifconfig output:
       RX packets:783241 errors:0 dropped:0 overruns:0 frame:0
       TX packets:689533 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000

# lspci output:
1f:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
        Subsystem: Intel Corporation Ethernet Server Adapter X520-2
        Flags: bus master, fast devsel, latency 0, IRQ 30
        Memory at 98900000 (64-bit, prefetchable) [size=512K]
        I/O ports at 2020 [size=32]
        Memory at 98a00000 (64-bit, prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-40-4a-b4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe

> > I haven't done this right now since I don't have a setup.  I guess
> > it would be limited by wire speed and gains may not be there.  I
> > will try to do this later when I get the setup.
>
> OK, but at least we need to check that it does not hurt things.

Yes, sure.

> > Summary:
> >
> > 1. Average BW increase for regular I/O is best for #txq=16 with the
> >    least CPU utilization increase.
> > 2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
> >    #txqs, BW increased only after a particular #netperf sessions - in
> >    my testing that limit was 32 netperf sessions.
> > 3. Multiple txqs for the guest don't seem to cause any issues by
> >    themselves.  Guest CPU% increase is slightly higher than the BW
> >    improvement.  I think this is true for all mq drivers since more
> >    paths run in parallel up to the device instead of sleeping and
> >    allowing one thread to send all packets via qdisc_restart.
> > 4. Having a high number of txqs gives better gains and reduces cpu
> >    util on the guest and the host.
> > 5. MQ is intended for server loads.  MQ should probably not be
> >    explicitly specified for client systems.
> > 6. No regression with numtxqs=1 (or if mq option is not used) in any
> >    testing scenario.
>
> Of course txq=1 can be considered a kind of fix, but if we know the
> issue is TX/RX flows getting bounced between CPUs, can we fix this?
> Workload-specific optimizations can only get us so far.

I will test with your patch tomorrow night once I am back.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-06 17:14       ` Krishna Kumar2
@ 2010-10-06 17:50         ` Arnd Bergmann
  0 siblings, 0 replies; 26+ messages in thread
From: Arnd Bergmann @ 2010-10-06 17:50 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, avi, davem, Ben Greear, kvm, Michael S. Tsirkin, netdev, rusty

On Wednesday 06 October 2010 19:14:42 Krishna Kumar2 wrote:
> Arnd Bergmann <arnd@arndb.de> wrote on 10/06/2010 05:49:00 PM:
> 
> > > I don't see any reasons mentioned above.  However, for higher
> > > number of netperf sessions, I see a big increase in retransmissions:
> > > _______________________________________
> > > #netperf      ORG           NEW
> > >             BW (#retr)    BW (#retr)
> > > _______________________________________
> > > 1          70244 (0)     64102 (0)
> > > 4          21421 (0)     36570 (416)
> > > 8          21746 (0)     38604 (148)
> > > 16         21783 (0)     40632 (464)
> > > 32         22677 (0)     37163 (1053)
> > > 64         23648 (4)     36449 (2197)
> > > 128        23251 (2)     31676 (3185)
> > > _______________________________________
> >
> >
> > This smells like it could be related to a problem that Ben Greear found
> > recently (see "macvlan: Enable qdisc backoff logic"). When the hardware
> > is busy, we used to just drop the packet. With Ben's patch, we return
> > -EAGAIN to qemu (or vhost-net) to trigger a resend.
> >
> > I suppose what we really should do is feed that condition back to the
> > guest network stack and implement the backoff in there.
> 
> Thanks for the pointer. I will take a look at this as I hadn't seen
> this patch earlier. Is there any way to figure out if this is the
> issue?

I think a good indication would be if this changes with/without the
patch, and if you see -EAGAIN in qemu with the patch applied.
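
To be concrete, the check amounts to something like this on the
transmit path (a rough userspace-flavoured sketch of the backoff, not
code from Ben's patch; wait_for_writable() is a placeholder for a
poll(POLLOUT) loop):

	for (;;) {
		ssize_t len = sendmsg(fd, &msg, MSG_DONTWAIT);

		if (len >= 0)
			break;			/* packet accepted */
		if (errno != EAGAIN)
			break;			/* real error: count and drop */
		/* device busy: back off until the queue drains, resend */
		wait_for_writable(fd);
	}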

	Arnd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-06 17:43       ` Krishna Kumar2
@ 2010-10-06 19:03         ` Michael S. Tsirkin
  0 siblings, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-10-06 19:03 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty, herbert

On Wed, Oct 06, 2010 at 11:13:31PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/05/2010 11:53:23 PM:
> 
> > > > Any idea where does this come from?
> > > > Do you see more TX interrupts? RX interrupts? Exits?
> > > > Do interrupts bounce more between guest CPUs?
> > > > 4. Identify reasons for single netperf BW regression.
> > >
> > > After testing various combinations of #txqs, #vhosts, #netperf
> > > sessions, I think the drop for 1 stream is due to TX and RX for
> > > a flow being processed on different cpus.
> >
> > Right. Can we fix it?
> 
> I am not sure how to. My initial patch had one thread, but it gave
> small gains and ran into limitations once the number of sessions
> became large.

Sure. We will need multiple RX queues, and have a single
thread handle a TX and RX pair. Then we need to make sure packets
from a given flow on TX land on the same thread on RX.
As flows can be hashed differently, for this to work we'll have to
expose this info in the host/guest interface.
But since multiqueue implies host/guest ABI changes anyway,
this point is moot.
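
As a toy illustration of the pairing idea (a hypothetical sketch, not
part of any patch in this thread; the names are made up): both sides
derive the queue pair from the same flow hash, so vhost thread i
services TX[i] and RX[i] and both directions of a flow stay on one
thread.

	/* Map a flow hash to a TX/RX queue pair.  Guest and host must
	 * agree on this mapping, which is why the hash (or the chosen
	 * queue) has to be visible in the host/guest interface. */
	static unsigned int flow_to_pair(unsigned int flow_hash,
					 unsigned int npairs)
	{
		return flow_hash % npairs;
	}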

BTW, an interesting approach could be using bonding
and multiple virtio-net interfaces.
What are the disadvantages of such a setup?  One advantage
is that it can be made to work in existing guests.

> > >  I did two more tests:
> > >     1. Pin vhosts to same CPU:
> > >         - BW drop is much lower for 1 stream case (-5 to -8% range)
> > >         - But performance is not so high for more sessions.
> > >     2. Changed vhost to be single threaded:
> > >           - No degradation for 1 session, and improvement for up to
> > >             8, sometimes 16 streams (5-12%).
> > >           - BW degrades after that, all the way to 128 netperf
> > >             sessions.
> > >           - But overall CPU utilization improves.
> > >             Summary of the entire run (for 1-128 sessions):
> > >                 txq=4:  BW: (-2.3)      CPU: (-16.5)    RCPU: (-5.3)
> > >                 txq=16: BW: (-1.9)      CPU: (-24.9)    RCPU: (-9.6)
> > >
> > > I don't see any reasons mentioned above.  However, for higher
> > > number of netperf sessions, I see a big increase in retransmissions:
> >
> > Hmm, ok, and do you see any errors?
> 
> I haven't seen any errors in the statistics, messages, etc.

Herbert, could you help debug this increase in retransmissions,
please?  An older mail on netdev in this thread has some numbers that
seem to imply that we start hitting retransmissions much more often as
the number of flows goes up.

> Also no
> retransmissions for txq=1.

While it's nice that we have this parameter, the need to choose between
single-stream and multi-stream performance when you start the VM makes
this patch much less interesting, IMHO.


-- 
MST

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]           ` <OF0BDA6B3A.F673A449-ON652577BC.00422911-652577BC.0043474B@LocalDomain>
@ 2010-10-14 12:47             ` Krishna Kumar2
  0 siblings, 0 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-14 12:47 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, kvm, Michael S. Tsirkin, netdev, rusty

Krishna Kumar2/India/IBM wrote on 10/14/2010 05:47:54 PM:

Sorry, it should read "txq=8" below.

- KK

> There's a significant reduction in CPU/SD utilization with your
> patch. Following is the performance of ORG vs MQ+mm patch:
>
> _________________________________________________
>                Org vs MQ+mm patch txq=2
> #     BW%     CPU%     RCPU%    SD%     RSD%
> _________________________________________________
> 1     2.26    -1.16    .27      -20.00  0
> 2     35.07   29.90    21.81     0      -11.11
> 4     55.03   84.57    37.66     26.92  -4.62
> 8     73.16   118.69   49.21     45.63  -.46
> 16    77.43   98.81    47.89     24.07  -7.80
> 24    71.59   105.18   48.44     62.84  18.18
> 32    70.91   102.38   47.15     49.22  8.54
> 40    63.26   90.58    41.00     85.27  37.33
> 48    45.25   45.99    11.23     14.31  -12.91
> 64    42.78   41.82    5.50      .43    -25.12
> 80    31.40   7.31     -18.69    15.78  -11.93
> 96    27.60   7.79     -18.54    17.39  -10.98
> 128   23.46   -11.89   -34.41    -.41   -25.53
> _________________________________________________
> BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6
>
> Following is the performance of MQ vs MQ+mm patch:
> _____________________________________________________
>             MQ vs MQ+mm patch
> #     BW%      CPU%       RCPU%    SD%      RSD%
> _____________________________________________________
> 1      4.98    -.58       .84      -20.00    0
> 2      5.17     2.96      2.29      0       -4.00
> 4     -.18      .25      -.16       3.12     .98
> 8     -5.47    -1.36     -1.98      17.18    16.57
> 16    -1.90    -6.64     -3.54     -14.83   -12.12
> 24    -.01      23.63     14.65     57.61    46.64
> 32     .27     -3.19      -3.11    -22.98   -22.91
> 40    -1.06    -2.96      -2.96    -4.18    -4.10
> 48    -.28     -2.34      -3.71    -2.41    -3.81
> 64     9.71     33.77      30.65    81.44    77.09
> 80    -10.69    -31.07    -31.70   -29.22   -29.88
> 96    -1.14     5.98       .56     -11.57   -16.14
> 128   -.93     -15.60     -18.31   -19.89   -22.65
> _____________________________________________________
>   BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
> _____________________________________________________
>
> Each test case runs for 60 secs, summed over two runs (except
> when the number of netperf sessions is 1, which has 7 runs
> of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning was done
> other than using taskset to pin each vhost to cpus 0-3.
>
> Thanks,
>
> - KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
       [not found]         ` <OFEC86A094.39835EBF-ON652577BC.002F9AAF-652577BC.003186B5@LocalDomain>
@ 2010-10-14 12:17           ` Krishna Kumar2
       [not found]           ` <OF0BDA6B3A.F673A449-ON652577BC.00422911-652577BC.0043474B@LocalDomain>
  1 sibling, 0 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-14 12:17 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, kvm, Michael S. Tsirkin, netdev, rusty

Krishna Kumar2/India/IBM wrote on 10/14/2010 02:34:01 PM:

> void vhost_poll_queue(struct vhost_poll *poll)
> {
>         struct vhost_virtqueue *vq = vhost_find_vq(poll);
>
>         vhost_work_queue(vq, &poll->work);
> }
>
> Since poll batches packets, find_vq does not seem to add much
> to the CPU utilization (or BW). I am sure the code can be
> optimized further.
>
> The results I sent in my last mail were without your use_mm
> patch, and the only tuning was to make vhost threads run on
> only cpus 0-3 (though the performance is good even without
> that). I will test it later today with the use_mm patch too.

There's a significant reduction in CPU/SD utilization with your
patch. Following is the performance of ORG vs MQ+mm patch:

_________________________________________________
               Org vs MQ+mm patch txq=2
#     BW%     CPU%     RCPU%    SD%     RSD%
_________________________________________________
1     2.26    -1.16    .27      -20.00  0
2     35.07   29.90    21.81     0      -11.11
4     55.03   84.57    37.66     26.92  -4.62
8     73.16   118.69   49.21     45.63  -.46
16    77.43   98.81    47.89     24.07  -7.80
24    71.59   105.18   48.44     62.84  18.18
32    70.91   102.38   47.15     49.22  8.54
40    63.26   90.58    41.00     85.27  37.33
48    45.25   45.99    11.23     14.31  -12.91
64    42.78   41.82    5.50      .43    -25.12
80    31.40   7.31     -18.69    15.78  -11.93
96    27.60   7.79     -18.54    17.39  -10.98
128   23.46   -11.89   -34.41    -.41   -25.53
_________________________________________________
BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6


Following is the performance of MQ vs MQ+mm patch:
_____________________________________________________
            MQ vs MQ+mm patch
#     BW%      CPU%       RCPU%    SD%      RSD%
_____________________________________________________
1      4.98    -.58       .84      -20.00    0
2      5.17     2.96      2.29      0       -4.00
4     -.18      .25      -.16       3.12     .98
8     -5.47    -1.36     -1.98      17.18    16.57
16    -1.90    -6.64     -3.54     -14.83   -12.12
24    -.01      23.63     14.65     57.61    46.64
32     .27     -3.19      -3.11    -22.98   -22.91
40    -1.06    -2.96      -2.96    -4.18    -4.10
48    -.28     -2.34      -3.71    -2.41    -3.81
64     9.71     33.77      30.65    81.44    77.09
80    -10.69    -31.07    -31.70   -29.22   -29.88
96    -1.14     5.98       .56     -11.57   -16.14
128   -.93     -15.60     -18.31   -19.89   -22.65
_____________________________________________________
  BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
_____________________________________________________

Each test case runs for 60 secs, summed over two runs (except
when the number of netperf sessions is 1, which has 7 runs
of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning was done
other than using taskset to pin each vhost to cpus 0-3.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-14  8:17       ` Michael S. Tsirkin
@ 2010-10-14  9:04         ` Krishna Kumar2
       [not found]         ` <OFEC86A094.39835EBF-ON652577BC.002F9AAF-652577BC.003186B5@LocalDomain>
  1 sibling, 0 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-14  9:04 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

> "Michael S. Tsirkin" <mst@redhat.com>
> > > What other shared TX/RX locks are there?  In your setup, is the same
> > > macvtap socket structure used for RX and TX?  If yes this will create
> > > cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> > > there might also be contention on the lock in sk_sleep waitqueue.
> > > Anything else?
> >
> > The patch is not introducing any locking (both vhost and virtio-net).
> > The single stream drop is due to different vhost threads handling the
> > RX/TX traffic.
> >
> > I added a heuristic (fuzzy) to determine if more than one flow
> > is being used on the device, and if not, use vhost[0] for both
> > tx and rx (vhost_poll_queue figures this out before waking up
> > the suitable vhost thread).  Testing shows that single stream
> > performance is as good as the original code.
>
> ...
>
> > This approach works nicely for both single and multiple streams.
> > Does this look good?
> >
> > Thanks,
> >
> > - KK
>
> Yes, but I guess it depends on the heuristic :) What's the logic?

I track how recently each txq was used. If 0 or 1 txqs were used
recently, use vq[0] (which also handles RX). Otherwise, use the
txq's own vq (vq[1-n]). The code is:

/*
 * Algorithm for selecting vq:
 *
 * Condition                                    Return
 * RX vq                                        vq[0]
 * If all txqs unused                           vq[0]
 * If one txq used, and new txq is same         vq[0]
 * If one txq used, and new txq is different    vq[vq->qnum]
 * If > 1 txqs used                             vq[vq->qnum]
 *      Where "used" means the txq was used in the last 'n' jiffies.
 *
 * Note: locking is not required as an update race will only result in
 * a different worker being woken up.
 */
static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
{
	if (poll->vq->qnum) {
		struct vhost_dev *dev = poll->vq->dev;
		struct vhost_virtqueue *vq = &dev->vqs[0];
		unsigned long max_time = jiffies - 5; /* Some macro needed */
		unsigned long *table = dev->jiffies;
		int i, used = 0;

		for (i = 0; i < dev->nvqs - 1; i++) {
			if (time_after_eq(table[i], max_time) && ++used > 1) {
				vq = poll->vq;
				break;
			}
		}
		table[poll->vq->qnum - 1] = jiffies;
		return vq;
	}

	/* RX is handled by the same worker thread */
	return poll->vq;
}

void vhost_poll_queue(struct vhost_poll *poll)
{
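        /* vhost_find_vq() picks vq[0] or this txq's vq via the heuristic above */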
        struct vhost_virtqueue *vq = vhost_find_vq(poll);

        vhost_work_queue(vq, &poll->work);
}

Since poll batches packets, find_vq does not seem to add much
to the CPU utilization (or BW). I am sure the code can be
optimized further.

The results I sent in my last mail were without your use_mm
patch, and the only tuning was to make vhost threads run on
only cpus 0-3 (though the performance is good even without
that). I will test it later today with the use_mm patch too.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-14  7:58     ` Krishna Kumar2
@ 2010-10-14  8:17       ` Michael S. Tsirkin
  2010-10-14  9:04         ` Krishna Kumar2
       [not found]         ` <OFEC86A094.39835EBF-ON652577BC.002F9AAF-652577BC.003186B5@LocalDomain>
  0 siblings, 2 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-10-14  8:17 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

On Thu, Oct 14, 2010 at 01:28:58PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/12/2010 10:39:07 PM:
> 
> > > Sorry for the delay, I was sick the last couple of days. The results
> > > with your patch are (%'s over original code):
> > >
> > > Code               BW%       CPU%       RemoteCPU
> > > MQ     (#txq=16)   31.4%     38.42%     6.41%
> > > MQ+MST (#txq=16)   28.3%     18.9%      -10.77%
> > >
> > > The patch helps CPU utilization but didn't help single stream
> > > drop.
> > >
> > > Thanks,
> >
> > What other shared TX/RX locks are there?  In your setup, is the same
> > macvtap socket structure used for RX and TX?  If yes this will create
> > cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> > there might also be contention on the lock in sk_sleep waitqueue.
> > Anything else?
> 
> The patch is not introducing any locking (both vhost and virtio-net).
> The single stream drop is due to different vhost threads handling the
> RX/TX traffic.
> 
> I added a heuristic (fuzzy) to determine if more than one flow
> is being used on the device, and if not, use vhost[0] for both
> tx and rx (vhost_poll_queue figures this out before waking up
> the suitable vhost thread).  Testing shows that single stream
> performance is as good as the original code.

...

> This approach works nicely for both single and multiple streams.
> Does this look good?
> 
> Thanks,
> 
> - KK

Yes, but I guess it depends on the heuristic :) What's the logic?

-- 
MST

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-12 17:09   ` Michael S. Tsirkin
@ 2010-10-14  7:58     ` Krishna Kumar2
  2010-10-14  8:17       ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-14  7:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/12/2010 10:39:07 PM:

> > Sorry for the delay, I was sick the last couple of days. The results
> > with your patch are (%'s over original code):
> >
> > Code               BW%       CPU%       RemoteCPU
> > MQ     (#txq=16)   31.4%     38.42%     6.41%
> > MQ+MST (#txq=16)   28.3%     18.9%      -10.77%
> >
> > The patch helps CPU utilization but didn't help single stream
> > drop.
> >
> > Thanks,
>
> What other shared TX/RX locks are there?  In your setup, is the same
> macvtap socket structure used for RX and TX?  If yes this will create
> cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> there might also be contention on the lock in sk_sleep waitqueue.
> Anything else?

The patch is not introducing any locking (both vhost and virtio-net).
The single stream drop is due to different vhost threads handling the
RX/TX traffic.

I added a heuristic (fuzzy) to determine if more than one flow
is being used on the device, and if not, use vhost[0] for both
tx and rx (vhost_poll_queue figures this out before waking up
the suitable vhost thread).  Testing shows that single stream
performance is as good as the original code.

__________________________________________________________________________
		       #txqs = 2 (#vhosts = 3)
#     BW1     BW2   (%)       CPU1    CPU2 (%)       RCPU1   RCPU2 (%)
__________________________________________________________________________
1     77344   74973 (-3.06)   172     143 (-16.86)   358     324 (-9.49)
2     20924   21107 (.87)     107     103 (-3.73)    220     217 (-1.36)
4     21629   32911 (52.16)   214     391 (82.71)    446     616 (38.11)
8     21678   34359 (58.49)   428     845 (97.42)    892     1286 (44.17)
16    22046   34401 (56.04)   841     1677 (99.40)   1785    2585 (44.81)
24    22396   35117 (56.80)   1272    2447 (92.37)   2667    3863 (44.84)
32    22750   35158 (54.54)   1719    3233 (88.07)   3569    5143 (44.10)
40    23041   35345 (53.40)   2219    3970 (78.90)   4478    6410 (43.14)
48    23209   35219 (51.74)   2707    4685 (73.06)   5386    7684 (42.66)
64    23215   35209 (51.66)   3639    6195 (70.23)   7206    10218 (41.79)
80    23443   35179 (50.06)   4633    7625 (64.58)   9051    12745 (40.81)
96    24006   36108 (50.41)   5635    9096 (61.41)   10864   15283 (40.67)
128   23601   35744 (51.45)   7475    12104 (61.92)  14495   20405 (40.77)
__________________________________________________________________________
SUM:     BW: (37.6)     CPU: (69.0)     RCPU: (41.2)

__________________________________________________________________________
		       #txqs = 8 (#vhosts = 5)
#     BW1     BW2    (%)      CPU1     CPU2 (%)      RCPU1     RCPU2 (%)
__________________________________________________________________________
1     77344   75341 (-2.58)   172     171 (-.58)     358     356 (-.55)
2     20924   26872 (28.42)   107     135 (26.16)    220     262 (19.09)
4     21629   33594 (55.31)   214     394 (84.11)    446     615 (37.89)
8     21678   39714 (83.19)   428     949 (121.72)   892     1358 (52.24)
16    22046   39879 (80.88)   841     1791 (112.96)  1785    2737 (53.33)
24    22396   38436 (71.61)   1272    2111 (65.95)   2667    3453 (29.47)
32    22750   38776 (70.44)   1719    3594 (109.07)  3569    5421 (51.89)
40    23041   38023 (65.02)   2219    4358 (96.39)   4478    6507 (45.31)
48    23209   33811 (45.68)   2707    4047 (49.50)   5386    6222 (15.52)
64    23215   30212 (30.13)   3639    3858 (6.01)    7206    5819 (-19.24)
80    23443   34497 (47.15)   4633    7214 (55.70)   9051    10776 (19.05)
96    24006   30990 (29.09)   5635    5731 (1.70)    10864   8799 (-19.00)
128   23601   29413 (24.62)   7475    7804 (4.40)    14495   11638 (-19.71)
__________________________________________________________________________
SUM:     BW: (40.1)     CPU: (35.7)     RCPU: (4.1)
_______________________________________________________________________________


The SD numbers are also good (same table as before, but SD
instead of CPU):

__________________________________________________________________________
		       #txqs = 2 (#vhosts = 3)
#     BW%       SD1     SD2 (%)        RSD1     RSD2 (%)
__________________________________________________________________________
1     -3.06     5       4 (-20.00)     21       19 (-9.52)
2     .87       6       6 (0)          27       27 (0)
4     52.16     26      32 (23.07)     108      103 (-4.62)
8     58.49     103     146 (41.74)    431      445 (3.24)
16    56.04     407     514 (26.28)    1729     1586 (-8.27)
24    56.80     934     1161 (24.30)   3916     3665 (-6.40)
32    54.54     1668    2160 (29.49)   6925     6872 (-.76)
40    53.40     2655    3317 (24.93)   10712    10707 (-.04)
48    51.74     3920    4486 (14.43)   15598    14715 (-5.66)
64    51.66     7096    8250 (16.26)   28099    27211 (-3.16)
80    50.06     11240   12586 (11.97)  43913    42070 (-4.19)
96    50.41     16342   16976 (3.87)   63017    57048 (-9.47)
128   51.45     29254   32069 (9.62)   113451   108113 (-4.70)
__________________________________________________________________________
SUM:     BW: (37.6)     SD: (10.9)     RSD: (-5.3)

__________________________________________________________________________
		       #txqs = 8 (#vhosts = 5)
#     BW%       SD1     SD2 (%)         RSD1     RSD2 (%)
__________________________________________________________________________
1     -2.58     5       5 (0)           21       21 (0)
2     28.42     6       6 (0)           27       25 (-7.40)
4     55.31     26      32 (23.07)      108      102 (-5.55)
8     83.19     103     128 (24.27)     431      368 (-14.61)
16    80.88     407     593 (45.70)     1729     1814 (4.91)
24    71.61     934     965 (3.31)      3916     3156 (-19.40)
32    70.44     1668    3232 (93.76)    6925     9752 (40.82)
40    65.02     2655    5134 (93.37)    10712    15340 (43.20)
48    45.68     3920    4592 (17.14)    15598    14122 (-9.46)
64    30.13     7096    3928 (-44.64)   28099    11880 (-57.72)
80    47.15     11240   18389 (63.60)   43913    55154 (25.59)
96    29.09     16342   21695 (32.75)   63017    66892 (6.14)
128   24.62     29254   36371 (24.32)   113451   109219 (-3.73)
__________________________________________________________________________
SUM:     BW: (40.1)     SD: (29.0)     RSD: (0)

This approach works nicely for both single and multiple streams.
Does this look good?

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-11  7:21 ` Krishna Kumar2
@ 2010-10-12 17:09   ` Michael S. Tsirkin
  2010-10-14  7:58     ` Krishna Kumar2
  0 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-10-12 17:09 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

On Mon, Oct 11, 2010 at 12:51:27PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/06/2010 07:04:31 PM:
> 
> > On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > > for degradation for 1 stream case:
> >
> > I thought about possible RX/TX contention reasons, and I realized that
> > we get/put the mm counter all the time.  So I wrote the following: I
> > haven't seen any performance gain from this in a single queue case, but
> > maybe this will help multiqueue?
> 
> Sorry for the delay, I was sick the last couple of days. The results
> with your patch are (%'s over original code):
> 
> Code               BW%       CPU%       RemoteCPU
> MQ     (#txq=16)   31.4%     38.42%     6.41%
> MQ+MST (#txq=16)   28.3%     18.9%      -10.77%
> 
> The patch helps CPU utilization but didn't help single stream
> drop.
> 
> Thanks,

What other shared TX/RX locks are there?  In your setup, is the same
macvtap socket structure used for RX and TX?  If yes this will create
cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
there might also be contention on the lock in sk_sleep waitqueue.
Anything else?
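
To illustrate the false-sharing concern (my sketch, not code from the
patch set; the cache line size and field names are assumptions): two
counters that share a cache line get bounced between CPUs when TX
writes one and RX writes the other, and padding them onto separate
lines avoids that.

	/* Shared line: TX and RX on different CPUs bounce it. */
	struct counters_shared {
		atomic_t wmem;		/* written by the TX path */
		atomic_t rmem;		/* written by the RX path */
	};

	/* Separate lines: each CPU dirties only its own. */
	struct counters_padded {
		atomic_t wmem ____cacheline_aligned_in_smp;
		atomic_t rmem ____cacheline_aligned_in_smp;
	};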

-- 
MST

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-06 13:34 Michael S. Tsirkin
  2010-10-06 17:02 ` Krishna Kumar2
@ 2010-10-11  7:21 ` Krishna Kumar2
  2010-10-12 17:09   ` Michael S. Tsirkin
  1 sibling, 1 reply; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-11  7:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/06/2010 07:04:31 PM:

> On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > for degradation for 1 stream case:
>
> I thought about possible RX/TX contention reasons, and I realized that
> we get/put the mm counter all the time.  So I wrote the following: I
> haven't seen any performance gain from this in a single queue case, but
> maybe this will help multiqueue?

Sorry for the delay, I was sick the last couple of days. The results
with your patch are (%'s over original code):

Code               BW%       CPU%       RemoteCPU
MQ     (#txq=16)   31.4%     38.42%     6.41%
MQ+MST (#txq=16)   28.3%     18.9%      -10.77%

The patch helps CPU utilization but didn't help single stream
drop.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
  2010-10-06 13:34 Michael S. Tsirkin
@ 2010-10-06 17:02 ` Krishna Kumar2
  2010-10-11  7:21 ` Krishna Kumar2
  1 sibling, 0 replies; 26+ messages in thread
From: Krishna Kumar2 @ 2010-10-06 17:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty

"Michael S. Tsirkin" <mst@redhat.com> wrote on 10/06/2010 07:04:31 PM:

> "Michael S. Tsirkin" <mst@redhat.com>
> 10/06/2010 07:04 PM
>
> To
>
> Krishna Kumar2/India/IBM@IBMIN
>
> cc
>
> rusty@rustcorp.com.au, davem@davemloft.net, kvm@vger.kernel.org,
> arnd@arndb.de, netdev@vger.kernel.org, avi@redhat.com,
anthony@codemonkey.ws
>
> Subject
>
> Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > for degradation for 1 stream case:
>
> I thought about possible RX/TX contention reasons, and I realized that
> we get/put the mm counter all the time.  So I wrote the following: I
> haven't seen any performance gain from this in a single queue case, but
> maybe this will help multiqueue?

Great! I am on vacation tomorrow, but will test with this patch
tomorrow night.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
@ 2010-10-06 13:34 Michael S. Tsirkin
  2010-10-06 17:02 ` Krishna Kumar2
  2010-10-11  7:21 ` Krishna Kumar2
  0 siblings, 2 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2010-10-06 13:34 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: rusty, davem, kvm, arnd, netdev, avi, anthony

On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> for degradation for 1 stream case:

I thought about possible RX/TX contention reasons, and I realized that
we get/put the mm counter all the time.  So I wrote the following: I
haven't seen any performance gain from this in a single queue case, but
maybe this will help multiqueue?

Thanks,

Michael S. Tsirkin (2):
  vhost: put mm after thread stop
  vhost-net: batch use/unuse mm

 drivers/vhost/net.c   |    7 -------
 drivers/vhost/vhost.c |   16 ++++++++++------
 2 files changed, 10 insertions(+), 13 deletions(-)
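
For reference, the batching idea is roughly the following (a sketch
only, not the actual patch; next_work_or_sleep() is a placeholder for
the worker's existing dequeue/sleep logic):

	static int vhost_worker(void *data)
	{
		struct vhost_dev *dev = data;

		use_mm(dev->mm);	/* adopt the owner's mm once at start */
		for (;;) {
			struct vhost_work *work = next_work_or_sleep(dev);

			if (!work)
				break;	/* thread is being stopped */
			work->fn(work);	/* runs without per-item use_mm/unuse_mm */
		}
		unuse_mm(dev->mm);	/* put the mm only after the thread stops */
		return 0;
	}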

-- 
1.7.3-rc1

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2010-10-14 12:47 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-17 10:03 [v2 RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-17 10:03 ` [v2 RFC PATCH 1/4] Change virtqueue structure Krishna Kumar
2010-09-17 10:03 ` [v2 RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
2010-09-17 10:25   ` Eric Dumazet
2010-09-17 12:27     ` Krishna Kumar2
2010-09-17 13:20       ` Krishna Kumar2
2010-09-17 10:03 ` [v2 RFC PATCH 3/4] Changes for vhost Krishna Kumar
2010-09-17 10:03 ` [v2 RFC PATCH 4/4] qemu changes Krishna Kumar
2010-09-17 15:42 ` [v2 RFC PATCH 0/4] Implement multiqueue virtio-net Sridhar Samudrala
2010-09-19 12:44 ` Michael S. Tsirkin
2010-10-05 10:40   ` Krishna Kumar2
2010-10-05 18:23     ` Michael S. Tsirkin
2010-10-06 17:43       ` Krishna Kumar2
2010-10-06 19:03         ` Michael S. Tsirkin
2010-10-06 12:19     ` Arnd Bergmann
2010-10-06 17:14       ` Krishna Kumar2
2010-10-06 17:50         ` Arnd Bergmann
2010-10-06 13:34 Michael S. Tsirkin
2010-10-06 17:02 ` Krishna Kumar2
2010-10-11  7:21 ` Krishna Kumar2
2010-10-12 17:09   ` Michael S. Tsirkin
2010-10-14  7:58     ` Krishna Kumar2
2010-10-14  8:17       ` Michael S. Tsirkin
2010-10-14  9:04         ` Krishna Kumar2
     [not found]         ` <OFEC86A094.39835EBF-ON652577BC.002F9AAF-652577BC.003186B5@LocalDomain>
2010-10-14 12:17           ` Krishna Kumar2
     [not found]           ` <OF0BDA6B3A.F673A449-ON652577BC.00422911-652577BC.0043474B@LocalDomain>
2010-10-14 12:47             ` Krishna Kumar2
