* [RFC PATCH 0/4] Implement multiqueue virtio-net
@ 2010-09-08 7:28 Krishna Kumar
From: Krishna Kumar @ 2010-09-08 7:28 UTC (permalink / raw)
To: rusty, davem; +Cc: netdev, kvm, anthony, Krishna Kumar, mst
The following patches implement transmit mq in virtio-net. Also
included are the userspace qemu changes.
1. This feature was first implemented with a single vhost.
Testing showed 3-8% performance gain for up to 8 netperf
sessions (and sometimes 16), but BW dropped with more
sessions. However, implementing per-txq vhost improved
BW significantly all the way to 128 sessions.
2. For this mq TX patch, 1 daemon is created for RX and 'n'
daemons for the 'n' TXQs, for a total of (n+1) daemons.
The (subsequent) RX mq patch changes that to a total of
'n' daemons, where RX and TX vq's share 1 daemon.
3. Service Demand increases for TCP, but significantly
improves for UDP.
4. Interoperability: many combinations of qemu, host and
guest versions (though not all) were tested together.
Enabling mq on virtio:
-----------------------
When the following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default:off)
then #txqueues = #cpus. The #txqueues can be overridden with
the optional 'numtxqs' option, e.g. for an smp=4 guest:
vhost=on,mq=on -> #txqueues = 4
vhost=on,mq=on,numtxqs=8 -> #txqueues = 8
vhost=on,mq=on,numtxqs=2 -> #txqueues = 2
Performance (guest -> local host):
-----------------------------------
System configuration:
Host: 8 Intel Xeon, 8 GB memory
Guest: 4 cpus, 2 GB memory
All testing was done without any tuning; TCP netperf used 64K I/O
_______________________________________________________________________________
TCP (#numtxqs=2)
N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
_______________________________________________________________________________
4 26387 40716 (54.30) 20 28 (40.00) 86 85 (-1.16)
8 24356 41843 (71.79) 88 129 (46.59) 372 362 (-2.68)
16 23587 40546 (71.89) 375 564 (50.40) 1558 1519 (-2.50)
32 22927 39490 (72.24) 1617 2171 (34.26) 6694 5722 (-14.52)
48 23067 39238 (70.10) 3931 5170 (31.51) 15823 13552 (-14.35)
64 22927 38750 (69.01) 7142 9914 (38.81) 28972 26173 (-9.66)
96 22568 38520 (70.68) 16258 27844 (71.26) 65944 73031 (10.74)
_______________________________________________________________________________
UDP (#numtxqs=8)
N# BW1 BW2 (%) SD1 SD2 (%)
__________________________________________________________
4 29836 56761 (90.24) 67 63 (-5.97)
8 27666 63767 (130.48) 326 265 (-18.71)
16 25452 60665 (138.35) 1396 1269 (-9.09)
32 26172 63491 (142.59) 5617 4202 (-25.19)
48 26146 64629 (147.18) 12813 9316 (-27.29)
64 25575 65448 (155.90) 23063 16346 (-29.12)
128 26454 63772 (141.06) 91054 85051 (-6.59)
__________________________________________________________
N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
SD for original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
SD for new code. e.g. BW2=40716 means average BW2 was
20358 mbps.
Next steps:
-----------
1. The mq RX patch is also complete; I plan to submit it once TX is OK.
2. Cache-align data structures: I didn't see any BW/SD improvement
after making the sq's (and similarly for vhost) cache-aligned
statically:
struct virtnet_info {
...
struct send_queue sq[16] ____cacheline_aligned_in_smp;
...
};
Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts
CPU0 CPU1 CPU2 CPU3
40: 0 0 0 0 PCI-MSI-edge virtio0-config
41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
Review/feedback appreciated.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
* [RFC PATCH 1/4] Add a new API to virtio-pci
From: Krishna Kumar @ 2010-09-08 7:29 UTC (permalink / raw)
To: rusty, davem; +Cc: netdev, Krishna Kumar, anthony, kvm, mst
Add virtio_get_queue_index() to get the queue index of a
vq. This is needed by the cb handler to locate the queue
that should be processed.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
drivers/virtio/virtio_pci.c | 9 +++++++++
include/linux/virtio.h | 3 +++
2 files changed, 12 insertions(+)
diff -ruNp org/include/linux/virtio.h tx_only/include/linux/virtio.h
--- org/include/linux/virtio.h 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/include/linux/virtio.h 2010-09-08 10:23:36.000000000 +0530
@@ -136,4 +136,7 @@ struct virtio_driver {
int register_virtio_driver(struct virtio_driver *drv);
void unregister_virtio_driver(struct virtio_driver *drv);
+
+/* return the internal queue index associated with the virtqueue */
+extern int virtio_get_queue_index(struct virtqueue *vq);
#endif /* _LINUX_VIRTIO_H */
diff -ruNp org/drivers/virtio/virtio_pci.c tx_only/drivers/virtio/virtio_pci.c
--- org/drivers/virtio/virtio_pci.c 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/virtio/virtio_pci.c 2010-09-08 10:23:16.000000000 +0530
@@ -359,6 +359,15 @@ static int vp_request_intx(struct virtio
return err;
}
+/* Return the internal queue index associated with the virtqueue */
+int virtio_get_queue_index(struct virtqueue *vq)
+{
+ struct virtio_pci_vq_info *info = vq->priv;
+
+ return info->queue_index;
+}
+EXPORT_SYMBOL(virtio_get_queue_index);
+
static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
void (*callback)(struct virtqueue *vq),
const char *name,
* [RFC PATCH 2/4] Changes for virtio-net
From: Krishna Kumar @ 2010-09-08 7:29 UTC (permalink / raw)
To: rusty, davem; +Cc: netdev, mst, anthony, Krishna Kumar, kvm
Implement mq virtio-net driver.
Though struct virtio_net_config changes, the driver works with
old qemus, since the last element is not accessed unless qemu
sets VIRTIO_NET_F_NUMTXQS.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
drivers/net/virtio_net.c | 213 ++++++++++++++++++++++++++---------
include/linux/virtio_net.h | 6
2 files changed, 166 insertions(+), 53 deletions(-)
diff -ruNp org/include/linux/virtio_net.h tx_only/include/linux/virtio_net.h
--- org/include/linux/virtio_net.h 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/include/linux/virtio_net.h 2010-09-08 10:39:22.000000000 +0530
@@ -7,6 +7,9 @@
#include <linux/virtio_config.h>
#include <linux/if_ether.h>
+/* The maximum number of transmit queues supported */
+#define VIRTIO_MAX_TXQS 16
+
/* The feature bitmap for virtio net */
#define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */
#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
@@ -26,6 +29,7 @@
#define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */
#define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */
#define VIRTIO_NET_F_CTRL_RX_EXTRA 20 /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS 21 /* Device supports multiple TX queues */
#define VIRTIO_NET_S_LINK_UP 1 /* Link is up */
@@ -34,6 +38,8 @@ struct virtio_net_config {
__u8 mac[6];
/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
__u16 status;
+ /* number of transmit queues */
+ __u16 numtxqs;
} __attribute__((packed));
/* This is the first element of the scatter-gather list. If you don't
diff -ruNp org/drivers/net/virtio_net.c tx_only/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/net/virtio_net.c 2010-09-08 12:14:19.000000000 +0530
@@ -40,9 +40,20 @@ module_param(gso, bool, 0444);
#define VIRTNET_SEND_COMMAND_SG_MAX 2
+/* Our representation of a send virtqueue */
+struct send_queue {
+ struct virtqueue *svq;
+
+ /* TX: fragments + linear part + virtio header */
+ struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
+};
+
struct virtnet_info {
struct virtio_device *vdev;
- struct virtqueue *rvq, *svq, *cvq;
+ int numtxqs; /* Number of tx queues */
+ struct send_queue *sq;
+ struct virtqueue *rvq;
+ struct virtqueue *cvq;
struct net_device *dev;
struct napi_struct napi;
unsigned int status;
@@ -62,9 +73,8 @@ struct virtnet_info {
/* Chain pages by the private ptr. */
struct page *pages;
- /* fragments + linear part + virtio header */
+ /* RX: fragments + linear part + virtio header */
struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
- struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
};
struct skb_vnet_hdr {
@@ -120,12 +130,13 @@ static struct page *get_a_page(struct vi
static void skb_xmit_done(struct virtqueue *svq)
{
struct virtnet_info *vi = svq->vdev->priv;
+ int qnum = virtio_get_queue_index(svq) - 1; /* 0 is RX vq */
/* Suppress further interrupts. */
virtqueue_disable_cb(svq);
/* We were probably waiting for more output buffers. */
- netif_wake_queue(vi->dev);
+ netif_wake_subqueue(vi->dev, qnum);
}
static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -495,12 +506,13 @@ again:
return received;
}
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+ struct virtqueue *svq)
{
struct sk_buff *skb;
unsigned int len, tot_sgs = 0;
- while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+ while ((skb = virtqueue_get_buf(svq, &len)) != NULL) {
pr_debug("Sent skb %p\n", skb);
vi->dev->stats.tx_bytes += skb->len;
vi->dev->stats.tx_packets++;
@@ -510,7 +522,8 @@ static unsigned int free_old_xmit_skbs(s
return tot_sgs;
}
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+ struct virtqueue *svq, struct scatterlist *tx_sg)
{
struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -548,12 +561,12 @@ static int xmit_skb(struct virtnet_info
/* Encode metadata header at front. */
if (vi->mergeable_rx_bufs)
- sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+ sg_set_buf(tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
else
- sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+ sg_set_buf(tx_sg, &hdr->hdr, sizeof hdr->hdr);
- hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
- return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+ hdr->num_sg = skb_to_sgvec(skb, tx_sg + 1, 0, skb->len) + 1;
+ return virtqueue_add_buf(svq, tx_sg, hdr->num_sg,
0, skb);
}
@@ -561,31 +574,34 @@ static netdev_tx_t start_xmit(struct sk_
{
struct virtnet_info *vi = netdev_priv(dev);
int capacity;
+ int qnum = skb_get_queue_mapping(skb);
+ struct virtqueue *svq = vi->sq[qnum].svq;
/* Free up any pending old buffers before queueing new ones. */
- free_old_xmit_skbs(vi);
+ free_old_xmit_skbs(vi, svq);
/* Try to transmit */
- capacity = xmit_skb(vi, skb);
+ capacity = xmit_skb(vi, skb, svq, vi->sq[qnum].tx_sg);
/* This can happen with OOM and indirect buffers. */
if (unlikely(capacity < 0)) {
if (net_ratelimit()) {
if (likely(capacity == -ENOMEM)) {
dev_warn(&dev->dev,
- "TX queue failure: out of memory\n");
+ "TXQ (%d) failure: out of memory\n",
+ qnum);
} else {
dev->stats.tx_fifo_errors++;
dev_warn(&dev->dev,
- "Unexpected TX queue failure: %d\n",
- capacity);
+ "Unexpected TXQ (%d) failure: %d\n",
+ qnum, capacity);
}
}
dev->stats.tx_dropped++;
kfree_skb(skb);
return NETDEV_TX_OK;
}
- virtqueue_kick(vi->svq);
+ virtqueue_kick(svq);
/* Don't wait up for transmitted skbs to be freed. */
skb_orphan(skb);
@@ -594,13 +610,13 @@ static netdev_tx_t start_xmit(struct sk_
/* Apparently nice girls don't return TX_BUSY; stop the queue
* before it gets out of hand. Naturally, this wastes entries. */
if (capacity < 2+MAX_SKB_FRAGS) {
- netif_stop_queue(dev);
- if (unlikely(!virtqueue_enable_cb(vi->svq))) {
+ netif_stop_subqueue(dev, qnum);
+ if (unlikely(!virtqueue_enable_cb(svq))) {
/* More just got used, free them then recheck. */
- capacity += free_old_xmit_skbs(vi);
+ capacity += free_old_xmit_skbs(vi, svq);
if (capacity >= 2+MAX_SKB_FRAGS) {
- netif_start_queue(dev);
- virtqueue_disable_cb(vi->svq);
+ netif_start_subqueue(dev, qnum);
+ virtqueue_disable_cb(svq);
}
}
}
@@ -871,10 +887,10 @@ static void virtnet_update_status(struct
if (vi->status & VIRTIO_NET_S_LINK_UP) {
netif_carrier_on(vi->dev);
- netif_wake_queue(vi->dev);
+ netif_tx_wake_all_queues(vi->dev);
} else {
netif_carrier_off(vi->dev);
- netif_stop_queue(vi->dev);
+ netif_tx_stop_all_queues(vi->dev);
}
}
@@ -885,18 +901,112 @@ static void virtnet_config_changed(struc
virtnet_update_status(vi);
}
+#define MAX_DEVICE_NAME 16
+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
+{
+ vq_callback_t **callbacks;
+ struct virtqueue **vqs;
+ int i, err = -ENOMEM;
+ int totalvqs;
+ char **names;
+
+ /* Allocate send queues */
+ vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
+ if (!vi->sq)
+ goto out;
+
+ /* setup initial send queue parameters */
+ for (i = 0; i < numtxqs; i++)
+ sg_init_table(vi->sq[i].tx_sg, ARRAY_SIZE(vi->sq[i].tx_sg));
+
+ /*
+ * We expect 1 RX virtqueue followed by 'numtxqs' TX virtqueues, and
+ * optionally one control virtqueue.
+ */
+ totalvqs = 1 + numtxqs +
+ virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+ /* Setup parameters for find_vqs */
+ vqs = kmalloc(totalvqs * sizeof(*vqs), GFP_KERNEL);
+ callbacks = kmalloc(totalvqs * sizeof(*callbacks), GFP_KERNEL);
+ names = kzalloc(totalvqs * sizeof(*names), GFP_KERNEL);
+ if (!vqs || !callbacks || !names)
+ goto free_mem;
+
+ /* Parameters for recv virtqueue */
+ callbacks[0] = skb_recv_done;
+ names[0] = "input";
+
+ /* Parameters for send virtqueues */
+ for (i = 1; i <= numtxqs; i++) {
+ callbacks[i] = skb_xmit_done;
+ names[i] = kmalloc(MAX_DEVICE_NAME * sizeof(*names[i]),
+ GFP_KERNEL);
+ if (!names[i])
+ goto free_mem;
+ sprintf(names[i], "output.%d", i - 1);
+ }
+
+ /* Parameters for control virtqueue, if any */
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+ callbacks[i] = NULL;
+ names[i] = "control";
+ }
+
+ err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
+ (const char **)names);
+ if (err)
+ goto free_mem;
+
+ vi->rvq = vqs[0];
+ for (i = 0; i < numtxqs; i++)
+ vi->sq[i].svq = vqs[i + 1];
+
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+ vi->cvq = vqs[i + 1];
+
+ if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+ vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+ }
+
+free_mem:
+ if (names) {
+ for (i = 1; i <= numtxqs; i++)
+ kfree(names[i]);
+ kfree(names);
+ }
+
+ kfree(callbacks);
+ kfree(vqs);
+
+ if (err)
+ kfree(vi->sq);
+
+out:
+ return err;
+}
+
static int virtnet_probe(struct virtio_device *vdev)
{
int err;
+ u16 numtxqs = 1;
struct net_device *dev;
struct virtnet_info *vi;
- struct virtqueue *vqs[3];
- vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
- const char *names[] = { "input", "output", "control" };
- int nvqs;
+
+ /* Find how many transmit queues the device supports */
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_NUMTXQS)) {
+ vdev->config->get(vdev,
+ offsetof(struct virtio_net_config, numtxqs),
+ &numtxqs, sizeof(numtxqs));
+ if (numtxqs < 1 || numtxqs > VIRTIO_MAX_TXQS) {
+ printk(KERN_WARNING "%s: Invalid numtxqs: %d\n",
+ __func__, numtxqs);
+ return -EINVAL;
+ }
+ }
/* Allocate ourselves a network device with room for our info */
- dev = alloc_etherdev(sizeof(struct virtnet_info));
+ dev = alloc_etherdev_mq(sizeof(struct virtnet_info), numtxqs);
if (!dev)
return -ENOMEM;
@@ -940,9 +1050,9 @@ static int virtnet_probe(struct virtio_d
vi->vdev = vdev;
vdev->priv = vi;
vi->pages = NULL;
+ vi->numtxqs = numtxqs;
INIT_DELAYED_WORK(&vi->refill, refill_work);
sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
- sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
/* If we can receive ANY GSO packets, we must allocate large ones. */
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -953,23 +1063,10 @@ static int virtnet_probe(struct virtio_d
if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
vi->mergeable_rx_bufs = true;
- /* We expect two virtqueues, receive then send,
- * and optionally control. */
- nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
-
- err = vdev->config->find_vqs(vdev, nvqs, vqs, callbacks, names);
+ /* Initialize our rx/tx queue parameters, and invoke find_vqs */
+ err = initialize_vqs(vi, numtxqs);
if (err)
- goto free;
-
- vi->rvq = vqs[0];
- vi->svq = vqs[1];
-
- if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
- vi->cvq = vqs[2];
-
- if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
- dev->features |= NETIF_F_HW_VLAN_FILTER;
- }
+ goto free_netdev;
err = register_netdev(dev);
if (err) {
@@ -986,6 +1083,9 @@ static int virtnet_probe(struct virtio_d
goto unregister;
}
+ dev_info(&dev->dev, "(virtio-net) Allocated 1 RX and %d TX vq's\n",
+ numtxqs);
+
vi->status = VIRTIO_NET_S_LINK_UP;
virtnet_update_status(vi);
netif_carrier_on(dev);
@@ -998,7 +1098,8 @@ unregister:
cancel_delayed_work_sync(&vi->refill);
free_vqs:
vdev->config->del_vqs(vdev);
-free:
+ kfree(vi->sq);
+free_netdev:
free_netdev(dev);
return err;
}
@@ -1006,11 +1107,17 @@ free:
static void free_unused_bufs(struct virtnet_info *vi)
{
void *buf;
- while (1) {
- buf = virtqueue_detach_unused_buf(vi->svq);
- if (!buf)
- break;
- dev_kfree_skb(buf);
+ int i;
+
+ for (i = 0; i < vi->numtxqs; i++) {
+ struct virtqueue *svq = vi->sq[i].svq;
+
+ while (1) {
+ buf = virtqueue_detach_unused_buf(svq);
+ if (!buf)
+ break;
+ dev_kfree_skb(buf);
+ }
}
while (1) {
buf = virtqueue_detach_unused_buf(vi->rvq);
@@ -1059,7 +1166,7 @@ static unsigned int features[] = {
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
- VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
+ VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, VIRTIO_NET_F_NUMTXQS,
};
static struct virtio_driver virtio_net_driver = {
* [RFC PATCH 3/4] Changes for vhost
From: Krishna Kumar @ 2010-09-08 7:29 UTC (permalink / raw)
To: rusty, davem; +Cc: netdev, kvm, Krishna Kumar, anthony, mst
Changes for mq vhost.
vhost_net_open is changed to allocate a vhost_net and
return. The remaining initializations are delayed till
SET_OWNER. SET_OWNER is changed so that the argument
is used to figure out how many txqs to use. Unmodified
qemus will pass NULL, which is recognized and handled
as numtxqs=1.
Besides changing handle_tx to use 'vq', this patch also
changes handle_rx to take vq as a parameter. The mq RX
patch requires this change, but until then it is consistent
(and less confusing) to make the interfaces for handling
rx and tx similar.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
drivers/vhost/net.c | 272 ++++++++++++++++++++++++++--------------
drivers/vhost/vhost.c | 152 ++++++++++++++--------
drivers/vhost/vhost.h | 15 +-
3 files changed, 289 insertions(+), 150 deletions(-)
diff -ruNp org/drivers/vhost/net.c tx_only/drivers/vhost/net.c
--- org/drivers/vhost/net.c 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/net.c 2010-09-08 10:20:54.000000000 +0530
@@ -33,12 +33,6 @@
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000
-enum {
- VHOST_NET_VQ_RX = 0,
- VHOST_NET_VQ_TX = 1,
- VHOST_NET_VQ_MAX = 2,
-};
-
enum vhost_net_poll_state {
VHOST_NET_POLL_DISABLED = 0,
VHOST_NET_POLL_STARTED = 1,
@@ -47,12 +41,12 @@ enum vhost_net_poll_state {
struct vhost_net {
struct vhost_dev dev;
- struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
- struct vhost_poll poll[VHOST_NET_VQ_MAX];
+ struct vhost_virtqueue *vqs;
+ struct vhost_poll *poll;
/* Tells us whether we are polling a socket for TX.
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
- enum vhost_net_poll_state tx_poll_state;
+ enum vhost_net_poll_state *tx_poll_state;
};
/* Pop first len bytes from iovec. Return number of segments used. */
@@ -92,28 +86,28 @@ static void copy_iovec_hdr(const struct
}
/* Caller must have TX VQ lock */
-static void tx_poll_stop(struct vhost_net *net)
+static void tx_poll_stop(struct vhost_net *net, int qnum)
{
- if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
+ if (likely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STARTED))
return;
- vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
- net->tx_poll_state = VHOST_NET_POLL_STOPPED;
+ vhost_poll_stop(&net->poll[qnum]);
+ net->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
}
/* Caller must have TX VQ lock */
-static void tx_poll_start(struct vhost_net *net, struct socket *sock)
+static void tx_poll_start(struct vhost_net *net, struct socket *sock, int qnum)
{
- if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
+ if (unlikely(net->tx_poll_state[qnum] != VHOST_NET_POLL_STOPPED))
return;
- vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
- net->tx_poll_state = VHOST_NET_POLL_STARTED;
+ vhost_poll_start(&net->poll[qnum], sock->file);
+ net->tx_poll_state[qnum] = VHOST_NET_POLL_STARTED;
}
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
-static void handle_tx(struct vhost_net *net)
+static void handle_tx(struct vhost_virtqueue *vq)
{
- struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+ struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
unsigned out, in, s;
int head;
struct msghdr msg = {
@@ -134,7 +128,7 @@ static void handle_tx(struct vhost_net *
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf) {
mutex_lock(&vq->mutex);
- tx_poll_start(net, sock);
+ tx_poll_start(net, sock, vq->qnum);
mutex_unlock(&vq->mutex);
return;
}
@@ -144,7 +138,7 @@ static void handle_tx(struct vhost_net *
vhost_disable_notify(vq);
if (wmem < sock->sk->sk_sndbuf / 2)
- tx_poll_stop(net);
+ tx_poll_stop(net, vq->qnum);
hdr_size = vq->vhost_hlen;
for (;;) {
@@ -159,7 +153,7 @@ static void handle_tx(struct vhost_net *
if (head == vq->num) {
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
- tx_poll_start(net, sock);
+ tx_poll_start(net, sock, vq->qnum);
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
@@ -189,7 +183,7 @@ static void handle_tx(struct vhost_net *
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
vhost_discard_vq_desc(vq, 1);
- tx_poll_start(net, sock);
+ tx_poll_start(net, sock, vq->qnum);
break;
}
if (err != len)
@@ -282,9 +276,9 @@ err:
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
-static void handle_rx_big(struct vhost_net *net)
+static void handle_rx_big(struct vhost_virtqueue *vq,
+ struct vhost_net *net)
{
- struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
unsigned out, in, log, s;
int head;
struct vhost_log *vq_log;
@@ -393,9 +387,9 @@ static void handle_rx_big(struct vhost_n
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
-static void handle_rx_mergeable(struct vhost_net *net)
+static void handle_rx_mergeable(struct vhost_virtqueue *vq,
+ struct vhost_net *net)
{
- struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
unsigned uninitialized_var(in), log;
struct vhost_log *vq_log;
struct msghdr msg = {
@@ -500,96 +494,179 @@ static void handle_rx_mergeable(struct v
unuse_mm(net->dev.mm);
}
-static void handle_rx(struct vhost_net *net)
+static void handle_rx(struct vhost_virtqueue *vq)
{
+ struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF))
- handle_rx_mergeable(net);
+ handle_rx_mergeable(vq, net);
else
- handle_rx_big(net);
+ handle_rx_big(vq, net);
}
static void handle_tx_kick(struct vhost_work *work)
{
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
poll.work);
- struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
- handle_tx(net);
+ handle_tx(vq);
}
static void handle_rx_kick(struct vhost_work *work)
{
struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
poll.work);
- struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
- handle_rx(net);
+ handle_rx(vq);
}
static void handle_tx_net(struct vhost_work *work)
{
- struct vhost_net *net = container_of(work, struct vhost_net,
- poll[VHOST_NET_VQ_TX].work);
- handle_tx(net);
+ struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+ work)->vq;
+
+ handle_tx(vq);
}
static void handle_rx_net(struct vhost_work *work)
{
- struct vhost_net *net = container_of(work, struct vhost_net,
- poll[VHOST_NET_VQ_RX].work);
- handle_rx(net);
+ struct vhost_virtqueue *vq = container_of(work, struct vhost_poll,
+ work)->vq;
+
+ handle_rx(vq);
}
-static int vhost_net_open(struct inode *inode, struct file *f)
+void vhost_free_vqs(struct vhost_dev *dev)
{
- struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
- struct vhost_dev *dev;
- int r;
+ struct vhost_net *n = container_of(dev, struct vhost_net, dev);
- if (!n)
- return -ENOMEM;
+ kfree(dev->work_list);
+ kfree(dev->work_lock);
+ kfree(n->tx_poll_state);
+ kfree(n->poll);
+ kfree(n->vqs);
- dev = &n->dev;
- n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
- n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
- r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
- if (r < 0) {
- kfree(n);
- return r;
+ /*
+ * Reset so that vhost_net_release (after vhost_dev_set_owner call)
+ * will notice.
+ */
+ n->vqs = NULL;
+ n->poll = NULL;
+ n->tx_poll_state = NULL;
+ dev->work_lock = NULL;
+ dev->work_list = NULL;
+}
+
+/* Upper limit of how many vq's we support - 1 RX and VIRTIO_MAX_TXQS TX vq's */
+#define MAX_VQS (1 + VIRTIO_MAX_TXQS)
+
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs)
+{
+ struct vhost_net *n = container_of(dev, struct vhost_net, dev);
+ int i, nvqs;
+ int ret;
+
+ if (numtxqs < 0 || numtxqs > VIRTIO_MAX_TXQS)
+ return -EINVAL;
+
+ if (numtxqs == 0) {
+ /* Old qemu doesn't pass arguments to set_owner, use 1 txq */
+ numtxqs = 1;
}
- vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
- vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
- n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+ /* Total number of virtqueues is numtxqs + 1 */
+ nvqs = numtxqs + 1;
+
+ n->vqs = kmalloc(nvqs * sizeof(*n->vqs), GFP_KERNEL);
+ n->poll = kmalloc(nvqs * sizeof(*n->poll), GFP_KERNEL);
+
+ /* Allocate 1 more tx_poll_state than required for convenience */
+ n->tx_poll_state = kmalloc(nvqs * sizeof(*n->tx_poll_state),
+ GFP_KERNEL);
+ dev->work_lock = kmalloc(nvqs * sizeof(*dev->work_lock),
+ GFP_KERNEL);
+ dev->work_list = kmalloc(nvqs * sizeof(*dev->work_list),
+ GFP_KERNEL);
+
+ if (!n->vqs || !n->poll || !n->tx_poll_state || !dev->work_lock ||
+ !dev->work_list) {
+ ret = -ENOMEM;
+ goto err;
+ }
- f->private_data = n;
+ /* 1 RX, followed by 'numtxqs' TX queues */
+ n->vqs[0].handle_kick = handle_rx_kick;
+
+ for (i = 1; i < nvqs; i++)
+ n->vqs[i].handle_kick = handle_tx_kick;
+
+ ret = vhost_dev_init(dev, n->vqs, nvqs);
+ if (ret < 0)
+ goto err;
+
+ vhost_poll_init(&n->poll[0], handle_rx_net, POLLIN, &n->vqs[0]);
+
+ for (i = 1; i < nvqs; i++) {
+ vhost_poll_init(&n->poll[i], handle_tx_net, POLLOUT,
+ &n->vqs[i]);
+ n->tx_poll_state[i] = VHOST_NET_POLL_DISABLED;
+ }
return 0;
+
+err:
+ /* Free all pointers that may have been allocated */
+ vhost_free_vqs(dev);
+
+ return ret;
+}
+
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+ struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
+ int ret = -ENOMEM;
+
+ if (n) {
+ struct vhost_dev *dev = &n->dev;
+
+ f->private_data = n;
+ mutex_init(&dev->mutex);
+
+ /* Defer all other initialization till user does SET_OWNER */
+ ret = 0;
+ }
+
+ return ret;
}
static void vhost_net_disable_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
{
+ int qnum = vq->qnum;
+
if (!vq->private_data)
return;
- if (vq == n->vqs + VHOST_NET_VQ_TX) {
- tx_poll_stop(n);
- n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+ if (qnum) { /* TX */
+ tx_poll_stop(n, qnum);
+ n->tx_poll_state[qnum] = VHOST_NET_POLL_DISABLED;
} else
- vhost_poll_stop(n->poll + VHOST_NET_VQ_RX);
+ vhost_poll_stop(&n->poll[qnum]);
}
static void vhost_net_enable_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
{
struct socket *sock = vq->private_data;
+ int qnum = vq->qnum;
+
if (!sock)
return;
- if (vq == n->vqs + VHOST_NET_VQ_TX) {
- n->tx_poll_state = VHOST_NET_POLL_STOPPED;
- tx_poll_start(n, sock);
+
+ if (qnum) { /* TX */
+ n->tx_poll_state[qnum] = VHOST_NET_POLL_STOPPED;
+ tx_poll_start(n, sock, qnum);
} else
- vhost_poll_start(n->poll + VHOST_NET_VQ_RX, sock->file);
+ vhost_poll_start(&n->poll[qnum], sock->file);
}
static struct socket *vhost_net_stop_vq(struct vhost_net *n,
@@ -605,11 +682,12 @@ static struct socket *vhost_net_stop_vq(
return sock;
}
-static void vhost_net_stop(struct vhost_net *n, struct socket **tx_sock,
- struct socket **rx_sock)
+static void vhost_net_stop(struct vhost_net *n, struct socket **socks)
{
- *tx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_TX);
- *rx_sock = vhost_net_stop_vq(n, n->vqs + VHOST_NET_VQ_RX);
+ int i;
+
+ for (i = n->dev.nvqs - 1; i >= 0; i--)
+ socks[i] = vhost_net_stop_vq(n, &n->vqs[i]);
}
static void vhost_net_flush_vq(struct vhost_net *n, int index)
@@ -620,26 +698,34 @@ static void vhost_net_flush_vq(struct vh
static void vhost_net_flush(struct vhost_net *n)
{
- vhost_net_flush_vq(n, VHOST_NET_VQ_TX);
- vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
+ int i;
+
+ for (i = n->dev.nvqs - 1; i >= 0; i--)
+ vhost_net_flush_vq(n, i);
}
static int vhost_net_release(struct inode *inode, struct file *f)
{
struct vhost_net *n = f->private_data;
- struct socket *tx_sock;
- struct socket *rx_sock;
+ struct vhost_dev *dev = &n->dev;
+ struct socket *socks[MAX_VQS];
+ int i;
- vhost_net_stop(n, &tx_sock, &rx_sock);
+ vhost_net_stop(n, socks);
vhost_net_flush(n);
- vhost_dev_cleanup(&n->dev);
- if (tx_sock)
- fput(tx_sock->file);
- if (rx_sock)
- fput(rx_sock->file);
+ vhost_dev_cleanup(dev);
+
+ for (i = n->dev.nvqs - 1; i >= 0; i--)
+ if (socks[i])
+ fput(socks[i]->file);
+
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
+
+ /* Free all old pointers */
+ vhost_free_vqs(dev);
+
kfree(n);
return 0;
}
@@ -717,7 +803,7 @@ static long vhost_net_set_backend(struct
if (r)
goto err;
- if (index >= VHOST_NET_VQ_MAX) {
+ if (index >= n->dev.nvqs) {
r = -ENOBUFS;
goto err;
}
@@ -762,22 +848,26 @@ err:
static long vhost_net_reset_owner(struct vhost_net *n)
{
- struct socket *tx_sock = NULL;
- struct socket *rx_sock = NULL;
+ struct socket *socks[MAX_VQS];
long err;
+ int i;
+
mutex_lock(&n->dev.mutex);
err = vhost_dev_check_owner(&n->dev);
- if (err)
- goto done;
- vhost_net_stop(n, &tx_sock, &rx_sock);
+ if (err) {
+ mutex_unlock(&n->dev.mutex);
+ return err;
+ }
+
+ vhost_net_stop(n, socks);
vhost_net_flush(n);
err = vhost_dev_reset_owner(&n->dev);
-done:
mutex_unlock(&n->dev.mutex);
- if (tx_sock)
- fput(tx_sock->file);
- if (rx_sock)
- fput(rx_sock->file);
+
+ for (i = n->dev.nvqs - 1; i >= 0; i--)
+ if (socks[i])
+ fput(socks[i]->file);
+
return err;
}
@@ -806,7 +896,7 @@ static int vhost_net_set_features(struct
}
n->dev.acked_features = features;
smp_wmb();
- for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
+ for (i = 0; i < n->dev.nvqs; ++i) {
mutex_lock(&n->vqs[i].mutex);
n->vqs[i].vhost_hlen = vhost_hlen;
n->vqs[i].sock_hlen = sock_hlen;
diff -ruNp org/drivers/vhost/vhost.c tx_only/drivers/vhost/vhost.c
--- org/drivers/vhost/vhost.c 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/vhost.c 2010-09-08 10:20:54.000000000 +0530
@@ -62,14 +62,14 @@ static int vhost_poll_wakeup(wait_queue_
/* Init poll structure */
void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
- unsigned long mask, struct vhost_dev *dev)
+ unsigned long mask, struct vhost_virtqueue *vq)
{
struct vhost_work *work = &poll->work;
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
- poll->dev = dev;
+ poll->vq = vq;
INIT_LIST_HEAD(&work->node);
work->fn = fn;
@@ -104,35 +104,35 @@ void vhost_poll_flush(struct vhost_poll
int left;
int flushing;
- spin_lock_irq(&poll->dev->work_lock);
+ spin_lock_irq(poll->vq->work_lock);
seq = work->queue_seq;
work->flushing++;
- spin_unlock_irq(&poll->dev->work_lock);
+ spin_unlock_irq(poll->vq->work_lock);
wait_event(work->done, ({
- spin_lock_irq(&poll->dev->work_lock);
+ spin_lock_irq(poll->vq->work_lock);
left = seq - work->done_seq <= 0;
- spin_unlock_irq(&poll->dev->work_lock);
+ spin_unlock_irq(poll->vq->work_lock);
left;
}));
- spin_lock_irq(&poll->dev->work_lock);
+ spin_lock_irq(poll->vq->work_lock);
flushing = --work->flushing;
- spin_unlock_irq(&poll->dev->work_lock);
+ spin_unlock_irq(poll->vq->work_lock);
BUG_ON(flushing < 0);
}
void vhost_poll_queue(struct vhost_poll *poll)
{
- struct vhost_dev *dev = poll->dev;
+ struct vhost_virtqueue *vq = poll->vq;
struct vhost_work *work = &poll->work;
unsigned long flags;
- spin_lock_irqsave(&dev->work_lock, flags);
+ spin_lock_irqsave(vq->work_lock, flags);
if (list_empty(&work->node)) {
- list_add_tail(&work->node, &dev->work_list);
+ list_add_tail(&work->node, vq->work_list);
work->queue_seq++;
- wake_up_process(dev->worker);
+ wake_up_process(vq->worker);
}
- spin_unlock_irqrestore(&dev->work_lock, flags);
+ spin_unlock_irqrestore(vq->work_lock, flags);
}
static void vhost_vq_reset(struct vhost_dev *dev,
@@ -163,7 +163,7 @@ static void vhost_vq_reset(struct vhost_
static int vhost_worker(void *data)
{
- struct vhost_dev *dev = data;
+ struct vhost_virtqueue *vq = data;
struct vhost_work *work = NULL;
unsigned uninitialized_var(seq);
@@ -171,7 +171,7 @@ static int vhost_worker(void *data)
/* mb paired w/ kthread_stop */
set_current_state(TASK_INTERRUPTIBLE);
- spin_lock_irq(&dev->work_lock);
+ spin_lock_irq(vq->work_lock);
if (work) {
work->done_seq = seq;
if (work->flushing)
@@ -179,18 +179,18 @@ static int vhost_worker(void *data)
}
if (kthread_should_stop()) {
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(vq->work_lock);
__set_current_state(TASK_RUNNING);
return 0;
}
- if (!list_empty(&dev->work_list)) {
- work = list_first_entry(&dev->work_list,
+ if (!list_empty(vq->work_list)) {
+ work = list_first_entry(vq->work_list,
struct vhost_work, node);
list_del_init(&work->node);
seq = work->queue_seq;
} else
work = NULL;
- spin_unlock_irq(&dev->work_lock);
+ spin_unlock_irq(vq->work_lock);
if (work) {
__set_current_state(TASK_RUNNING);
@@ -213,17 +213,24 @@ long vhost_dev_init(struct vhost_dev *de
dev->log_file = NULL;
dev->memory = NULL;
dev->mm = NULL;
- spin_lock_init(&dev->work_lock);
- INIT_LIST_HEAD(&dev->work_list);
- dev->worker = NULL;
for (i = 0; i < dev->nvqs; ++i) {
- dev->vqs[i].dev = dev;
- mutex_init(&dev->vqs[i].mutex);
+ struct vhost_virtqueue *vq = &dev->vqs[i];
+
+ spin_lock_init(&dev->work_lock[i]);
+ INIT_LIST_HEAD(&dev->work_list[i]);
+
+ vq->work_lock = &dev->work_lock[i];
+ vq->work_list = &dev->work_list[i];
+
+ vq->worker = NULL;
+ vq->dev = dev;
+ vq->qnum = i;
+ mutex_init(&vq->mutex);
vhost_vq_reset(dev, dev->vqs + i);
- if (dev->vqs[i].handle_kick)
- vhost_poll_init(&dev->vqs[i].poll,
- dev->vqs[i].handle_kick, POLLIN, dev);
+ if (vq->handle_kick)
+ vhost_poll_init(&vq->poll, vq->handle_kick, POLLIN,
+ vq);
}
return 0;
@@ -236,38 +243,76 @@ long vhost_dev_check_owner(struct vhost_
return dev->mm == current->mm ? 0 : -EPERM;
}
+static void vhost_stop_workers(struct vhost_dev *dev)
+{
+ int i;
+
+ for (i = 0; i < dev->nvqs; i++) {
+ WARN_ON(!list_empty(dev->vqs[i].work_list));
+ kthread_stop(dev->vqs[i].worker);
+ }
+}
+
+static int vhost_start_workers(struct vhost_dev *dev)
+{
+ int i, err = 0;
+
+ for (i = 0; i < dev->nvqs; ++i) {
+ struct vhost_virtqueue *vq = &dev->vqs[i];
+
+ vq->worker = kthread_create(vhost_worker, vq, "vhost-%d-%d",
+ current->pid, i);
+ if (IS_ERR(vq->worker)) {
+ err = PTR_ERR(vq->worker);
+ i--; /* no thread to clean up at this index */
+ goto err;
+ }
+
+ err = cgroup_attach_task_current_cg(vq->worker);
+ if (err)
+ goto err;
+
+ /* wake up only now, to avoid contributing to loadavg */
+ wake_up_process(vq->worker);
+ }
+
+ return 0;
+
+err:
+ for (; i >= 0; i--)
+ kthread_stop(dev->vqs[i].worker);
+
+ return err;
+}
+
/* Caller should have device mutex */
-static long vhost_dev_set_owner(struct vhost_dev *dev)
+static long vhost_dev_set_owner(struct vhost_dev *dev, int numtxqs)
{
- struct task_struct *worker;
int err;
/* Is there an owner already? */
if (dev->mm) {
err = -EBUSY;
- goto err_mm;
- }
- /* No owner, become one */
- dev->mm = get_task_mm(current);
- worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
- if (IS_ERR(worker)) {
- err = PTR_ERR(worker);
- goto err_worker;
+ } else {
+ err = vhost_setup_vqs(dev, numtxqs);
+ if (err)
+ goto out;
+
+ /* No owner, become one */
+ dev->mm = get_task_mm(current);
+
+ /* Start daemons */
+ err = vhost_start_workers(dev);
+
+ if (err) {
+ vhost_free_vqs(dev);
+ if (dev->mm) {
+ mmput(dev->mm);
+ dev->mm = NULL;
+ }
+ }
}
- dev->worker = worker;
- err = cgroup_attach_task_current_cg(worker);
- if (err)
- goto err_cgroup;
- wake_up_process(worker); /* avoid contributing to loadavg */
-
- return 0;
-err_cgroup:
- kthread_stop(worker);
-err_worker:
- if (dev->mm)
- mmput(dev->mm);
- dev->mm = NULL;
-err_mm:
+out:
return err;
}
@@ -322,8 +367,7 @@ void vhost_dev_cleanup(struct vhost_dev
mmput(dev->mm);
dev->mm = NULL;
- WARN_ON(!list_empty(&dev->work_list));
- kthread_stop(dev->worker);
+ vhost_stop_workers(dev);
}
static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -674,7 +718,7 @@ long vhost_dev_ioctl(struct vhost_dev *d
/* If you are not the owner, you can become one */
if (ioctl == VHOST_SET_OWNER) {
- r = vhost_dev_set_owner(d);
+ r = vhost_dev_set_owner(d, arg);
goto done;
}
diff -ruNp org/drivers/vhost/vhost.h tx_only/drivers/vhost/vhost.h
--- org/drivers/vhost/vhost.h 2010-09-03 16:33:51.000000000 +0530
+++ tx_only/drivers/vhost/vhost.h 2010-09-08 10:20:54.000000000 +0530
@@ -40,11 +40,11 @@ struct vhost_poll {
wait_queue_t wait;
struct vhost_work work;
unsigned long mask;
- struct vhost_dev *dev;
+ struct vhost_virtqueue *vq; /* points back to vq */
};
void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
- unsigned long mask, struct vhost_dev *dev);
+ unsigned long mask, struct vhost_virtqueue *vq);
void vhost_poll_start(struct vhost_poll *poll, struct file *file);
void vhost_poll_stop(struct vhost_poll *poll);
void vhost_poll_flush(struct vhost_poll *poll);
@@ -110,6 +110,10 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log log[VHOST_NET_MAX_SG];
+ struct task_struct *worker; /* vhost for this vq, shared btwn RX/TX */
+ spinlock_t *work_lock;
+ struct list_head *work_list;
+ int qnum; /* 0 for RX, 1 -> n-1 for TX */
};
struct vhost_dev {
@@ -124,11 +128,12 @@ struct vhost_dev {
int nvqs;
struct file *log_file;
struct eventfd_ctx *log_ctx;
- spinlock_t work_lock;
- struct list_head work_list;
- struct task_struct *worker;
+ spinlock_t *work_lock;
+ struct list_head *work_list;
};
+int vhost_setup_vqs(struct vhost_dev *dev, int numtxqs);
+void vhost_free_vqs(struct vhost_dev *dev);
long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
long vhost_dev_check_owner(struct vhost_dev *);
long vhost_dev_reset_owner(struct vhost_dev *);
* [RFC PATCH 4/4] qemu changes
2010-09-08 7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-08 7:29 ` [RFC PATCH 3/4] Changes for vhost Krishna Kumar
@ 2010-09-08 7:29 ` Krishna Kumar
2010-09-08 7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
From: Krishna Kumar @ 2010-09-08 7:29 UTC (permalink / raw)
To: rusty, davem; +Cc: anthony, netdev, Krishna Kumar, kvm, mst
Changes in qemu to support mq TX.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
hw/vhost.c | 8 ++-
hw/vhost.h | 2
hw/vhost_net.c | 16 +++++--
hw/vhost_net.h | 2
hw/virtio-net.c | 97 ++++++++++++++++++++++++++++++----------------
hw/virtio-net.h | 5 ++
hw/virtio-pci.c | 2
net.c | 17 ++++++++
net.h | 1
net/tap.c | 61 +++++++++++++++++++++-------
10 files changed, 155 insertions(+), 56 deletions(-)
diff -ruNp org/hw/vhost.c new/hw/vhost.c
--- org/hw/vhost.c 2010-08-09 09:51:58.000000000 +0530
+++ new/hw/vhost.c 2010-09-08 12:54:50.000000000 +0530
@@ -599,23 +599,27 @@ static void vhost_virtqueue_cleanup(stru
0, virtio_queue_get_desc_size(vdev, idx));
}
-int vhost_dev_init(struct vhost_dev *hdev, int devfd)
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int numtxqs)
{
uint64_t features;
int r;
if (devfd >= 0) {
hdev->control = devfd;
+ hdev->nvqs = 2;
} else {
hdev->control = open("/dev/vhost-net", O_RDWR);
if (hdev->control < 0) {
return -errno;
}
}
- r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+
+ r = ioctl(hdev->control, VHOST_SET_OWNER, numtxqs);
if (r < 0) {
goto fail;
}
+ hdev->nvqs = numtxqs + 1;
+
r = ioctl(hdev->control, VHOST_GET_FEATURES, &features);
if (r < 0) {
goto fail;
diff -ruNp org/hw/vhost.h new/hw/vhost.h
--- org/hw/vhost.h 2010-07-01 11:42:09.000000000 +0530
+++ new/hw/vhost.h 2010-09-08 12:54:50.000000000 +0530
@@ -40,7 +40,7 @@ struct vhost_dev {
unsigned long long log_size;
};
-int vhost_dev_init(struct vhost_dev *hdev, int devfd);
+int vhost_dev_init(struct vhost_dev *hdev, int devfd, int nvqs);
void vhost_dev_cleanup(struct vhost_dev *hdev);
int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
diff -ruNp org/hw/vhost_net.c new/hw/vhost_net.c
--- org/hw/vhost_net.c 2010-08-09 09:51:58.000000000 +0530
+++ new/hw/vhost_net.c 2010-09-08 12:54:50.000000000 +0530
@@ -36,7 +36,8 @@
struct vhost_net {
struct vhost_dev dev;
- struct vhost_virtqueue vqs[2];
+ struct vhost_virtqueue *vqs;
+ int nvqs;
int backend;
VLANClientState *vc;
};
@@ -76,7 +77,8 @@ static int vhost_net_get_fd(VLANClientSt
}
}
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+ int numtxqs)
{
int r;
struct vhost_net *net = qemu_malloc(sizeof *net);
@@ -93,10 +95,14 @@ struct vhost_net *vhost_net_init(VLANCli
(1 << VHOST_NET_F_VIRTIO_NET_HDR);
net->backend = r;
- r = vhost_dev_init(&net->dev, devfd);
+ r = vhost_dev_init(&net->dev, devfd, numtxqs);
if (r < 0) {
goto fail;
}
+
+ net->nvqs = numtxqs + 1;
+ net->vqs = qemu_malloc(net->nvqs * (sizeof *net->vqs));
+
if (~net->dev.features & net->dev.backend_features) {
fprintf(stderr, "vhost lacks feature mask %" PRIu64 " for backend\n",
(uint64_t)(~net->dev.features & net->dev.backend_features));
@@ -118,7 +124,6 @@ int vhost_net_start(struct vhost_net *ne
struct vhost_vring_file file = { };
int r;
- net->dev.nvqs = 2;
net->dev.vqs = net->vqs;
r = vhost_dev_start(&net->dev, dev);
if (r < 0) {
@@ -166,7 +171,8 @@ void vhost_net_cleanup(struct vhost_net
qemu_free(net);
}
#else
-struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd)
+struct vhost_net *vhost_net_init(VLANClientState *backend, int devfd,
+ int nvqs)
{
return NULL;
}
diff -ruNp org/hw/vhost_net.h new/hw/vhost_net.h
--- org/hw/vhost_net.h 2010-07-01 11:42:09.000000000 +0530
+++ new/hw/vhost_net.h 2010-09-08 12:54:50.000000000 +0530
@@ -6,7 +6,7 @@
struct vhost_net;
typedef struct vhost_net VHostNetState;
-VHostNetState *vhost_net_init(VLANClientState *backend, int devfd);
+VHostNetState *vhost_net_init(VLANClientState *backend, int devfd, int nvqs);
int vhost_net_start(VHostNetState *net, VirtIODevice *dev);
void vhost_net_stop(VHostNetState *net, VirtIODevice *dev);
diff -ruNp org/hw/virtio-net.c new/hw/virtio-net.c
--- org/hw/virtio-net.c 2010-07-19 12:41:28.000000000 +0530
+++ new/hw/virtio-net.c 2010-09-08 12:54:50.000000000 +0530
@@ -32,17 +32,17 @@ typedef struct VirtIONet
uint8_t mac[ETH_ALEN];
uint16_t status;
VirtQueue *rx_vq;
- VirtQueue *tx_vq;
+ VirtQueue **tx_vq;
VirtQueue *ctrl_vq;
NICState *nic;
- QEMUTimer *tx_timer;
- int tx_timer_active;
+ QEMUTimer **tx_timer;
+ int *tx_timer_active;
uint32_t has_vnet_hdr;
uint8_t has_ufo;
struct {
VirtQueueElement elem;
ssize_t len;
- } async_tx;
+ } *async_tx;
int mergeable_rx_bufs;
uint8_t promisc;
uint8_t allmulti;
@@ -61,6 +61,7 @@ typedef struct VirtIONet
} mac_table;
uint32_t *vlans;
DeviceState *qdev;
+ uint16_t numtxqs;
} VirtIONet;
/* TODO
@@ -78,6 +79,7 @@ static void virtio_net_get_config(VirtIO
struct virtio_net_config netcfg;
netcfg.status = n->status;
+ netcfg.numtxqs = n->numtxqs;
memcpy(netcfg.mac, n->mac, ETH_ALEN);
memcpy(config, &netcfg, sizeof(netcfg));
}
@@ -162,6 +164,8 @@ static uint32_t virtio_net_get_features(
VirtIONet *n = to_virtio_net(vdev);
features |= (1 << VIRTIO_NET_F_MAC);
+ if (n->numtxqs > 1)
+ features |= (1 << VIRTIO_NET_F_NUMTXQS);
if (peer_has_vnet_hdr(n)) {
tap_using_vnet_hdr(n->nic->nc.peer, 1);
@@ -625,13 +629,16 @@ static void virtio_net_tx_complete(VLANC
{
VirtIONet *n = DO_UPCAST(NICState, nc, nc)->opaque;
- virtqueue_push(n->tx_vq, &n->async_tx.elem, n->async_tx.len);
- virtio_notify(&n->vdev, n->tx_vq);
+ /*
+ * If this function executes, we are single TX and hence use only txq[0]
+ */
+ virtqueue_push(n->tx_vq[0], &n->async_tx[0].elem, n->async_tx[0].len);
+ virtio_notify(&n->vdev, n->tx_vq[0]);
- n->async_tx.elem.out_num = n->async_tx.len = 0;
+ n->async_tx[0].elem.out_num = n->async_tx[0].len = 0;
- virtio_queue_set_notification(n->tx_vq, 1);
- virtio_net_flush_tx(n, n->tx_vq);
+ virtio_queue_set_notification(n->tx_vq[0], 1);
+ virtio_net_flush_tx(n, n->tx_vq[0]);
}
/* TX */
@@ -642,8 +649,8 @@ static void virtio_net_flush_tx(VirtIONe
if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
return;
- if (n->async_tx.elem.out_num) {
- virtio_queue_set_notification(n->tx_vq, 0);
+ if (n->async_tx[0].elem.out_num) {
+ virtio_queue_set_notification(n->tx_vq[0], 0);
return;
}
@@ -678,9 +685,9 @@ static void virtio_net_flush_tx(VirtIONe
ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
virtio_net_tx_complete);
if (ret == 0) {
- virtio_queue_set_notification(n->tx_vq, 0);
- n->async_tx.elem = elem;
- n->async_tx.len = len;
+ virtio_queue_set_notification(n->tx_vq[0], 0);
+ n->async_tx[0].elem = elem;
+ n->async_tx[0].len = len;
return;
}
@@ -695,15 +702,15 @@ static void virtio_net_handle_tx(VirtIOD
{
VirtIONet *n = to_virtio_net(vdev);
- if (n->tx_timer_active) {
+ if (n->tx_timer_active[0]) {
virtio_queue_set_notification(vq, 1);
- qemu_del_timer(n->tx_timer);
- n->tx_timer_active = 0;
+ qemu_del_timer(n->tx_timer[0]);
+ n->tx_timer_active[0] = 0;
virtio_net_flush_tx(n, vq);
} else {
- qemu_mod_timer(n->tx_timer,
+ qemu_mod_timer(n->tx_timer[0],
qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
- n->tx_timer_active = 1;
+ n->tx_timer_active[0] = 1;
virtio_queue_set_notification(vq, 0);
}
}
@@ -712,18 +719,19 @@ static void virtio_net_tx_timer(void *op
{
VirtIONet *n = opaque;
- n->tx_timer_active = 0;
+ n->tx_timer_active[0] = 0;
/* Just in case the driver is not ready on more */
if (!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
return;
- virtio_queue_set_notification(n->tx_vq, 1);
- virtio_net_flush_tx(n, n->tx_vq);
+ virtio_queue_set_notification(n->tx_vq[0], 1);
+ virtio_net_flush_tx(n, n->tx_vq[0]);
}
static void virtio_net_save(QEMUFile *f, void *opaque)
{
+ int i;
VirtIONet *n = opaque;
if (n->vhost_started) {
@@ -735,7 +743,9 @@ static void virtio_net_save(QEMUFile *f,
virtio_save(&n->vdev, f);
qemu_put_buffer(f, n->mac, ETH_ALEN);
- qemu_put_be32(f, n->tx_timer_active);
+ qemu_put_be16(f, n->numtxqs);
+ for (i = 0; i < n->numtxqs; i++)
+ qemu_put_be32(f, n->tx_timer_active[i]);
qemu_put_be32(f, n->mergeable_rx_bufs);
qemu_put_be16(f, n->status);
qemu_put_byte(f, n->promisc);
@@ -764,7 +774,9 @@ static int virtio_net_load(QEMUFile *f,
virtio_load(&n->vdev, f);
qemu_get_buffer(f, n->mac, ETH_ALEN);
- n->tx_timer_active = qemu_get_be32(f);
+ n->numtxqs = qemu_get_be16(f);
+ for (i = 0; i < n->numtxqs; i++)
+ n->tx_timer_active[i] = qemu_get_be32(f);
n->mergeable_rx_bufs = qemu_get_be32(f);
if (version_id >= 3)
@@ -840,9 +852,10 @@ static int virtio_net_load(QEMUFile *f,
}
n->mac_table.first_multi = i;
- if (n->tx_timer_active) {
- qemu_mod_timer(n->tx_timer,
- qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
+ for (i = 0; i < n->numtxqs; i++) {
+ if (n->tx_timer_active[i])
+ qemu_mod_timer(n->tx_timer[i],
+ qemu_get_clock(vm_clock) + TX_TIMER_INTERVAL);
}
return 0;
}
@@ -905,12 +918,15 @@ static void virtio_net_vmstate_change(vo
VirtIODevice *virtio_net_init(DeviceState *dev, NICConf *conf)
{
+ int i;
VirtIONet *n;
n = (VirtIONet *)virtio_common_init("virtio-net", VIRTIO_ID_NET,
sizeof(struct virtio_net_config),
sizeof(VirtIONet));
+ n->numtxqs = conf->peer->numtxqs;
+
n->vdev.get_config = virtio_net_get_config;
n->vdev.set_config = virtio_net_set_config;
n->vdev.get_features = virtio_net_get_features;
@@ -918,8 +934,24 @@ VirtIODevice *virtio_net_init(DeviceStat
n->vdev.bad_features = virtio_net_bad_features;
n->vdev.reset = virtio_net_reset;
n->vdev.set_status = virtio_net_set_status;
+
n->rx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_rx);
- n->tx_vq = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+ n->tx_vq = qemu_mallocz(n->numtxqs * sizeof(*n->tx_vq));
+ n->tx_timer = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer));
+ n->tx_timer_active = qemu_mallocz(n->numtxqs * sizeof(*n->tx_timer_active));
+ n->async_tx = qemu_mallocz(n->numtxqs * sizeof(*n->async_tx));
+
+ /* Allocate per tx vq's */
+ for (i = 0; i < n->numtxqs; i++) {
+ n->tx_vq[i] = virtio_add_queue(&n->vdev, 256, virtio_net_handle_tx);
+
+ /* setup timer per tx vq */
+ n->tx_timer[i] = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
+ n->tx_timer_active[i] = 0;
+ }
+
+ /* Allocate control vq */
n->ctrl_vq = virtio_add_queue(&n->vdev, 64, virtio_net_handle_ctrl);
qemu_macaddr_default_if_unset(&conf->macaddr);
memcpy(&n->mac[0], &conf->macaddr, sizeof(n->mac));
@@ -929,8 +961,6 @@ VirtIODevice *virtio_net_init(DeviceStat
qemu_format_nic_info_str(&n->nic->nc, conf->macaddr.a);
- n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
- n->tx_timer_active = 0;
n->mergeable_rx_bufs = 0;
n->promisc = 1; /* for compatibility */
@@ -948,6 +978,7 @@ VirtIODevice *virtio_net_init(DeviceStat
void virtio_net_exit(VirtIODevice *vdev)
{
+ int i;
VirtIONet *n = DO_UPCAST(VirtIONet, vdev, vdev);
qemu_del_vm_change_state_handler(n->vmstate);
@@ -962,8 +993,10 @@ void virtio_net_exit(VirtIODevice *vdev)
qemu_free(n->mac_table.macs);
qemu_free(n->vlans);
- qemu_del_timer(n->tx_timer);
- qemu_free_timer(n->tx_timer);
+ for (i = 0; i < n->numtxqs; i++) {
+ qemu_del_timer(n->tx_timer[i]);
+ qemu_free_timer(n->tx_timer[i]);
+ }
virtio_cleanup(&n->vdev);
qemu_del_vlan_client(&n->nic->nc);
diff -ruNp org/hw/virtio-net.h new/hw/virtio-net.h
--- org/hw/virtio-net.h 2010-07-01 11:42:09.000000000 +0530
+++ new/hw/virtio-net.h 2010-09-08 12:54:50.000000000 +0530
@@ -22,6 +22,9 @@
/* from Linux's virtio_net.h */
+/* The maximum of transmit (& separate receive) queues supported */
+#define VIRTIO_MAX_TXQS 16
+
/* The ID for virtio_net */
#define VIRTIO_ID_NET 1
@@ -44,6 +47,7 @@
#define VIRTIO_NET_F_CTRL_RX 18 /* Control channel RX mode support */
#define VIRTIO_NET_F_CTRL_VLAN 19 /* Control channel VLAN filtering */
#define VIRTIO_NET_F_CTRL_RX_EXTRA 20 /* Extra RX mode control support */
+#define VIRTIO_NET_F_NUMTXQS 21 /* Supports multiple TX queues */
#define VIRTIO_NET_S_LINK_UP 1 /* Link is up */
@@ -58,6 +62,7 @@ struct virtio_net_config
uint8_t mac[ETH_ALEN];
/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
uint16_t status;
+ uint16_t numtxqs; /* number of transmit queues */
} __attribute__((packed));
/* This is the first element of the scatter-gather list. If you don't
diff -ruNp org/hw/virtio-pci.c new/hw/virtio-pci.c
--- org/hw/virtio-pci.c 2010-09-08 12:46:36.000000000 +0530
+++ new/hw/virtio-pci.c 2010-09-08 12:54:50.000000000 +0530
@@ -99,6 +99,7 @@ typedef struct {
uint32_t addr;
uint32_t class_code;
uint32_t nvectors;
+ uint32_t mq;
BlockConf block;
NICConf nic;
uint32_t host_features;
@@ -722,6 +723,7 @@ static PCIDeviceInfo virtio_info[] = {
.romfile = "pxe-virtio.bin",
.qdev.props = (Property[]) {
DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 3),
+ DEFINE_PROP_UINT32("mq", VirtIOPCIProxy, mq, 1),
DEFINE_VIRTIO_NET_FEATURES(VirtIOPCIProxy, host_features),
DEFINE_NIC_PROPERTIES(VirtIOPCIProxy, nic),
DEFINE_PROP_END_OF_LIST(),
diff -ruNp org/net/tap.c new/net/tap.c
--- org/net/tap.c 2010-07-01 11:42:09.000000000 +0530
+++ new/net/tap.c 2010-09-08 12:54:50.000000000 +0530
@@ -249,7 +249,7 @@ void tap_set_offload(VLANClientState *nc
{
TAPState *s = DO_UPCAST(TAPState, nc, nc);
- return tap_fd_set_offload(s->fd, csum, tso4, tso6, ecn, ufo);
+ tap_fd_set_offload(s->fd, csum, tso4, tso6, ecn, ufo);
}
static void tap_cleanup(VLANClientState *nc)
@@ -262,8 +262,9 @@ static void tap_cleanup(VLANClientState
qemu_purge_queued_packets(nc);
- if (s->down_script[0])
+ if (s->down_script[0]) {
launch_script(s->down_script, s->down_script_arg, s->fd);
+ }
tap_read_poll(s, 0);
tap_write_poll(s, 0);
@@ -299,13 +300,14 @@ static NetClientInfo net_tap_info = {
static TAPState *net_tap_fd_init(VLANState *vlan,
const char *model,
const char *name,
- int fd,
+ int fd, int numtxqs,
int vnet_hdr)
{
VLANClientState *nc;
TAPState *s;
nc = qemu_new_net_client(&net_tap_info, vlan, NULL, model, name);
+ nc->numtxqs = numtxqs;
s = DO_UPCAST(TAPState, nc, nc);
@@ -368,6 +370,7 @@ static int net_tap_init(QemuOpts *opts,
int fd, vnet_hdr_required;
char ifname[128] = {0,};
const char *setup_script;
+ int launch = 0;
if (qemu_opt_get(opts, "ifname")) {
pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, "ifname"));
@@ -380,29 +383,57 @@ static int net_tap_init(QemuOpts *opts,
vnet_hdr_required = 0;
}
- TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr, vnet_hdr_required));
- if (fd < 0) {
- return -1;
- }
-
setup_script = qemu_opt_get(opts, "script");
if (setup_script &&
setup_script[0] != '\0' &&
- strcmp(setup_script, "no") != 0 &&
- launch_script(setup_script, ifname, fd)) {
- close(fd);
+ strcmp(setup_script, "no") != 0) {
+ launch = 1;
+ }
+
+ TFR(fd = tap_open(ifname, sizeof(ifname), vnet_hdr,
+ vnet_hdr_required));
+ if (fd < 0) {
return -1;
}
+ if (launch && launch_script(setup_script, ifname, fd))
+ goto err;
+
qemu_opt_set(opts, "ifname", ifname);
return fd;
+
+err:
+ close(fd);
+
+ return -1;
}
int net_init_tap(QemuOpts *opts, Monitor *mon, const char *name, VLANState *vlan)
{
TAPState *s;
int fd, vnet_hdr = 0;
+ int vhost;
+ int numtxqs = 1;
+
+ vhost = qemu_opt_get_bool(opts, "vhost", 0);
+
+ /*
+ * We support multiple tx queues if:
+ * 1. smp > 1
+ * 2. vhost=on
+ * 3. mq=on
+ * In this case, #txqueues = #cpus. This value can be changed by
+ * using the "numtxqs" option.
+ */
+ if (vhost && smp_cpus > 1) {
+ if (qemu_opt_get_bool(opts, "mq", 0)) {
+#define VIRTIO_MAX_TXQS 16
+ int dflt = MIN(smp_cpus, VIRTIO_MAX_TXQS);
+
+ numtxqs = qemu_opt_get_number(opts, "numtxqs", dflt);
+ }
+ }
if (qemu_opt_get(opts, "fd")) {
if (qemu_opt_get(opts, "ifname") ||
@@ -436,14 +467,14 @@ int net_init_tap(QemuOpts *opts, Monitor
}
}
- s = net_tap_fd_init(vlan, "tap", name, fd, vnet_hdr);
+ s = net_tap_fd_init(vlan, "tap", name, fd, numtxqs, vnet_hdr);
if (!s) {
close(fd);
return -1;
}
if (tap_set_sndbuf(s->fd, opts) < 0) {
- return -1;
+ return -1;
}
if (qemu_opt_get(opts, "fd")) {
@@ -465,7 +496,7 @@ int net_init_tap(QemuOpts *opts, Monitor
}
}
- if (qemu_opt_get_bool(opts, "vhost", !!qemu_opt_get(opts, "vhostfd"))) {
+ if (vhost) {
int vhostfd, r;
if (qemu_opt_get(opts, "vhostfd")) {
r = net_handle_fd_param(mon, qemu_opt_get(opts, "vhostfd"));
@@ -476,7 +507,7 @@ int net_init_tap(QemuOpts *opts, Monitor
} else {
vhostfd = -1;
}
- s->vhost_net = vhost_net_init(&s->nc, vhostfd);
+ s->vhost_net = vhost_net_init(&s->nc, vhostfd, numtxqs);
if (!s->vhost_net) {
error_report("vhost-net requested but could not be initialized");
return -1;
diff -ruNp org/net.c new/net.c
--- org/net.c 2010-09-08 12:46:36.000000000 +0530
+++ new/net.c 2010-09-08 12:54:50.000000000 +0530
@@ -814,6 +814,15 @@ static int net_init_nic(QemuOpts *opts,
return -1;
}
+ if (nd->netdev->numtxqs > 1 && nd->nvectors == DEV_NVECTORS_UNSPECIFIED) {
+ /*
+ * User specified mq for guest, but no "vectors=", tune
+ * it automatically to 'numtxqs' TX + 1 RX + 1 controlq.
+ */
+ nd->nvectors = nd->netdev->numtxqs + 1 + 1;
+ monitor_printf(mon, "nvectors tuned to %d\n", nd->nvectors);
+ }
+
nd->used = 1;
nb_nics++;
@@ -957,6 +966,14 @@ static const struct {
},
#ifndef _WIN32
{
+ .name = "mq",
+ .type = QEMU_OPT_BOOL,
+ .help = "enable multiqueue on network i/f",
+ }, {
+ .name = "numtxqs",
+ .type = QEMU_OPT_NUMBER,
+ .help = "optional number of TX queues, if mq is enabled",
+ }, {
.name = "fd",
.type = QEMU_OPT_STRING,
.help = "file descriptor of an already opened tap",
diff -ruNp org/net.h new/net.h
--- org/net.h 2010-07-01 11:42:09.000000000 +0530
+++ new/net.h 2010-09-08 12:54:50.000000000 +0530
@@ -62,6 +62,7 @@ struct VLANClientState {
struct VLANState *vlan;
VLANClientState *peer;
NetQueue *send_queue;
+ int numtxqs;
char *model;
char *name;
char info_str[256];
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-08 7:29 ` [RFC PATCH 4/4] qemu changes Krishna Kumar
@ 2010-09-08 7:47 ` Avi Kivity
2010-09-08 9:22 ` Krishna Kumar2
2010-09-08 8:10 ` Michael S. Tsirkin
2010-09-08 8:13 ` Michael S. Tsirkin
From: Avi Kivity @ 2010-09-08 7:47 UTC (permalink / raw)
To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony, mst
On 09/08/2010 10:28 AM, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net. Also
> included is the user qemu changes.
>
> 1. This feature was first implemented with a single vhost.
> Testing showed 3-8% performance gain for up to 8 netperf
> sessions (and sometimes 16), but BW dropped with more
> sessions. However, implementing per-txq vhost improved
> BW significantly all the way to 128 sessions.
Why were vhost kernel changes required? Can't you just instantiate more
vhost queues?
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
> daemons for the 'n' TXQ's, for a total of (n+1) daemons.
> The (subsequent) RX mq patch changes that to a total of
> 'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
> improves for UDP.
> 4. Interoperability: Many combinations, but not all, of
> qemu, host, guest tested together.
Please update the virtio-pci spec @ http://ozlabs.org/~rusty/virtio-spec/.
>
> Enabling mq on virtio:
> -----------------------
>
> When following options are passed to qemu:
> - smp > 1
> - vhost=on
> - mq=on (new option, default:off)
> then #txqueues = #cpus. The #txqueues can be changed by using
> an optional 'numtxqs' option. e.g. for a smp=4 guest:
> vhost=on,mq=on -> #txqueues = 4
> vhost=on,mq=on,numtxqs=8 -> #txqueues = 8
> vhost=on,mq=on,numtxqs=2 -> #txqueues = 2
>
>
> Performance (guest -> local host):
> -----------------------------------
>
> System configuration:
> Host: 8 Intel Xeon, 8 GB memory
> Guest: 4 cpus, 2 GB memory
> All testing without any tuning, and TCP netperf with 64K I/O
> _______________________________________________________________________________
> TCP (#numtxqs=2)
> N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
> _______________________________________________________________________________
> 4 26387 40716 (54.30) 20 28 (40.00) 86 85 (-1.16)
> 8 24356 41843 (71.79) 88 129 (46.59) 372 362 (-2.68)
> 16 23587 40546 (71.89) 375 564 (50.40) 1558 1519 (-2.50)
> 32 22927 39490 (72.24) 1617 2171 (34.26) 6694 5722 (-14.52)
> 48 23067 39238 (70.10) 3931 5170 (31.51) 15823 13552 (-14.35)
> 64 22927 38750 (69.01) 7142 9914 (38.81) 28972 26173 (-9.66)
> 96 22568 38520 (70.68) 16258 27844 (71.26) 65944 73031 (10.74)
> _______________________________________________________________________________
> UDP (#numtxqs=8)
> N# BW1 BW2 (%) SD1 SD2 (%)
> __________________________________________________________
> 4 29836 56761 (90.24) 67 63 (-5.97)
> 8 27666 63767 (130.48) 326 265 (-18.71)
> 16 25452 60665 (138.35) 1396 1269 (-9.09)
> 32 26172 63491 (142.59) 5617 4202 (-25.19)
> 48 26146 64629 (147.18) 12813 9316 (-27.29)
> 64 25575 65448 (155.90) 23063 16346 (-29.12)
> 128 26454 63772 (141.06) 91054 85051 (-6.59)
Impressive results.
> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
> SD for original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
> SD for new code. e.g. BW2=40716 means average BW2 was
> 20358 mbps.
>
>
> Next steps:
> -----------
>
> 1. mq RX patch is also complete - plan to submit once TX is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
> after making the sq's (and similarly for vhost) cache-aligned
> statically:
> struct virtnet_info {
> ...
> struct send_queue sq[16] ____cacheline_aligned_in_smp;
> ...
> };
>
> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 40: 0 0 0 0 PCI-MSI-edge virtio0-config
> 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
> 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
> 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
> 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
> 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
How are vhost threads and host interrupts distributed? We need to move
vhost queue threads to be colocated with the related vcpu threads (if no
extra cores are available) or on the same socket (if extra cores are
available). Similarly, move device interrupts to the same core as the
vhost thread.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-08 7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
@ 2010-09-08 8:10 ` Michael S. Tsirkin
2010-09-08 9:23 ` Krishna Kumar2
2010-09-08 8:13 ` Michael S. Tsirkin
6 siblings, 3 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08 8:10 UTC (permalink / raw)
To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony
On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net. Also
> included is the user qemu changes.
>
> 1. This feature was first implemented with a single vhost.
> Testing showed 3-8% performance gain for upto 8 netperf
> sessions (and sometimes 16), but BW dropped with more
> sessions. However, implementing per-txq vhost improved
> BW significantly all the way to 128 sessions.
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
> daemons for the 'n' TXQ's, for a total of (n+1) daemons.
> The (subsequent) RX mq patch changes that to a total of
> 'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
> improves for UDP.
> 4. Interoperability: Many combinations, but not all, of
> qemu, host, guest tested together.
>
>
> Enabling mq on virtio:
> -----------------------
>
> When following options are passed to qemu:
> - smp > 1
> - vhost=on
> - mq=on (new option, default:off)
> then #txqueues = #cpus. The #txqueues can be changed by using
> an optional 'numtxqs' option. e.g. for a smp=4 guest:
> vhost=on,mq=on -> #txqueues = 4
> vhost=on,mq=on,numtxqs=8 -> #txqueues = 8
> vhost=on,mq=on,numtxqs=2 -> #txqueues = 2
>
>
> Performance (guest -> local host):
> -----------------------------------
>
> System configuration:
> Host: 8 Intel Xeon, 8 GB memory
> Guest: 4 cpus, 2 GB memory
> All testing without any tuning, and TCP netperf with 64K I/O
> _______________________________________________________________________________
> TCP (#numtxqs=2)
> N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
> _______________________________________________________________________________
> 4 26387 40716 (54.30) 20 28 (40.00) 86 85 (-1.16)
> 8 24356 41843 (71.79) 88 129 (46.59) 372 362 (-2.68)
> 16 23587 40546 (71.89) 375 564 (50.40) 1558 1519 (-2.50)
> 32 22927 39490 (72.24) 1617 2171 (34.26) 6694 5722 (-14.52)
> 48 23067 39238 (70.10) 3931 5170 (31.51) 15823 13552 (-14.35)
> 64 22927 38750 (69.01) 7142 9914 (38.81) 28972 26173 (-9.66)
> 96 22568 38520 (70.68) 16258 27844 (71.26) 65944 73031 (10.74)
That's a significant hit in TCP SD. Is it caused by the imbalance between
number of queues for TX and RX? Since you mention RX is complete,
maybe measure with a balanced TX/RX?
> _______________________________________________________________________________
> UDP (#numtxqs=8)
> N# BW1 BW2 (%) SD1 SD2 (%)
> __________________________________________________________
> 4 29836 56761 (90.24) 67 63 (-5.97)
> 8 27666 63767 (130.48) 326 265 (-18.71)
> 16 25452 60665 (138.35) 1396 1269 (-9.09)
> 32 26172 63491 (142.59) 5617 4202 (-25.19)
> 48 26146 64629 (147.18) 12813 9316 (-27.29)
> 64 25575 65448 (155.90) 23063 16346 (-29.12)
> 128 26454 63772 (141.06) 91054 85051 (-6.59)
> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
> SD for original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
> SD for new code. e.g. BW2=40716 means average BW2 was
> 20358 mbps.
>
What happens with a single netperf?
host -> guest performance with TCP and small packet speed
are also worth measuring.
> Next steps:
> -----------
>
> 1. mq RX patch is also complete - plan to submit once TX is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
> after making the sq's (and similarly for vhost) cache-aligned
> statically:
> struct virtnet_info {
> ...
> struct send_queue sq[16] ____cacheline_aligned_in_smp;
> ...
> };
>
At some level, host/guest communication is easy in that we don't really
care which queue is used. I would like to give some thought (and
testing) to how this is going to work with a real NIC card and packet
steering at the backend.
Any idea?
> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 40: 0 0 0 0 PCI-MSI-edge virtio0-config
> 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
> 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
> 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
> 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
> 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
Does this mean each interrupt is constantly bouncing between CPUs?
> Review/feedback appreciated.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> ---
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
` (5 preceding siblings ...)
2010-09-08 8:10 ` Michael S. Tsirkin
@ 2010-09-08 8:13 ` Michael S. Tsirkin
2010-09-08 9:28 ` Krishna Kumar2
6 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08 8:13 UTC (permalink / raw)
To: Krishna Kumar; +Cc: rusty, davem, netdev, kvm, anthony
On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> 1. mq RX patch is also complete - plan to submit once TX is OK.
It's good that you split the patches; I think it would be interesting
to see the RX patches at least once to complete the picture.
You could make them a separate patchset, tagged as RFC.
--
MST
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
@ 2010-09-08 9:22 ` Krishna Kumar2
2010-09-08 9:28 ` Avi Kivity
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 9:22 UTC (permalink / raw)
To: Avi Kivity; +Cc: anthony, davem, kvm, mst, netdev, rusty
Avi Kivity <avi@redhat.com> wrote on 09/08/2010 01:17:34 PM:
> On 09/08/2010 10:28 AM, Krishna Kumar wrote:
> > Following patches implement Transmit mq in virtio-net. Also
> > included is the user qemu changes.
> >
> > 1. This feature was first implemented with a single vhost.
> > Testing showed 3-8% performance gain for upto 8 netperf
> > sessions (and sometimes 16), but BW dropped with more
> > sessions. However, implementing per-txq vhost improved
> > BW significantly all the way to 128 sessions.
>
> Why were vhost kernel changes required? Can't you just instantiate more
> vhost queues?
I did try using a single thread processing packets from multiple
vq's on the host, but the BW dropped beyond a certain number of
sessions. I don't have the code and performance numbers for that
right now since it is a bit ancient; I can try to resuscitate
it if you want.
> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> > CPU0 CPU1 CPU2 CPU3
> > 40: 0 0 0 0 PCI-MSI-edge virtio0-config
> > 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
> > 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
> > 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
> > 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
> > 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
>
> How are vhost threads and host interrupts distributed? We need to move
> vhost queue threads to be colocated with the related vcpu threads (if no
> extra cores are available) or on the same socket (if extra cores are
> available). Similarly, move device interrupts to the same core as the
> vhost thread.
All my testing was without any tuning, including binding netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
the above might give better results? Are you suggesting this
combination:
IRQ on guest:
40: CPU0
41: CPU1
42: CPU2
43: CPU3 (all CPUs are on socket #0)
vhost:
thread #0: CPU0
thread #1: CPU1
thread #2: CPU2
thread #3: CPU3
qemu:
thread #0: CPU4
thread #1: CPU5
thread #2: CPU6
thread #3: CPU7 (all CPUs are on socket#1)
netperf/netserver:
Run on CPUs 0-4 on both sides
The reason I did not optimize anything from user space is that
I felt it was important to show that the defaults work reasonably well.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 8:10 ` Michael S. Tsirkin
@ 2010-09-08 9:23 ` Krishna Kumar2
2010-09-08 10:48 ` Michael S. Tsirkin
2010-09-08 16:47 ` Krishna Kumar2
[not found] ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
2 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 9:23 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty, rick.jones2
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:
>
> _______________________________________________________________________________
> > TCP (#numtxqs=2)
> > N#  BW1    BW2    (%)      SD1    SD2    (%)      RSD1   RSD2   (%)
> _______________________________________________________________________________
> > 4   26387  40716  (54.30)  20     28     (40.00)  86     85     (-1.16)
> > 8   24356  41843  (71.79)  88     129    (46.59)  372    362    (-2.68)
> > 16  23587  40546  (71.89)  375    564    (50.40)  1558   1519   (-2.50)
> > 32  22927  39490  (72.24)  1617   2171   (34.26)  6694   5722   (-14.52)
> > 48  23067  39238  (70.10)  3931   5170   (31.51)  15823  13552  (-14.35)
> > 64  22927  38750  (69.01)  7142   9914   (38.81)  28972  26173  (-9.66)
> > 96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
>
> That's a significant hit in TCP SD. Is it caused by the imbalance between
> number of queues for TX and RX? Since you mention RX is complete,
> maybe measure with a balanced TX/RX?
Yes, I am not sure why it is so high. I found the same with #RX=#TX
too. As a hack, I tried ixgbe without MQ (set "indices=1" before
calling alloc_etherdev_mq, not sure if that is entirely correct) -
here too SD worsened by around 40%. I can't explain it, since the
virtio-net driver runs lock-free once sch_direct_xmit gets
HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not
strictly correct, since more threads are now running in parallel and
the load is higher. E.g., if you compare SD between #netperfs = 8 and
16 for the original code (cut-n-paste of the relevant columns
only) ...
N# BW SD
8 24356 88
16 23587 375
... SD has increased more than 4 times for the same BW.
> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.
OK, I will do this and send the results later today.
> At some level, host/guest communication is easy in that we don't really
> care which queue is used. I would like to give some thought (and
> testing) to how is this going to work with a real NIC card and packet
> steering at the backend.
> Any idea?
I have done a little testing with guest -> remote server, both
using a bridge and with macvtap (mq is required only for rx).
I didn't understand what you meant by packet steering though -
do you mean whether packets go out of the NIC on different queues?
If so, I verified that is the case by adding a counter and
displaying it through the /debug interface on the host. dev_queue_xmit
on the host handles it by calling dev_pick_tx().
> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> > CPU0 CPU1 CPU2 CPU3
> > 40: 0 0 0 0 PCI-MSI-edge virtio0-config
> > 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
> > 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
> > 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
> > 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
> > 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
>
> Does this mean each interrupt is constantly bouncing between CPUs?
Yes. I didn't do *any* tuning for the tests. The only "tuning"
was to use a 64K I/O size with netperf. When I ran default netperf
(16K), I got a somewhat smaller improvement in BW and worse(!) SD
than with 64K.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 9:22 ` Krishna Kumar2
@ 2010-09-08 9:28 ` Avi Kivity
2010-09-08 10:17 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-09-08 9:28 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev, rusty
On 09/08/2010 12:22 PM, Krishna Kumar2 wrote:
> Avi Kivity<avi@redhat.com> wrote on 09/08/2010 01:17:34 PM:
>
>> On 09/08/2010 10:28 AM, Krishna Kumar wrote:
>>> Following patches implement Transmit mq in virtio-net. Also
>>> included is the user qemu changes.
>>>
>>> 1. This feature was first implemented with a single vhost.
>>> Testing showed 3-8% performance gain for upto 8 netperf
>>> sessions (and sometimes 16), but BW dropped with more
>>> sessions. However, implementing per-txq vhost improved
>>> BW significantly all the way to 128 sessions.
>> Why were vhost kernel changes required? Can't you just instantiate more
>> vhost queues?
> I did try using a single thread processing packets from multiple
> vq's on host, but the BW dropped beyond a certain number of
> sessions.
Oh - so the interface has not changed (which can be seen from the
patch). That was my concern, I remembered that we planned for vhost-net
to be multiqueue-ready.
The new guest and qemu code work with old vhost-net, just with reduced
performance, yes?
> I don't have the code and performance numbers for that
> right now since it is a bit ancient, I can try to resuscitate
> that if you want.
No need.
>>> Guest interrupts for a 4 TXQ device after a 5 min test:
>>> # egrep "virtio0|CPU" /proc/interrupts
>>> CPU0 CPU1 CPU2 CPU3
>>> 40: 0 0 0 0 PCI-MSI-edge virtio0-config
>>> 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
>>> 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
>>> 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
>>> 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
>>> 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
>> How are vhost threads and host interrupts distributed? We need to move
>> vhost queue threads to be colocated with the related vcpu threads (if no
>> extra cores are available) or on the same socket (if extra cores are
>> available). Similarly, move device interrupts to the same core as the
>> vhost thread.
> All my testing was without any tuning, including binding netperf&
> netserver (irqbalance is also off). I assume (maybe wrongly) that
> the above might give better results?
I hope so!
> Are you suggesting this
> combination:
> IRQ on guest:
> 40: CPU0
> 41: CPU1
> 42: CPU2
> 43: CPU3 (all CPUs are on socket #0)
> vhost:
> thread #0: CPU0
> thread #1: CPU1
> thread #2: CPU2
> thread #3: CPU3
> qemu:
> thread #0: CPU4
> thread #1: CPU5
> thread #2: CPU6
> thread #3: CPU7 (all CPUs are on socket#1)
May be better to put vcpu threads and vhost threads on the same socket.
Also need to affine host interrupts.
> netperf/netserver:
> Run on CPUs 0-4 on both sides
>
> The reason I did not optimize anything from user space is because
> I felt showing the default works reasonably well is important.
Definitely. Heavy tuning is not a useful path for general end users.
We need to make sure that the scheduler is able to arrive at the optimal
layout without pinning (but perhaps with hints).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 8:13 ` Michael S. Tsirkin
@ 2010-09-08 9:28 ` Krishna Kumar2
0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 9:28 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty
Hi Michael,
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:43:26 PM:
> On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> > 1. mq RX patch is also complete - plan to submit once TX is OK.
>
> It's good that you split patches, I think it would be interesting to see
> the RX patches at least once to complete the picture.
> You could make it a separate patchset, tag them as RFC.
OK, I need to re-do some parts of it, since I started the TX-only
branch a couple of weeks earlier and the RX side is outdated. I
will try to send that out in the next couple of days; as you say,
it will help to complete the picture. Reasons to send only the TX
part now:
- Reduce size of patch and complexity
- I didn't get much improvement with the multiple RX patch (netperf
  from host -> guest), so I needed some time to figure out the reason
  and fix it.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 9:28 ` Avi Kivity
@ 2010-09-08 10:17 ` Krishna Kumar2
2010-09-08 14:12 ` Arnd Bergmann
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 10:17 UTC (permalink / raw)
To: Avi Kivity; +Cc: anthony, davem, kvm, mst, netdev, rusty
Avi Kivity <avi@redhat.com> wrote on 09/08/2010 02:58:21 PM:
> >>> 1. This feature was first implemented with a single vhost.
> >>> Testing showed 3-8% performance gain for upto 8 netperf
> >>> sessions (and sometimes 16), but BW dropped with more
> >>> sessions. However, implementing per-txq vhost improved
> >>> BW significantly all the way to 128 sessions.
> >> Why were vhost kernel changes required? Can't you just instantiate
> >> more vhost queues?
> > I did try using a single thread processing packets from multiple
> > vq's on host, but the BW dropped beyond a certain number of
> > sessions.
>
> Oh - so the interface has not changed (which can be seen from the
> patch). That was my concern, I remembered that we planned for vhost-net
> to be multiqueue-ready.
>
> The new guest and qemu code work with old vhost-net, just with reduced
> performance, yes?
Yes, I have tested new guest/qemu with old vhost but using
#numtxqs=1 (or not passing any arguments at all to qemu to
enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
since vhost_net_set_backend in the unmodified vhost checks
for boundary overflow.
I have also tested running an unmodified guest with new
vhost/qemu, but qemu should not specify numtxqs>1.
> > Are you suggesting this
> > combination:
> > IRQ on guest:
> > 40: CPU0
> > 41: CPU1
> > 42: CPU2
> > 43: CPU3 (all CPUs are on socket #0)
> > vhost:
> > thread #0: CPU0
> > thread #1: CPU1
> > thread #2: CPU2
> > thread #3: CPU3
> > qemu:
> > thread #0: CPU4
> > thread #1: CPU5
> > thread #2: CPU6
> > thread #3: CPU7 (all CPUs are on socket#1)
>
> May be better to put vcpu threads and vhost threads on the same socket.
>
> Also need to affine host interrupts.
>
> > netperf/netserver:
> > Run on CPUs 0-4 on both sides
> >
> > The reason I did not optimize anything from user space is because
> > I felt showing the default works reasonably well is important.
>
> Definitely. Heavy tuning is not a useful path for general end users.
> We need to make sure the the scheduler is able to arrive at the optimal
> layout without pinning (but perhaps with hints).
OK, I will see if I can get results with this.
Thanks for your suggestions,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 9:23 ` Krishna Kumar2
@ 2010-09-08 10:48 ` Michael S. Tsirkin
2010-09-08 12:19 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-08 10:48 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty, rick.jones2
On Wed, Sep 08, 2010 at 02:53:03PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:
>
> >
> > _______________________________________________________________________________
> > > TCP (#numtxqs=2)
> > > N#  BW1    BW2    (%)      SD1    SD2    (%)      RSD1   RSD2   (%)
> > _______________________________________________________________________________
> > > 4   26387  40716  (54.30)  20     28     (40.00)  86     85     (-1.16)
> > > 8   24356  41843  (71.79)  88     129    (46.59)  372    362    (-2.68)
> > > 16  23587  40546  (71.89)  375    564    (50.40)  1558   1519   (-2.50)
> > > 32  22927  39490  (72.24)  1617   2171   (34.26)  6694   5722   (-14.52)
> > > 48  23067  39238  (70.10)  3931   5170   (31.51)  15823  13552  (-14.35)
> > > 64  22927  38750  (69.01)  7142   9914   (38.81)  28972  26173  (-9.66)
> > > 96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
> >
> > That's a significant hit in TCP SD. Is it caused by the imbalance between
> > number of queues for TX and RX? Since you mention RX is complete,
> > maybe measure with a balanced TX/RX?
>
> Yes, I am not sure why it is so high.
Any errors at higher levels? Are any packets reordered?
> I found the same with #RX=#TX
> too. As a hack, I tried ixgbe without MQ (set "indices=1" before
> calling alloc_etherdev_mq, not sure if that is entirely correct) -
> here too SD worsened by around 40%. I can't explain it, since the
> virtio-net driver runs lock free once sch_direct_xmit gets
> HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not
> strictly correct since more threads are now running parallel and
> load is higher? Eg, if you compare SD between #netperfs = 8 vs 16
> for original code (cut-n-paste relevant columns only) ...
>
> N# BW SD
> 8 24356 88
> 16 23587 375
>
> ... SD has increased more than 4 times for the same BW.
>
> > What happens with a single netperf?
> > host -> guest performance with TCP and small packet speed
> > are also worth measuring.
>
> OK, I will do this and send the results later today.
>
> > At some level, host/guest communication is easy in that we don't really
> > care which queue is used. I would like to give some thought (and
> > testing) to how is this going to work with a real NIC card and packet
> > steering at the backend.
> > Any idea?
>
> I have done a little testing with guest -> remote server both
> using a bridge and with macvtap (mq is required only for rx).
> I didn't understand what you mean by packet steering though,
> is it whether packets go out of the NIC on different queues?
> If so, I verified that is the case by putting a counter and
> displaying through /debug interface on the host. dev_queue_xmit
> on the host handles it by calling dev_pick_tx().
>
> > > Guest interrupts for a 4 TXQ device after a 5 min test:
> > > # egrep "virtio0|CPU" /proc/interrupts
> > > CPU0 CPU1 CPU2 CPU3
> > > 40: 0 0 0 0 PCI-MSI-edge virtio0-config
> > > 41: 126955 126912 126505 126940 PCI-MSI-edge virtio0-input
> > > 42: 108583 107787 107853 107716 PCI-MSI-edge virtio0-output.0
> > > 43: 300278 297653 299378 300554 PCI-MSI-edge virtio0-output.1
> > > 44: 372607 374884 371092 372011 PCI-MSI-edge virtio0-output.2
> > > 45: 162042 162261 163623 162923 PCI-MSI-edge virtio0-output.3
> >
> > Does this mean each interrupt is constantly bouncing between CPUs?
>
> Yes. I didn't do *any* tuning for the tests. The only "tuning"
> was to use 64K IO size with netperf. When I ran default netperf
> (16K), I got a little lesser improvement in BW and worse(!) SD
> than with 64K.
>
> Thanks,
>
> - KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 10:48 ` Michael S. Tsirkin
@ 2010-09-08 12:19 ` Krishna Kumar2
0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 12:19 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rick.jones2, rusty
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 04:18:33 PM:
>
> _______________________________________________________________________________
> > > > TCP (#numtxqs=2)
> > > > N#  BW1    BW2    (%)      SD1    SD2    (%)      RSD1   RSD2   (%)
> > _______________________________________________________________________________
> > > > 4   26387  40716  (54.30)  20     28     (40.00)  86     85     (-1.16)
> > > > 8   24356  41843  (71.79)  88     129    (46.59)  372    362    (-2.68)
> > > > 16  23587  40546  (71.89)  375    564    (50.40)  1558   1519   (-2.50)
> > > > 32  22927  39490  (72.24)  1617   2171   (34.26)  6694   5722   (-14.52)
> > > > 48  23067  39238  (70.10)  3931   5170   (31.51)  15823  13552  (-14.35)
> > > > 64  22927  38750  (69.01)  7142   9914   (38.81)  28972  26173  (-9.66)
> > > > 96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
> > >
> > > That's a significant hit in TCP SD. Is it caused by the imbalance
> > > between number of queues for TX and RX? Since you mention RX is
> > > complete, maybe measure with a balanced TX/RX?
> >
> > Yes, I am not sure why it is so high.
>
> Any errors at higher levels? Are any packets reordered?
I haven't seen any messages logged, and retransmissions are similar
to the non-mq case. The device also has no errors/dropped packets.
Anything else I should look for?
On the host:
# ifconfig vnet0
vnet0 Link encap:Ethernet HWaddr 9A:9D:99:E1:CA:CE
inet6 addr: fe80::989d:99ff:fee1:cace/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5090371 errors:0 dropped:0 overruns:0 frame:0
TX packets:5054616 errors:0 dropped:0 overruns:65 carrier:0
collisions:0 txqueuelen:500
RX bytes:237793761392 (221.4 GiB) TX bytes:333630070 (318.1 MiB)
# netstat -s |grep -i retrans
1310 segments retransmited
35 times recovered from packet loss due to fast retransmit
1 timeouts after reno fast retransmit
41 fast retransmits
1236 retransmits in slow start
So retransmissions are 0.025% of the total packets received from the guest.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 10:17 ` Krishna Kumar2
@ 2010-09-08 14:12 ` Arnd Bergmann
2010-09-08 16:47 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Arnd Bergmann @ 2010-09-08 14:12 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Avi Kivity, anthony, davem, kvm, mst, netdev, rusty
On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > The new guest and qemu code work with old vhost-net, just with reduced
> > performance, yes?
>
> Yes, I have tested new guest/qemu with old vhost but using
> #numtxqs=1 (or not passing any arguments at all to qemu to
> enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> since vhost_net_set_backend in the unmodified vhost checks
> for boundary overflow.
>
> I have also tested running an unmodified guest with new
> vhost/qemu, but qemu should not specify numtxqs>1.
Can you live migrate a new guest from new-qemu/new-kernel
to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
If not, do we need to support all those cases?
Arnd
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 14:12 ` Arnd Bergmann
@ 2010-09-08 16:47 ` Krishna Kumar2
2010-09-09 10:40 ` Arnd Bergmann
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 16:47 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty
> On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > The new guest and qemu code work with old vhost-net, just with
> > > reduced performance, yes?
> >
> > Yes, I have tested new guest/qemu with old vhost but using
> > #numtxqs=1 (or not passing any arguments at all to qemu to
> > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > since vhost_net_set_backend in the unmodified vhost checks
> > for boundary overflow.
> >
> > I have also tested running an unmodified guest with new
> > vhost/qemu, but qemu should not specify numtxqs>1.
>
> Can you live migrate a new guest from new-qemu/new-kernel
> to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> If not, do we need to support all those cases?
I have not tried this, though I added some minimal code in
virtio_net_load and virtio_net_save. I don't know what needs
to be done exactly at this time. I forgot to put this in the
"Next steps" list of things to do.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 8:10 ` Michael S. Tsirkin
2010-09-08 9:23 ` Krishna Kumar2
@ 2010-09-08 16:47 ` Krishna Kumar2
[not found] ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
2 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-08 16:47 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/08/2010 01:40:11 PM:
>
_______________________________________________________________________________
> > UDP (#numtxqs=8)
> > N# BW1 BW2 (%) SD1 SD2 (%)
> > __________________________________________________________
> > 4 29836 56761 (90.24) 67 63 (-5.97)
> > 8 27666 63767 (130.48) 326 265 (-18.71)
> > 16 25452 60665 (138.35) 1396 1269 (-9.09)
> > 32 26172 63491 (142.59) 5617 4202 (-25.19)
> > 48 26146 64629 (147.18) 12813 9316 (-27.29)
> > 64 25575 65448 (155.90) 23063 16346 (-29.12)
> > 128 26454 63772 (141.06) 91054 85051 (-6.59)
> > __________________________________________________________
> > N#: Number of netperf sessions, 90 sec runs
> > BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
> > SD for original code
> > BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
> > SD for new code. e.g. BW2=40716 means average BW2 was
> > 20358 mbps.
> >
>
> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.
Guest -> Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.
Host -> guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.
Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
for runs up to 128 sessions. With fewer netperfs (under 8), there
was a drop of 3-7% in #packets, but beyond that the #packets
improved significantly. So it seems that fewer sessions have a
negative effect on the tx side for some reason. The code path in
virtio-net has not changed much, so the drop in some cases is
quite unexpected.
Thanks,
- KK
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-08 7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
@ 2010-09-09 3:49 ` Rusty Russell
2010-09-09 5:23 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Rusty Russell @ 2010-09-09 3:49 UTC (permalink / raw)
To: Krishna Kumar; +Cc: davem, netdev, anthony, kvm, mst
On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> Add virtio_get_queue_index() to get the queue index of a
> vq. This is needed by the cb handler to locate the queue
> that should be processed.
This seems a bit weird. I mean, the driver used vdev->config->find_vqs
to find the queues, which returns them (in order). So, can't you put this
into your struct send_queue?
Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them,
it should simply not use them...
Thanks!
Rusty.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-09 3:49 ` Rusty Russell
@ 2010-09-09 5:23 ` Krishna Kumar2
2010-09-09 12:14 ` Rusty Russell
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 5:23 UTC (permalink / raw)
To: Rusty Russell; +Cc: anthony, davem, kvm, mst, netdev
Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 09:19:39 AM:
> On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> > Add virtio_get_queue_index() to get the queue index of a
> > vq. This is needed by the cb handler to locate the queue
> > that should be processed.
>
> This seems a bit weird. I mean, the driver used vdev->config->find_vqs
> to find the queues, which returns them (in order). So, can't you put
> this into your struct send_queue?
I am saving the vqs in the send_queue, but the cb needs
to locate the device txq from the svq. The only other way
I could think of is to iterate through the send_queue's
and compare svq against sq[i]->svq, but cb's happen quite
a bit. Is there a better way?
static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;
	int qnum = virtio_get_queue_index(svq) - 1;	/* 0 is RX vq */

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	netif_wake_subqueue(vi->dev, qnum);
}
> Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them,
> it should simply not use them...
The main reason was vhost :) Since vhost_net_release
should not fail (__fput can't handle f_op->release()
failure), I needed a maximum number of socks to
clean up:
#define MAX_VQS	(1 + VIRTIO_MAX_TXQS)

static int vhost_net_release(struct inode *inode, struct file *f)
{
	struct vhost_net *n = f->private_data;
	struct vhost_dev *dev = &n->dev;
	struct socket *socks[MAX_VQS];
	int i;

	vhost_net_stop(n, socks);
	vhost_net_flush(n);
	vhost_dev_cleanup(dev);
	for (i = n->dev.nvqs - 1; i >= 0; i--)
		if (socks[i])
			fput(socks[i]->file);
	...
}
Thanks,
- KK
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
[not found] ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
@ 2010-09-09 9:45 ` Krishna Kumar2
2010-09-09 23:00 ` Sridhar Samudrala
2010-09-12 11:40 ` Michael S. Tsirkin
[not found] ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
1 sibling, 2 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 9:45 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty
> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
Some more results and likely cause for single netperf
degradation below.
> Guest -> Host (single netperf):
> I am getting a drop of almost 20%. I am trying to figure out
> why.
>
> Host -> guest (single netperf):
> I am getting an improvement of almost 15%. Again - unexpected.
>
> Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> for runs upto 128 sessions. With fewer netperf (under 8), there
> was a drop of 3-7% in #packets, but beyond that, the #packets
> improved significantly to give an average improvement of 7.4%.
>
> So it seems that fewer sessions is having negative effect for
> some reason on the tx side. The code path in virtio-net has not
> changed much, so the drop in some cases is quite unexpected.
The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:
Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%
Single vhost performs well but hits the barrier at 16 netperf
sessions:
SINGLE vhost (Guest -> Host):
1 netperf: BW: 10.7% SD: -1.4%
4 netperfs: BW: 3% SD: 1.4%
8 netperfs: BW: 17.7% SD: -10%
16 netperfs: BW: 4.7% SD: -7.0%
32 netperfs: BW: -6.1% SD: -5.7%
BW and SD both improve (guest multiple txqs help). For 32
netperfs, only SD improves.
But with multiple vhosts, guest is able to send more packets
and BW increases much more (SD too increases, but I think
that is expected). From the earlier results:
N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
_______________________________________________________________________________
4   26387  40716  (54.30)     20     28  (40.00)     86     85  (-1.16)
8   24356  41843  (71.79)     88    129  (46.59)    372    362  (-2.68)
16  23587  40546  (71.89)    375    564  (50.40)   1558   1519  (-2.50)
32  22927  39490  (72.24)   1617   2171  (34.26)   6694   5722  (-14.52)
48  23067  39238  (70.10)   3931   5170  (31.51)  15823  13552  (-14.35)
64  22927  38750  (69.01)   7142   9914  (38.81)  28972  26173  (-9.66)
96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
_______________________________________________________________________________
(All tests were done without any tuning)
From my testing:
1. Single vhost improves mq guest performance upto 16
netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
performance, but significantly improves performance
for any number of netperf sessions.
Likely cause for the 1 stream degradation with multiple
vhost patch:
1. Two vhosts run handling the RX and TX respectively.
I think the issue is related to cache ping-pong esp
since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
performance drop is the same, but the reason is the
guest is not using txq[0] in most cases (dev_pick_tx),
so vhost's rx and tx are running on different threads.
But whenever the guest uses txq[0], only one vhost
runs and the performance is similar to original.
I went back to my *submitted* patch and started a guest
with numtxq=16 and pinned every vhost to cpus #0&1. Now
whether guest used txq[0] or txq[n], the performance is
similar or better (between 10-27% across 10 runs) than
original code. Also, -6% to -24% improvement in SD.
I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.
Thanks,
- KK
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-08 16:47 ` Krishna Kumar2
@ 2010-09-09 10:40 ` Arnd Bergmann
2010-09-09 13:19 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Arnd Bergmann @ 2010-09-09 10:40 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty
On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > > The new guest and qemu code work with old vhost-net, just with reduced
> > > > performance, yes?
> > >
> > > Yes, I have tested new guest/qemu with old vhost but using
> > > #numtxqs=1 (or not passing any arguments at all to qemu to
> > > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > > since vhost_net_set_backend in the unmodified vhost checks
> > > for boundary overflow.
> > >
> > > I have also tested running an unmodified guest with new
> > > vhost/qemu, but qemu should not specify numtxqs>1.
> >
> > Can you live migrate a new guest from new-qemu/new-kernel
> > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > If not, do we need to support all those cases?
>
> I have not tried this, though I added some minimal code in
> virtio_net_load and virtio_net_save. I don't know what needs
> to be done exactly at this time. I forgot to put this in the
> "Next steps" list of things to do.
I was mostly trying to find out if you think it should work
or if there are specific reasons why it would not.
E.g. when migrating to a machine that has an old qemu, the guest
gets reduced to a single queue, but it's not clear to me how
it can learn about this, or if it can get hidden by the outbound
qemu.
Arnd
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-09 5:23 ` Krishna Kumar2
@ 2010-09-09 12:14 ` Rusty Russell
2010-09-09 13:49 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Rusty Russell @ 2010-09-09 12:14 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev
On Thu, 9 Sep 2010 02:53:52 pm Krishna Kumar2 wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 09:19:39 AM:
>
> > On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
> > > Add virtio_get_queue_index() to get the queue index of a
> > > vq. This is needed by the cb handler to locate the queue
> > > that should be processed.
> >
> > This seems a bit weird. I mean, the driver used vdev->config->find_vqs
> > to find the queues, which returns them (in order). So, can't you put this
> > into your struct send_queue?
>
> I am saving the vqs in the send_queue, but the cb needs
> to locate the device txq from the svq. The only other way
> I could think of is to iterate through the send_queue's
> and compare svq against sq[i]->svq, but cb's happen quite
> a bit. Is there a better way?
Ah, good point. Move the queue index into the struct virtqueue?
> > Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them,
> > it should simply not use them...
>
> The main reason was vhost :) Since vhost_net_release
> should not fail (__fput can't handle f_op->release()
> failure), I needed a maximum number of socks to
> clean up:
Ah, then it belongs in the vhost headers. The guest shouldn't see such
a restriction if it doesn't apply; it's a host thing.
Oh, and I think you could profitably use virtio_config_val(), too.
Thanks!
Rusty.
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
[not found] ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
@ 2010-09-09 13:18 ` Krishna Kumar2
0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:18 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty
Krishna Kumar2/India/IBM wrote on 09/09/2010 03:15:53 PM:
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
Same patch, only change is that I ran "taskset -p 03
<all vhost threads>", no other tuning on host or guest.
Default netperf without any options. The BW is the sum
across two iterations, each is 60secs. Guest is started
with 2 txqs.
BW1/BW2: BW for org & new in mbps
SD1/SD2: SD for org & new
RSD1/RSD2: Remote SD for org & new
_______________________________________________________________________________
# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
_______________________________________________________________________________
1    20903  19422  (-7.08)      1      1  (0)          6      7  (16.66)
2    21963  24330  (10.77)      6      6  (0)         25     25  (0)
4    22042  31841  (44.45)     23     28  (21.73)    102    110  (7.84)
8    21674  32045  (47.84)     97    111  (14.43)    419    421  (.47)
16   22281  31361  (40.75)    379    551  (45.38)   1663   2110  (26.87)
24   22521  31945  (41.84)    857    981  (14.46)   3748   3742  (-.16)
32   22976  32473  (41.33)   1528   1806  (18.19)   6594   6885  (4.41)
40   23197  32594  (40.50)   2390   2755  (15.27)  10239  10450  (2.06)
48   22973  32757  (42.58)   3542   3786  (6.88)   15074  14395  (-4.50)
64   23809  32814  (37.82)   6486   6981  (7.63)   27244  26381  (-3.16)
80   23564  32682  (38.69)  10169  11133  (9.47)   43118  41397  (-3.99)
96   22977  33069  (43.92)  14954  15881  (6.19)   62948  59071  (-6.15)
128  23649  33032  (39.67)  27067  28832  (6.52)  113892 106096  (-6.84)
_______________________________________________________________________________
    294534 400371  (35.9)   67504  72858  (7.9)   285077 271096  (-4.9)
_______________________________________________________________________________
I will try more tuning later as Avi suggested, wanted to test
the minimal for now.
Thanks,
- KK
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-09 10:40 ` Arnd Bergmann
@ 2010-09-09 13:19 ` Krishna Kumar2
0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:19 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: anthony, Avi Kivity, davem, kvm, mst, netdev, rusty
Arnd Bergmann <arnd@arndb.de> wrote on 09/09/2010 04:10:27 PM:
> > > Can you live migrate a new guest from new-qemu/new-kernel
> > > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > > If not, do we need to support all those cases?
> >
> > I have not tried this, though I added some minimal code in
> > virtio_net_load and virtio_net_save. I don't know what needs
> > to be done exactly at this time. I forgot to put this in the
> > "Next steps" list of things to do.
>
> I was mostly trying to find out if you think it should work
> or if there are specific reasons why it would not.
> E.g. when migrating to a machine that has an old qemu, the guest
> gets reduced to a single queue, but it's not clear to me how
> it can learn about this, or if it can get hidden by the outbound
> qemu.
I agree, I am also not sure how the old guest will handle this.
Sorry about my ignorance on migration :(
Regards,
- KK
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-09 12:14 ` Rusty Russell
@ 2010-09-09 13:49 ` Krishna Kumar2
2010-09-10 3:33 ` Rusty Russell
2010-09-12 11:46 ` Michael S. Tsirkin
0 siblings, 2 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-09 13:49 UTC (permalink / raw)
To: Rusty Russell; +Cc: anthony, davem, kvm, mst, netdev
Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 05:44:25 PM:
>
> > > This seems a bit weird. I mean, the driver used vdev->config->find_vqs
> > > to find the queues, which returns them (in order). So, can't you put this
> > > into your struct send_queue?
> >
> > I am saving the vqs in the send_queue, but the cb needs
> > to locate the device txq from the svq. The only other way
> > I could think of is to iterate through the send_queue's
> > and compare svq against sq[i]->svq, but cb's happen quite
> > a bit. Is there a better way?
>
> Ah, good point. Move the queue index into the struct virtqueue?
Is it OK to move the queue_index from virtio_pci_vq_info
to virtqueue? I didn't want to change any data structures
in virtio for this patch, but I can do it either way.
> > > Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them,
> > > it should simply not use them...
> >
> > The main reason was vhost :) Since vhost_net_release
> > should not fail (__fput can't handle f_op->release()
> > failure), I needed a maximum number of socks to
> > clean up:
>
> Ah, then it belongs in the vhost headers. The guest shouldn't see such
> a restriction if it doesn't apply; it's a host thing.
>
> Oh, and I think you could profitably use virtio_config_val(), too.
OK, I will make those changes. Thanks for the reference to
virtio_config_val(), I will use it in guest probe instead of
the cumbersome way I am doing now. Unfortunately I need a
constant in vhost for now.
Thanks,
- KK
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-09 9:45 ` Krishna Kumar2
@ 2010-09-09 23:00 ` Sridhar Samudrala
2010-09-10 5:19 ` Krishna Kumar2
2010-09-12 11:40 ` Michael S. Tsirkin
1 sibling, 1 reply; 43+ messages in thread
From: Sridhar Samudrala @ 2010-09-09 23:00 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty
On 9/9/2010 2:45 AM, Krishna Kumar2 wrote:
>> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
> Some more results and likely cause for single netperf
> degradation below.
>
>
>> Guest -> Host (single netperf):
>> I am getting a drop of almost 20%. I am trying to figure out
>> why.
>>
>> Host -> guest (single netperf):
>> I am getting an improvement of almost 15%. Again - unexpected.
>>
>> Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
>> for runs upto 128 sessions. With fewer netperf (under 8), there
>> was a drop of 3-7% in #packets, but beyond that, the #packets
>> improved significantly to give an average improvement of 7.4%.
>>
>> So it seems that fewer sessions is having negative effect for
>> some reason on the tx side. The code path in virtio-net has not
>> changed much, so the drop in some cases is quite unexpected.
> The drop for the single netperf seems to be due to multiple vhost.
> I changed the patch to start *single* vhost:
>
> Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%
I remember seeing a similar issue when using a separate vhost thread
for TX and RX queues. Basically, we should have the same vhost thread
process a TCP flow in both directions. I guess this allows the data
and ACKs to be processed in sync.
Thanks
Sridhar
> Single vhost performs well but hits the barrier at 16 netperf
> sessions:
>
> SINGLE vhost (Guest -> Host):
> 1 netperf: BW: 10.7% SD: -1.4%
> 4 netperfs: BW: 3% SD: 1.4%
> 8 netperfs: BW: 17.7% SD: -10%
> 16 netperfs: BW: 4.7% SD: -7.0%
> 32 netperfs: BW: -6.1% SD: -5.7%
> BW and SD both improves (guest multiple txqs help). For 32
> netperfs, SD improves.
>
> But with multiple vhosts, guest is able to send more packets
> and BW increases much more (SD too increases, but I think
> that is expected). From the earlier results:
>
> N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
> _______________________________________________________________________________
> 4   26387  40716  (54.30)     20     28  (40.00)     86     85  (-1.16)
> 8   24356  41843  (71.79)     88    129  (46.59)    372    362  (-2.68)
> 16  23587  40546  (71.89)    375    564  (50.40)   1558   1519  (-2.50)
> 32  22927  39490  (72.24)   1617   2171  (34.26)   6694   5722  (-14.52)
> 48  23067  39238  (70.10)   3931   5170  (31.51)  15823  13552  (-14.35)
> 64  22927  38750  (69.01)   7142   9914  (38.81)  28972  26173  (-9.66)
> 96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
> _______________________________________________________________________________
> (All tests were done without any tuning)
>
> From my testing:
>
> 1. Single vhost improves mq guest performance upto 16
> netperfs but degrades after that.
> 2. Multiple vhost degrades single netperf guest
> performance, but significantly improves performance
> for any number of netperf sessions.
>
> Likely cause for the 1 stream degradation with multiple
> vhost patch:
>
> 1. Two vhosts run handling the RX and TX respectively.
> I think the issue is related to cache ping-pong esp
> since these run on different cpus/sockets.
> 2. I (re-)modified the patch to share RX with TX[0]. The
> performance drop is the same, but the reason is the
> guest is not using txq[0] in most cases (dev_pick_tx),
> so vhost's rx and tx are running on different threads.
> But whenever the guest uses txq[0], only one vhost
> runs and the performance is similar to original.
>
> I went back to my *submitted* patch and started a guest
> with numtxq=16 and pinned every vhost to cpus #0&1. Now
> whether guest used txq[0] or txq[n], the performance is
> similar or better (between 10-27% across 10 runs) than
> original code. Also, -6% to -24% improvement in SD.
>
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
>
> Thanks,
>
> - KK
>
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-09 13:49 ` Krishna Kumar2
@ 2010-09-10 3:33 ` Rusty Russell
2010-09-12 11:46 ` Michael S. Tsirkin
1 sibling, 0 replies; 43+ messages in thread
From: Rusty Russell @ 2010-09-10 3:33 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, mst, netdev
On Thu, 9 Sep 2010 11:19:33 pm Krishna Kumar2 wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote on 09/09/2010 05:44:25 PM:
> > Ah, good point. Move the queue index into the struct virtqueue?
>
> Is it OK to move the queue_index from virtio_pci_vq_info
> to virtqueue? I didn't want to change any data structures
> in virtio for this patch, but I can do it either way.
Yep, it's logical to me.
Thanks!
Rusty.
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-09 23:00 ` Sridhar Samudrala
@ 2010-09-10 5:19 ` Krishna Kumar2
0 siblings, 0 replies; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-10 5:19 UTC (permalink / raw)
To: Sridhar Samudrala; +Cc: anthony, davem, kvm, Michael S. Tsirkin, netdev, rusty
Sridhar Samudrala <sri@us.ibm.com> wrote on 09/10/2010 04:30:24 AM:
> I remember seeing similar issue when using a separate vhost thread for
> TX and
> RX queues. Basically, we should have the same vhost thread process a
> TCP flow
> in both directions. I guess this allows the data and ACKs to be
> processed in sync.
I was trying that by sharing threads between rx and tx[0], but
that didn't work either since guest rarely picks txq=0. I was
able to get reasonable single stream performance by pinning
vhosts to the same cpu.
Thanks,
- KK
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-09 9:45 ` Krishna Kumar2
2010-09-09 23:00 ` Sridhar Samudrala
@ 2010-09-12 11:40 ` Michael S. Tsirkin
2010-09-13 4:12 ` Krishna Kumar2
1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-12 11:40 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty
On Thu, Sep 09, 2010 at 03:15:53PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
>
> Some more results and likely cause for single netperf
> degradation below.
>
>
> > Guest -> Host (single netperf):
> > I am getting a drop of almost 20%. I am trying to figure out
> > why.
> >
> > Host -> guest (single netperf):
> > I am getting an improvement of almost 15%. Again - unexpected.
> >
> > Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> > for runs upto 128 sessions. With fewer netperf (under 8), there
> > was a drop of 3-7% in #packets, but beyond that, the #packets
> > improved significantly to give an average improvement of 7.4%.
> >
> > So it seems that fewer sessions is having negative effect for
> > some reason on the tx side. The code path in virtio-net has not
> > changed much, so the drop in some cases is quite unexpected.
>
> The drop for the single netperf seems to be due to multiple vhost.
> I changed the patch to start *single* vhost:
>
> Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%
>
> Single vhost performs well but hits the barrier at 16 netperf
> sessions:
>
> SINGLE vhost (Guest -> Host):
> 1 netperf: BW: 10.7% SD: -1.4%
> 4 netperfs: BW: 3% SD: 1.4%
> 8 netperfs: BW: 17.7% SD: -10%
> 16 netperfs: BW: 4.7% SD: -7.0%
> 32 netperfs: BW: -6.1% SD: -5.7%
> BW and SD both improves (guest multiple txqs help). For 32
> netperfs, SD improves.
>
> But with multiple vhosts, guest is able to send more packets
> and BW increases much more (SD too increases, but I think
> that is expected).
Why is this expected?
> From the earlier results:
>
> N# BW1 BW2 (%) SD1 SD2 (%) RSD1 RSD2 (%)
> _______________________________________________________________________________
> 4   26387  40716  (54.30)     20     28  (40.00)     86     85  (-1.16)
> 8   24356  41843  (71.79)     88    129  (46.59)    372    362  (-2.68)
> 16  23587  40546  (71.89)    375    564  (50.40)   1558   1519  (-2.50)
> 32  22927  39490  (72.24)   1617   2171  (34.26)   6694   5722  (-14.52)
> 48  23067  39238  (70.10)   3931   5170  (31.51)  15823  13552  (-14.35)
> 64  22927  38750  (69.01)   7142   9914  (38.81)  28972  26173  (-9.66)
> 96  22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)
> _______________________________________________________________________________
> (All tests were done without any tuning)
>
> From my testing:
>
> 1. Single vhost improves mq guest performance upto 16
> netperfs but degrades after that.
> 2. Multiple vhost degrades single netperf guest
> performance, but significantly improves performance
> for any number of netperf sessions.
>
> Likely cause for the 1 stream degradation with multiple
> vhost patch:
>
> 1. Two vhosts run handling the RX and TX respectively.
> I think the issue is related to cache ping-pong esp
> since these run on different cpus/sockets.
Right. With TCP I think we are better off handling
TX and RX for a socket by the same vhost, so that
packet and its ack are handled by the same thread.
Is this what happens with RX multiqueue patch?
How do we select an RX queue to put the packet on?
> 2. I (re-)modified the patch to share RX with TX[0]. The
> performance drop is the same, but the reason is the
> guest is not using txq[0] in most cases (dev_pick_tx),
> so vhost's rx and tx are running on different threads.
> But whenever the guest uses txq[0], only one vhost
> runs and the performance is similar to original.
>
> I went back to my *submitted* patch and started a guest
> with numtxq=16 and pinned every vhost to cpus #0&1. Now
> whether guest used txq[0] or txq[n], the performance is
> similar or better (between 10-27% across 10 runs) than
> original code. Also, -6% to -24% improvement in SD.
>
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
>
> Thanks,
>
> - KK
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-09 13:49 ` Krishna Kumar2
2010-09-10 3:33 ` Rusty Russell
@ 2010-09-12 11:46 ` Michael S. Tsirkin
2010-09-13 4:20 ` Krishna Kumar2
1 sibling, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-12 11:46 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Rusty Russell, anthony, davem, kvm, netdev
On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> Unfortunately I need a
> constant in vhost for now.
Maybe not even that: you create multiple vhost-net
devices so vhost-net in kernel does not care about these
either, right? So this can be just part of vhost_net.h
in qemu.
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-12 11:40 ` Michael S. Tsirkin
@ 2010-09-13 4:12 ` Krishna Kumar2
2010-09-13 11:50 ` Michael S. Tsirkin
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13 4:12 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, rusty
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:10:25 PM:
> > SINGLE vhost (Guest -> Host):
> > 1 netperf: BW: 10.7% SD: -1.4%
> > 4 netperfs: BW: 3% SD: 1.4%
> > 8 netperfs: BW: 17.7% SD: -10%
> > 16 netperfs: BW: 4.7% SD: -7.0%
> > 32 netperfs: BW: -6.1% SD: -5.7%
> > BW and SD both improves (guest multiple txqs help). For 32
> > netperfs, SD improves.
> >
> > But with multiple vhosts, guest is able to send more packets
> > and BW increases much more (SD too increases, but I think
> > that is expected).
>
> Why is this expected?
Results with the original kernel:
_____________________________
# BW SD RSD
______________________________
1 20903 1 6
2 21963 6 25
4 22042 23 102
8 21674 97 419
16 22281 379 1663
24 22521 857 3748
32 22976 1528 6594
40 23197 2390 10239
48 22973 3542 15074
64 23809 6486 27244
80 23564 10169 43118
96 22977 14954 62948
128 23649 27067 113892
________________________________
With higher number of threads running in parallel, SD
increased. In this case most threads run in parallel
only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
higher number of threads run in parallel through
ndo_start_xmit. I *think* the increase in SD is due to
the higher # of threads running through a larger code path.
From the numbers I posted with the patch (cut-n-paste
only the % parts), BW increased much more than the SD,
sometimes more than twice the increase in SD.
N# BW% SD% RSD%
4 54.30 40.00 -1.16
8 71.79 46.59 -2.68
16 71.89 50.40 -2.50
32 72.24 34.26 -14.52
48 70.10 31.51 -14.35
64 69.01 38.81 -9.66
96 70.68 71.26 10.74
I also think SD calculation gets skewed for guest->local
host testing. For this test, I ran a guest with numtxqs=16.
The first result below is with my patch, which creates 16
vhosts. The second result is with a modified patch which
creates only 2 vhosts (testing with #netperfs = 64):
#vhosts BW% SD% RSD%
16 20.79 186.01 149.74
2 30.89 34.55 18.44
The remote SD increases with the number of vhost threads,
but that number seems to correlate with guest SD. So though
BW% increased slightly from 20% to 30%, SD fell drastically
from 186% to 34%. I think it could be a calculation skew
with host SD, which also fell from 150% to 18%.
I am planning to submit 2nd patch rev with restricted
number of vhosts.
> > Likely cause for the 1 stream degradation with multiple
> > vhost patch:
> >
> > 1. Two vhosts run handling the RX and TX respectively.
> > I think the issue is related to cache ping-pong esp
> > since these run on different cpus/sockets.
>
> Right. With TCP I think we are better off handling
> TX and RX for a socket by the same vhost, so that
> packet and its ack are handled by the same thread.
> Is this what happens with RX multiqueue patch?
> How do we select an RX queue to put the packet on?
My (unsubmitted) RX patch doesn't do this yet, that is
something I will check.
Thanks,
- KK
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-12 11:46 ` Michael S. Tsirkin
@ 2010-09-13 4:20 ` Krishna Kumar2
2010-09-13 9:04 ` Michael S. Tsirkin
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13 4:20 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, davem, kvm, netdev, Rusty Russell
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> > Unfortunately I need a
> > constant in vhost for now.
>
> Maybe not even that: you create multiple vhost-net
> devices so vhost-net in kernel does not care about these
> either, right? So this can be just part of vhost_net.h
> in qemu.
Sorry, I didn't understand what you meant.
I can remove all socks[] arrays/constants by pre-allocating
sockets in vhost_setup_vqs. Then I can remove all "socks"
parameters in vhost_net_stop, vhost_net_release and
vhost_net_reset_owner.
Does this make sense?
Thanks,
- KK
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 4:20 ` Krishna Kumar2
@ 2010-09-13 9:04 ` Michael S. Tsirkin
2010-09-13 15:59 ` Anthony Liguori
0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13 9:04 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, Rusty Russell
On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
>
> > On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> > > Unfortunately I need a
> > > constant in vhost for now.
> >
> > Maybe not even that: you create multiple vhost-net
> > devices so vhost-net in kernel does not care about these
> > either, right? So this can be just part of vhost_net.h
> > in qemu.
>
> Sorry, I didn't understand what you meant.
>
> I can remove all socks[] arrays/constants by pre-allocating
> sockets in vhost_setup_vqs. Then I can remove all "socks"
> parameters in vhost_net_stop, vhost_net_release and
> vhost_net_reset_owner.
>
> Does this make sense?
>
> Thanks,
>
> - KK
Here's what I mean: each vhost device includes 1 TX
and 1 RX VQ. Instead of teaching vhost about multiqueue,
we could simply open /dev/vhost-net multiple times.
How many times would be up to qemu.
--
MST
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-13 4:12 ` Krishna Kumar2
@ 2010-09-13 11:50 ` Michael S. Tsirkin
2010-09-13 16:23 ` Krishna Kumar2
0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13 11:50 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, davem, kvm, netdev, rusty, avi
On Mon, Sep 13, 2010 at 09:42:22AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/12/2010 05:10:25 PM:
>
> > > SINGLE vhost (Guest -> Host):
> > > 1 netperf: BW: 10.7% SD: -1.4%
> > > 4 netperfs: BW: 3% SD: 1.4%
> > > 8 netperfs: BW: 17.7% SD: -10%
> > > 16 netperfs: BW: 4.7% SD: -7.0%
> > > 32 netperfs: BW: -6.1% SD: -5.7%
> > > BW and SD both improves (guest multiple txqs help). For 32
> > > netperfs, SD improves.
> > >
> > > But with multiple vhosts, guest is able to send more packets
> > > and BW increases much more (SD too increases, but I think
> > > that is expected).
> >
> > Why is this expected?
>
> Results with the original kernel:
> _____________________________
> # BW SD RSD
> ______________________________
> 1 20903 1 6
> 2 21963 6 25
> 4 22042 23 102
> 8 21674 97 419
> 16 22281 379 1663
> 24 22521 857 3748
> 32 22976 1528 6594
> 40 23197 2390 10239
> 48 22973 3542 15074
> 64 23809 6486 27244
> 80 23564 10169 43118
> 96 22977 14954 62948
> 128 23649 27067 113892
> ________________________________
>
> With a higher number of threads running in parallel, SD
> increased. In this case most threads run in parallel
> only till __dev_xmit_skb (#numtxqs=1). With the mq TX patch,
> a higher number of threads run in parallel through
> ndo_start_xmit. I *think* the increase in SD has to do
> with the higher # of threads running through a larger code path.
> From the numbers I posted with the patch (cut-n-paste
> only the % parts), BW increased much more than the SD,
> sometimes more than twice the increase in SD.
Service demand is BW/CPU, right? So if BW goes up by 50%
and SD by 40%, this means that CPU more than doubled.
> N# BW% SD% RSD%
> 4 54.30 40.00 -1.16
> 8 71.79 46.59 -2.68
> 16 71.89 50.40 -2.50
> 32 72.24 34.26 -14.52
> 48 70.10 31.51 -14.35
> 64 69.01 38.81 -9.66
> 96 70.68 71.26 10.74
>
> I also think SD calculation gets skewed for guest->local
> host testing.
If it's broken, let's fix it?
> For this test, I ran a guest with numtxqs=16.
> The first result below is with my patch, which creates 16
> vhosts. The second result is with a modified patch which
> creates only 2 vhosts (testing with #netperfs = 64):
My guess is it's not a good idea to have more TX VQs than guest CPUs.
I realize for management it's easier to pass in a single vhost fd, but
just for testing it's probably easier to add code in userspace to open
/dev/vhost multiple times.
>
> #vhosts BW% SD% RSD%
> 16 20.79 186.01 149.74
> 2 30.89 34.55 18.44
>
> The remote SD increases with the number of vhost threads,
> but that number seems to correlate with guest SD. So though
> BW% increased slightly from 20% to 30%, SD fell drastically
> from 186% to 34%. I think it could be a calculation skew
> with host SD, which also fell from 150% to 18%.
I think by default netperf looks in /proc/stat for CPU utilization data:
so host CPU utilization will include the guest CPU, I think?
I would go further and claim that for host/guest TCP,
CPU utilization and SD should always be identical.
Makes sense?
>
> I am planning to submit 2nd patch rev with restricted
> number of vhosts.
>
> > > Likely cause for the 1 stream degradation with multiple
> > > vhost patch:
> > >
> > > 1. Two vhosts run handling the RX and TX respectively.
> > > I think the issue is related to cache ping-pong esp
> > > since these run on different cpus/sockets.
> >
> > Right. With TCP I think we are better off handling
> > TX and RX for a socket by the same vhost, so that
> > packet and its ack are handled by the same thread.
> > Is this what happens with RX multiqueue patch?
> > How do we select an RX queue to put the packet on?
>
> My (unsubmitted) RX patch doesn't do this yet, that is
> something I will check.
>
> Thanks,
>
> - KK
You'll want to work on top of net-next; I think there's
RX flow filtering work going on there.
--
MST
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 9:04 ` Michael S. Tsirkin
@ 2010-09-13 15:59 ` Anthony Liguori
2010-09-13 16:30 ` Michael S. Tsirkin
0 siblings, 1 reply; 43+ messages in thread
From: Anthony Liguori @ 2010-09-13 15:59 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
>
>> "Michael S. Tsirkin"<mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
>>
>>
>>> "Michael S. Tsirkin"<mst@redhat.com>
>>> 09/12/2010 05:16 PM
>>>
>>> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
>>>
>>>> Unfortunately I need a
>>>> constant in vhost for now.
>>>>
>>> Maybe not even that: you create multiple vhost-net
>>> devices so vhost-net in kernel does not care about these
>>> either, right? So this can be just part of vhost_net.h
>>> in qemu.
>>>
>> Sorry, I didn't understand what you meant.
>>
>> I can remove all socks[] arrays/constants by pre-allocating
>> sockets in vhost_setup_vqs. Then I can remove all "socks"
>> parameters in vhost_net_stop, vhost_net_release and
>> vhost_net_reset_owner.
>>
>> Does this make sense?
>>
>> Thanks,
>>
>> - KK
>>
> Here's what I mean: each vhost device includes 1 TX
> and 1 RX VQ. Instead of teaching vhost about multiqueue,
> we could simply open /dev/vhost-net multiple times.
> How many times would be up to qemu.
>
Trouble is, each vhost-net device is associated with 1 tun/tap device,
which means that each vhost-net device is associated with a single
transmit and receive queue.
I don't know if you'll always have an equal number of transmit and
receive queues, but there's certainly a challenge in terms of flexibility
with this model.
Regards,
Anthony Liguori
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-13 11:50 ` Michael S. Tsirkin
@ 2010-09-13 16:23 ` Krishna Kumar2
2010-09-15 5:33 ` Michael S. Tsirkin
0 siblings, 1 reply; 43+ messages in thread
From: Krishna Kumar2 @ 2010-09-13 16:23 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: anthony, avi, davem, kvm, netdev, rusty, rick.jones2
"Michael S. Tsirkin" <mst@redhat.com> wrote on 09/13/2010 05:20:55 PM:
> > Results with the original kernel:
> > _____________________________
> > # BW SD RSD
> > ______________________________
> > 1 20903 1 6
> > 2 21963 6 25
> > 4 22042 23 102
> > 8 21674 97 419
> > 16 22281 379 1663
> > 24 22521 857 3748
> > 32 22976 1528 6594
> > 40 23197 2390 10239
> > 48 22973 3542 15074
> > 64 23809 6486 27244
> > 80 23564 10169 43118
> > 96 22977 14954 62948
> > 128 23649 27067 113892
> > ________________________________
> >
> > With a higher number of threads running in parallel, SD
> > increased. In this case most threads run in parallel
> > only till __dev_xmit_skb (#numtxqs=1). With the mq TX patch,
> > a higher number of threads run in parallel through
> > ndo_start_xmit. I *think* the increase in SD has to do
> > with the higher # of threads running through a larger code path.
> > From the numbers I posted with the patch (cut-n-paste
> > only the % parts), BW increased much more than the SD,
> > sometimes more than twice the increase in SD.
>
> Service demand is BW/CPU, right? So if BW goes up by 50%
> and SD by 40%, this means that CPU more than doubled.
I think the SD calculation might be more complicated;
it seems to be based on adding up averages sampled
and stored during the run. But I still don't see how CPU
can double? e.g.:
BW: 1000 -> 1500 (50%)
SD: 100 -> 140 (40%)
CPU: 10 -> 10.71 (7.1%)
> > N# BW% SD% RSD%
> > 4 54.30 40.00 -1.16
> > 8 71.79 46.59 -2.68
> > 16 71.89 50.40 -2.50
> > 32 72.24 34.26 -14.52
> > 48 70.10 31.51 -14.35
> > 64 69.01 38.81 -9.66
> > 96 70.68 71.26 10.74
> >
> > I also think SD calculation gets skewed for guest->local
> > host testing.
>
> If it's broken, let's fix it?
>
> > For this test, I ran a guest with numtxqs=16.
> > The first result below is with my patch, which creates 16
> > vhosts. The second result is with a modified patch which
> > creates only 2 vhosts (testing with #netperfs = 64):
>
> My guess is it's not a good idea to have more TX VQs than guest CPUs.
Definitely, I will try to run tomorrow with more reasonable
values, also will test with my second version of the patch
that creates restricted number of vhosts and post results.
> I realize for management it's easier to pass in a single vhost fd, but
> just for testing it's probably easier to add code in userspace to open
> /dev/vhost multiple times.
>
> >
> > #vhosts BW% SD% RSD%
> > 16 20.79 186.01 149.74
> > 2 30.89 34.55 18.44
> >
> > The remote SD increases with the number of vhost threads,
> > but that number seems to correlate with guest SD. So though
> > BW% increased slightly from 20% to 30%, SD fell drastically
> > from 186% to 34%. I think it could be a calculation skew
> > with host SD, which also fell from 150% to 18%.
>
> I think by default netperf looks in /proc/stat for CPU utilization data:
> so host CPU utilization will include the guest CPU, I think?
It appears that way to me too, but the data above seems to
suggest the opposite...
> I would go further and claim that for host/guest TCP
> CPU utilization and SD should always be identical.
> Makes sense?
It makes sense to me, but once again I am not sure how SD
is really done, or whether it is linear to CPU. Cc'ing Rick
in case he can comment....
>
> >
> > I am planning to submit 2nd patch rev with restricted
> > number of vhosts.
> >
> > > > Likely cause for the 1 stream degradation with multiple
> > > > vhost patch:
> > > >
> > > > 1. Two vhosts run handling the RX and TX respectively.
> > > > I think the issue is related to cache ping-pong esp
> > > > since these run on different cpus/sockets.
> > >
> > > Right. With TCP I think we are better off handling
> > > TX and RX for a socket by the same vhost, so that
> > > packet and its ack are handled by the same thread.
> > > Is this what happens with RX multiqueue patch?
> > > How do we select an RX queue to put the packet on?
> >
> > My (unsubmitted) RX patch doesn't do this yet, that is
> > something I will check.
> >
> > Thanks,
> >
> > - KK
>
> You'll want to work on top of net-next, I think there's
> RX flow filtering work going on there.
Thanks Michael, I will follow up on that for the RX patch,
plus your suggestion on tying RX with TX.
Thanks,
- KK
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 15:59 ` Anthony Liguori
@ 2010-09-13 16:30 ` Michael S. Tsirkin
2010-09-13 17:00 ` Avi Kivity
2010-09-13 17:40 ` Anthony Liguori
0 siblings, 2 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-13 16:30 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
> On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> >>"Michael S. Tsirkin"<mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
> >>
> >>>"Michael S. Tsirkin"<mst@redhat.com>
> >>>09/12/2010 05:16 PM
> >>>
> >>>On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> >>>>Unfortunately I need a
> >>>>constant in vhost for now.
> >>>Maybe not even that: you create multiple vhost-net
> >>>devices so vhost-net in kernel does not care about these
> >>>either, right? So this can be just part of vhost_net.h
> >>>in qemu.
> >>Sorry, I didn't understand what you meant.
> >>
> >>I can remove all socks[] arrays/constants by pre-allocating
> >>sockets in vhost_setup_vqs. Then I can remove all "socks"
> >>parameters in vhost_net_stop, vhost_net_release and
> >>vhost_net_reset_owner.
> >>
> >>Does this make sense?
> >>
> >>Thanks,
> >>
> >>- KK
> >Here's what I mean: each vhost device includes 1 TX
> >and 1 RX VQ. Instead of teaching vhost about multiqueue,
> >we could simply open /dev/vhost-net multiple times.
> >How many times would be up to qemu.
>
> Trouble is, each vhost-net device is associated with 1 tun/tap
> device which means that each vhost-net device is associated with a
> transmit and receive queue.
>
> I don't know if you'll always have an equal number of transmit and
> receive queues but there's certainly challenge in terms of
> flexibility with this model.
>
> Regards,
>
> Anthony Liguori
Not really: TX and RX can be mapped to different devices,
or you can map only one of these. What is the trouble?
What other features would you desire in terms of flexibility?
--
MST
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 16:30 ` Michael S. Tsirkin
@ 2010-09-13 17:00 ` Avi Kivity
2010-09-15 5:35 ` Michael S. Tsirkin
2010-09-13 17:40 ` Anthony Liguori
1 sibling, 1 reply; 43+ messages in thread
From: Avi Kivity @ 2010-09-13 17:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Anthony Liguori, Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On 09/13/2010 06:30 PM, Michael S. Tsirkin wrote:
> >Trouble is, each vhost-net device is associated with 1 tun/tap
> >device which means that each vhost-net device is associated with a
> >transmit and receive queue.
> >
> >I don't know if you'll always have an equal number of transmit and
> >receive queues but there's certainly a challenge in terms of
> >flexibility with this model.
> >
> >Regards,
> >
> >Anthony Liguori
> Not really, TX and RX can be mapped to different devices,
> or you can only map one of these. What is the trouble?
Suppose you have one multiqueue-capable ethernet card. How can you
connect it to multiple rx/tx queues?
tx is in principle doable, but what about rx?
What does "only map one of these" mean? Connect the device with one
queue (presumably rx), and terminate the others?
Will packet classification work (does the current multiqueue proposal
support it)?
--
error compiling committee.c: too many arguments to function
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 16:30 ` Michael S. Tsirkin
2010-09-13 17:00 ` Avi Kivity
@ 2010-09-13 17:40 ` Anthony Liguori
2010-09-15 5:40 ` Michael S. Tsirkin
1 sibling, 1 reply; 43+ messages in thread
From: Anthony Liguori @ 2010-09-13 17:40 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On 09/13/2010 11:30 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
>
>> On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
>>
>>> On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
>>>
>>>> "Michael S. Tsirkin"<mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
>>>>
>>>>
>>>>> "Michael S. Tsirkin"<mst@redhat.com>
>>>>> 09/12/2010 05:16 PM
>>>>>
>>>>> On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
>>>>>
>>>>>> Unfortunately I need a
>>>>>> constant in vhost for now.
>>>>>>
>>>>> Maybe not even that: you create multiple vhost-net
>>>>> devices so vhost-net in kernel does not care about these
>>>>> either, right? So this can be just part of vhost_net.h
>>>>> in qemu.
>>>>>
>>>> Sorry, I didn't understand what you meant.
>>>>
>>>> I can remove all socks[] arrays/constants by pre-allocating
>>>> sockets in vhost_setup_vqs. Then I can remove all "socks"
>>>> parameters in vhost_net_stop, vhost_net_release and
>>>> vhost_net_reset_owner.
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>>
>>>> - KK
>>>>
>>> Here's what I mean: each vhost device includes 1 TX
>>> and 1 RX VQ. Instead of teaching vhost about multiqueue,
>>> we could simply open /dev/vhost-net multiple times.
>>> How many times would be up to qemu.
>>>
>> Trouble is, each vhost-net device is associated with 1 tun/tap
>> device which means that each vhost-net device is associated with a
>> transmit and receive queue.
>>
>> I don't know if you'll always have an equal number of transmit and
>> receive queues but there's certainly challenge in terms of
>> flexibility with this model.
>>
>> Regards,
>>
>> Anthony Liguori
>>
> Not really, TX and RX can be mapped to different devices,
>
It's just a little odd. Would you bond multiple tun/tap devices to
achieve multi-queue TX? For RX, do you somehow limit RX to only one of
those devices?
If we were doing this in QEMU (and btw, there need to be userspace
patches before we implement this on the kernel side), I think it would
make more sense to just rely on doing a multithreaded write to a single
tun/tap device and then to hope that it can be made smarter at the
macvtap layer.
Regards,
Anthony Liguori
> or you can only map one of these. What is the trouble?
> What other features would you desire in terms of flexibility?
>
>
* Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
2010-09-13 16:23 ` Krishna Kumar2
@ 2010-09-15 5:33 ` Michael S. Tsirkin
0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15 5:33 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: anthony, avi, davem, kvm, netdev, rusty, rick.jones2
On Mon, Sep 13, 2010 at 09:53:40PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 09/13/2010 05:20:55 PM:
>
> > > Results with the original kernel:
> > > _____________________________
> > > # BW SD RSD
> > > ______________________________
> > > 1 20903 1 6
> > > 2 21963 6 25
> > > 4 22042 23 102
> > > 8 21674 97 419
> > > 16 22281 379 1663
> > > 24 22521 857 3748
> > > 32 22976 1528 6594
> > > 40 23197 2390 10239
> > > 48 22973 3542 15074
> > > 64 23809 6486 27244
> > > 80 23564 10169 43118
> > > 96 22977 14954 62948
> > > 128 23649 27067 113892
> > > ________________________________
> > >
> > > With a higher number of threads running in parallel, SD
> > > increased. In this case most threads run in parallel
> > > only till __dev_xmit_skb (#numtxqs=1). With the mq TX patch,
> > > a higher number of threads run in parallel through
> > > ndo_start_xmit. I *think* the increase in SD has to do
> > > with the higher # of threads running through a larger code path.
> > > From the numbers I posted with the patch (cut-n-paste
> > > only the % parts), BW increased much more than the SD,
> > > sometimes more than twice the increase in SD.
> >
> > Service demand is BW/CPU, right? So if BW goes up by 50%
> > and SD by 40%, this means that CPU more than doubled.
>
> I think the SD calculation might be more complicated;
> it seems to be based on adding up averages sampled
> and stored during the run. But I still don't see how CPU
> can double? e.g.:
> BW: 1000 -> 1500 (50%)
> SD: 100 -> 140 (40%)
> CPU: 10 -> 10.71 (7.1%)
Hmm. Time to look at the source. Which netperf version did you use?
> > > N# BW% SD% RSD%
> > > 4 54.30 40.00 -1.16
> > > 8 71.79 46.59 -2.68
> > > 16 71.89 50.40 -2.50
> > > 32 72.24 34.26 -14.52
> > > 48 70.10 31.51 -14.35
> > > 64 69.01 38.81 -9.66
> > > 96 70.68 71.26 10.74
> > >
> > > I also think SD calculation gets skewed for guest->local
> > > host testing.
> >
> > If it's broken, let's fix it?
> >
> > > For this test, I ran a guest with numtxqs=16.
> > > The first result below is with my patch, which creates 16
> > > vhosts. The second result is with a modified patch which
> > > creates only 2 vhosts (testing with #netperfs = 64):
> >
> > My guess is it's not a good idea to have more TX VQs than guest CPUs.
>
> Definitely, I will try to run tomorrow with more reasonable
> values, also will test with my second version of the patch
> that creates restricted number of vhosts and post results.
>
> > I realize for management it's easier to pass in a single vhost fd, but
> > just for testing it's probably easier to add code in userspace to open
> > /dev/vhost multiple times.
> >
> > >
> > > #vhosts BW% SD% RSD%
> > > 16 20.79 186.01 149.74
> > > 2 30.89 34.55 18.44
> > >
> > > The remote SD increases with the number of vhost threads,
> > > but that number seems to correlate with guest SD. So though
> > > BW% increased slightly from 20% to 30%, SD fell drastically
> > > from 186% to 34%. I think it could be a calculation skew
> > > with host SD, which also fell from 150% to 18%.
> >
> > I think by default netperf looks in /proc/stat for CPU utilization data:
> > so host CPU utilization will include the guest CPU, I think?
>
> It appears that way to me too, but the data above seems to
> suggest the opposite...
>
> > I would go further and claim that for host/guest TCP
> > CPU utilization and SD should always be identical.
> > Makes sense?
>
> It makes sense to me, but once again I am not sure how SD
> is really done, or whether it is linear to CPU. Cc'ing Rick
> in case he can comment....
Me neither. I should rephrase: I think we should
always use host CPU utilization.
> >
> > >
> > > I am planning to submit 2nd patch rev with restricted
> > > number of vhosts.
> > >
> > > > > Likely cause for the 1 stream degradation with multiple
> > > > > vhost patch:
> > > > >
> > > > > 1. Two vhosts run handling the RX and TX respectively.
> > > > > I think the issue is related to cache ping-pong esp
> > > > > since these run on different cpus/sockets.
> > > >
> > > > Right. With TCP I think we are better off handling
> > > > TX and RX for a socket by the same vhost, so that
> > > > packet and its ack are handled by the same thread.
> > > > Is this what happens with RX multiqueue patch?
> > > > How do we select an RX queue to put the packet on?
> > >
> > > My (unsubmitted) RX patch doesn't do this yet, that is
> > > something I will check.
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > You'll want to work on top of net-next, I think there's
> > RX flow filtering work going on there.
>
> Thanks Michael, I will follow up on that for the RX patch,
> plus your suggestion on tying RX with TX.
>
> Thanks,
>
> - KK
>
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 17:00 ` Avi Kivity
@ 2010-09-15 5:35 ` Michael S. Tsirkin
0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15 5:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Anthony Liguori, Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On Mon, Sep 13, 2010 at 07:00:51PM +0200, Avi Kivity wrote:
> On 09/13/2010 06:30 PM, Michael S. Tsirkin wrote:
> >Trouble is, each vhost-net device is associated with 1 tun/tap
> >device which means that each vhost-net device is associated with a
> >transmit and receive queue.
> >
> >I don't know if you'll always have an equal number of transmit and
> >receive queues but there's certainly challenge in terms of
> >flexibility with this model.
> >
> >Regards,
> >
> >Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
> >or you can only map one of these. What is the trouble?
>
> Suppose you have one multiqueue-capable ethernet card. How can you
> connect it to multiple rx/tx queues?
> tx is in principle doable, but what about rx?
>
> What does "only map one of these" mean? Connect the device with one
> queue (presumably rx), and terminate the others?
>
>
> Will packet classification work (does the current multiqueue
> proposal support it)?
>
This is a non-trivial problem, but
it needs to be handled in tap, not in vhost-net.
If tap gives you multiple queues, vhost-net will happily
let you connect vqs to these.
>
> --
> error compiling committee.c: too many arguments to function
>
* Re: [RFC PATCH 1/4] Add a new API to virtio-pci
2010-09-13 17:40 ` Anthony Liguori
@ 2010-09-15 5:40 ` Michael S. Tsirkin
0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2010-09-15 5:40 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Krishna Kumar2, davem, kvm, netdev, Rusty Russell
On Mon, Sep 13, 2010 at 12:40:11PM -0500, Anthony Liguori wrote:
> On 09/13/2010 11:30 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
> >>On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> >>>>"Michael S. Tsirkin"<mst@redhat.com> wrote on 09/12/2010 05:16:37 PM:
> >>>>
> >>>>>"Michael S. Tsirkin"<mst@redhat.com>
> >>>>>09/12/2010 05:16 PM
> >>>>>
> >>>>>On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> >>>>>>Unfortunately I need a
> >>>>>>constant in vhost for now.
> >>>>>Maybe not even that: you create multiple vhost-net
> >>>>>devices so vhost-net in kernel does not care about these
> >>>>>either, right? So this can be just part of vhost_net.h
> >>>>>in qemu.
> >>>>Sorry, I didn't understand what you meant.
> >>>>
> >>>>I can remove all socks[] arrays/constants by pre-allocating
> >>>>sockets in vhost_setup_vqs. Then I can remove all "socks"
> >>>>parameters in vhost_net_stop, vhost_net_release and
> >>>>vhost_net_reset_owner.
> >>>>
> >>>>Does this make sense?
> >>>>
> >>>>Thanks,
> >>>>
> >>>>- KK
> >>>Here's what I mean: each vhost device includes 1 TX
> >>>and 1 RX VQ. Instead of teaching vhost about multiqueue,
> >>>we could simply open /dev/vhost-net multiple times.
> >>>How many times would be up to qemu.
> >>Trouble is, each vhost-net device is associated with 1 tun/tap
> >>device which means that each vhost-net device is associated with a
> >>transmit and receive queue.
> >>
> >>I don't know if you'll always have an equal number of transmit and
> >>receive queues but there's certainly challenge in terms of
> >>flexibility with this model.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
>
> It's just a little odd. Would you bond multiple tun tap devices to
> achieve multi-queue TX? For RX, do you somehow limit RX to only one
> of those devices?
Exactly in the way the patches we discuss here do this:
we already have a per-queue fd.
> If we were doing this in QEMU (and btw, there needs to be userspace
> patches before we implement this in the kernel side),
I agree that feature parity is nice to have, but
I don't see a huge problem with (hopefully temporarily) only
supporting feature X with kernel acceleration, BTW.
This is already the case with checksum offloading features.
> I think it
> would make more sense to just rely on doing a multithreaded write to
> a single tun/tap device and then to hope that it can be made smarter
> at the macvtap layer.
No, an fd serializes access, so you need separate fds for multithreaded
writes to work. Think about how e.g. select will work.
> Regards,
>
> Anthony Liguori
>
> >or you can only map one of these. What is the trouble?
> >What other features would you desire in terms of flexibility?
> >
>
end of thread, other threads:[~2010-09-15 5:46 UTC | newest]
Thread overview: 43+ messages
2010-09-08 7:28 [RFC PATCH 0/4] Implement multiqueue virtio-net Krishna Kumar
2010-09-08 7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
2010-09-09 3:49 ` Rusty Russell
2010-09-09 5:23 ` Krishna Kumar2
2010-09-09 12:14 ` Rusty Russell
2010-09-09 13:49 ` Krishna Kumar2
2010-09-10 3:33 ` Rusty Russell
2010-09-12 11:46 ` Michael S. Tsirkin
2010-09-13 4:20 ` Krishna Kumar2
2010-09-13 9:04 ` Michael S. Tsirkin
2010-09-13 15:59 ` Anthony Liguori
2010-09-13 16:30 ` Michael S. Tsirkin
2010-09-13 17:00 ` Avi Kivity
2010-09-15 5:35 ` Michael S. Tsirkin
2010-09-13 17:40 ` Anthony Liguori
2010-09-15 5:40 ` Michael S. Tsirkin
2010-09-08 7:29 ` [RFC PATCH 2/4] Changes for virtio-net Krishna Kumar
2010-09-08 7:29 ` [RFC PATCH 3/4] Changes for vhost Krishna Kumar
2010-09-08 7:29 ` [RFC PATCH 4/4] qemu changes Krishna Kumar
2010-09-08 7:47 ` [RFC PATCH 0/4] Implement multiqueue virtio-net Avi Kivity
2010-09-08 9:22 ` Krishna Kumar2
2010-09-08 9:28 ` Avi Kivity
2010-09-08 10:17 ` Krishna Kumar2
2010-09-08 14:12 ` Arnd Bergmann
2010-09-08 16:47 ` Krishna Kumar2
2010-09-09 10:40 ` Arnd Bergmann
2010-09-09 13:19 ` Krishna Kumar2
2010-09-08 8:10 ` Michael S. Tsirkin
2010-09-08 9:23 ` Krishna Kumar2
2010-09-08 10:48 ` Michael S. Tsirkin
2010-09-08 12:19 ` Krishna Kumar2
2010-09-08 16:47 ` Krishna Kumar2
[not found] ` <OF70542242.6CAA236A-ON65257798.0044A4E0-65257798.005C0E7C@LocalDomain>
2010-09-09 9:45 ` Krishna Kumar2
2010-09-09 23:00 ` Sridhar Samudrala
2010-09-10 5:19 ` Krishna Kumar2
2010-09-12 11:40 ` Michael S. Tsirkin
2010-09-13 4:12 ` Krishna Kumar2
2010-09-13 11:50 ` Michael S. Tsirkin
2010-09-13 16:23 ` Krishna Kumar2
2010-09-15 5:33 ` Michael S. Tsirkin
[not found] ` <OF8043B2B7.7048D739-ON65257799.0021A2EE-65257799.00356B3E@LocalDomain>
2010-09-09 13:18 ` Krishna Kumar2
2010-09-08 8:13 ` Michael S. Tsirkin
2010-09-08 9:28 ` Krishna Kumar2