[RFC PATCH 0/4] Implement multiqueue virtio-net

* [RFC PATCH 0/4] Implement multiqueue virtio-net
@ 2010-09-08  7:28 Krishna Kumar
  2010-09-08  7:29 ` [RFC PATCH 1/4] Add a new API to virtio-pci Krishna Kumar
                   ` (6 more replies)
  0 siblings, 7 replies; 43+ messages in thread
From: Krishna Kumar @ 2010-09-08  7:28 UTC (permalink / raw)
  To: rusty, davem; +Cc: netdev, kvm, anthony, Krishna Kumar, mst

Following patches implement Transmit mq in virtio-net.  Also
included is the user qemu changes.

1. This feature was first implemented with a single vhost.
   Testing showed 3-8% performance gain for upto 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, implementing per-txq vhost improved
   BW significantly all the way to 128 sessions.
2. For this mq TX patch, 1 daemon is created for RX and 'n'
   daemons for the 'n' TXQ's, for a total of (n+1) daemons.
   The (subsequent) RX mq patch changes that to a total of
   'n' daemons, where RX and TX vq's share 1 daemon.
3. Service Demand increases for TCP, but significantly
   improves for UDP.
4. Interoperability: Many combinations, but not all, of
   qemu, host, guest tested together.

                  Enabling mq on virtio:
                  -----------------------

When following options are passed to qemu:
        - smp > 1
        - vhost=on
        - mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using
an optional 'numtxqs' option. e.g.  for a smp=4 guest:
        vhost=on,mq=on             ->   #txqueues = 4
        vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
        vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2

                   Performance (guest -> local host):
                   -----------------------------------

System configuration:
        Host:  8 Intel Xeon, 8 GB memory
        Guest: 4 cpus, 2 GB memory
All testing without any tuning, and TCP netperf with 64K I/O
_______________________________________________________________________________
                           TCP (#numtxqs=2)
N#      BW1     BW2    (%)      SD1     SD2    (%)      RSD1    RSD2    (%)
_______________________________________________________________________________
4       26387   40716 (54.30)   20      28   (40.00)    86i     85     (-1.16)
8       24356   41843 (71.79)   88      129  (46.59)    372     362    (-2.68)
16      23587   40546 (71.89)   375     564  (50.40)    1558    1519   (-2.50)
32      22927   39490 (72.24)   1617    2171 (34.26)    6694    5722   (-14.52)
48      23067   39238 (70.10)   3931    5170 (31.51)    15823   13552  (-14.35)
64      22927   38750 (69.01)   7142    9914 (38.81)    28972   26173  (-9.66)
96      22568   38520 (70.68)   16258   27844 (71.26)   65944   73031  (10.74)
_______________________________________________________________________________
                       UDP (#numtxqs=8)
N#      BW1     BW2   (%)      SD1     SD2   (%)
__________________________________________________________
4       29836   56761 (90.24)   67      63    (-5.97)
8       27666   63767 (130.48)  326     265   (-18.71)
16      25452   60665 (138.35)  1396    1269  (-9.09)
32      26172   63491 (142.59)  5617    4202  (-25.19)
48      26146   64629 (147.18)  12813   9316  (-27.29)
64      25575   65448 (155.90)  23063   16346 (-29.12)
128     26454   63772 (141.06)  91054   85051 (-6.59)
__________________________________________________________
N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
              SD for original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
              SD for new code. e.g. BW2=40716 means average BW2 was
              20358 mbps.

                       Next steps:
                       -----------

1. mq RX patch is also complete - plan to submit once TX is OK.
2. Cache-align data structures: I didn't see any BW/SD improvement
   after making the sq's (and similarly for vhost) cache-aligned
   statically:
        struct virtnet_info {
                ...
                struct send_queue sq[16] ____cacheline_aligned_in_smp;
                ...
        };

Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts 
      CPU0     CPU1     CPU2    CPU3       
40:   0        0        0       0        PCI-MSI-edge  virtio0-config
41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

Review/feedback appreciated.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---

^ permalink raw reply	[flat|nested] 43+ messages in thread