linux-kernel.vger.kernel.org archive mirror
* [net-next RFC V5 0/5] Multiqueue virtio-net
@ 2012-07-05 10:29 Jason Wang
  2012-07-05 10:29 ` [net-next RFC V5 1/5] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE Jason Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

Hello All:

This series is an updated version of the multiqueue virtio-net driver based on
Krishna Kumar's work, which lets virtio-net use multiple rx/tx queues for
packet reception and transmission. Please review and comment.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores, 2 NUMA nodes
- Two directly connected Intel 82599 NICs

Test Summary:

- Highlights: huge improvements on the TCP_RR test
- Lowlights: regression on small packet transmission and higher cpu utilization
             than single queue; needs further optimization

Analysis of the performance result:

- I counted the number of packets sent/received during the tests; multiqueue
  shows much greater capability in terms of packets per second.

- For the tx regression: multiqueue sends about 1-2 times more packets than
  single queue does, and the packet sizes are much smaller. I suspect tcp does
  less batching with multiqueue, so I hacked tcp_write_xmit() to force more
  batching; with that, multiqueue works as well as single queue for both small
  packet transmission and throughput.

- I didn't include accelerated RFS with virtio-net in this series as it still
  needs further shaping; those interested please see:
  http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html

Changes from V4:
- Add ability to negotiate the number of queues through control virtqueue
- Ethtool -{L|l} support and default the tx/rx queue number to 1
- Expose the API to set irq affinity instead of irq itself

Changes from V3:

- Rebase to the net-next
- Let queue 2 be the control virtqueue to obey the spec
- Provide irq affinity
- Choose txq based on processor id

References:

- V4: https://lkml.org/lkml/2012/6/25/120
- V3: http://lwn.net/Articles/467283/

Test result:

1) 1 VM, 2 vcpus, 1q vs 2q (columns: 1 = 1q, 2 = 2q), no pinning

- Guest to External Host TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 650.55 655.61 100% 24.88 24.86 99%
2 64 1446.81 1309.44 90% 30.49 27.16 89%
4 64 1430.52 1305.59 91% 30.78 26.80 87%
8 64 1450.89 1270.82 87% 30.83 25.95 84%
1 256 1699.45 1779.58 104% 56.75 59.08 104%
2 256 4902.71 3446.59 70% 98.53 62.78 63%
4 256 4803.76 2980.76 62% 97.44 54.68 56%
8 256 5128.88 3158.74 61% 104.68 58.61 55%
1 512 2837.98 2838.42 100% 89.76 90.41 100%
2 512 6742.59 5495.83 81% 155.03 99.07 63%
4 512 9193.70 5900.17 64% 202.84 106.44 52%
8 512 9287.51 7107.79 76% 202.18 129.08 63%
1 1024 4166.42 4224.98 101% 128.55 129.86 101%
2 1024 6196.94 7823.08 126% 181.80 168.81 92%
4 1024 9113.62 9219.49 101% 235.15 190.93 81%
8 1024 9324.25 9402.66 100% 239.10 179.99 75%
1 2048 7441.63 6534.04 87% 248.01 215.63 86%
2 2048 7024.61 7414.90 105% 225.79 219.62 97%
4 2048 8971.49 9269.00 103% 278.94 220.84 79%
8 2048 9314.20 9359.96 100% 268.36 192.23 71%
1 4096 8282.60 8990.08 108% 277.45 320.05 115%
2 4096 9194.80 9293.78 101% 317.02 248.76 78%
4 4096 9340.73 9313.19 99% 300.34 230.35 76%
8 4096 9148.23 9347.95 102% 279.49 199.43 71%
1 16384 8787.89 8766.31 99% 312.38 316.53 101%
2 16384 9306.35 9156.14 98% 319.53 279.83 87%
4 16384 9177.81 9307.50 101% 312.69 230.07 73%
8 16384 9035.82 9188.00 101% 298.32 199.17 66%
- TCP RR
sessions size throughput1 throughput2 %    norm1 norm2 %
50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
100 1 60141.88 88598.94 147% 2157.90 2000.45 92%
250 1 74763.56 135584.22 181% 2541.94 2628.59 103%
50 64 51628.38 82867.50 160% 1872.55 1812.16 96%
100 64 60367.73 84080.60 139% 2215.69 1867.69 84%
250 64 68502.70 124910.59 182% 2321.43 2495.76 107%
50 128 53477.08 77625.07 145% 1905.10 1870.99 98%
100 128 59697.56 74902.37 125% 2230.66 1751.03 78%
250 128 71248.74 133963.55 188% 2453.12 2711.72 110%
50 256 47663.86 67742.63 142% 1880.45 1735.30 92%
100 256 54051.84 68738.57 127% 2123.03 1778.59 83%
250 256 68250.06 124487.90 182% 2321.89 2598.60 111%
- External Host to Guest TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 847.71 864.83 102% 57.99 57.93 99%
2 64 1690.82 1544.94 91% 80.13 55.09 68%
4 64 3434.98 3455.53 100% 127.17 89.00 69%
8 64 5890.19 6557.35 111% 194.70 146.52 75%
1 256 2094.04 2109.14 100% 130.73 127.14 97%
2 256 5218.13 3731.97 71% 219.15 114.02 52%
4 256 6734.51 9213.47 136% 227.87 208.31 91%
8 256 6452.86 9402.78 145% 224.83 207.77 92%
1 512 3945.07 4203.68 106% 279.72 273.30 97%
2 512 7878.96 8122.55 103% 278.25 231.71 83%
4 512 7645.89 9402.13 122% 252.10 217.42 86%
8 512 6657.06 9403.71 141% 239.81 214.89 89%
1 1024 5729.06 5111.21 89% 289.38 303.09 104%
2 1024 8097.27 8159.67 100% 269.29 242.97 90%
4 1024 7778.93 8919.02 114% 261.28 205.50 78%
8 1024 6458.02 9360.02 144% 221.26 208.09 94%
1 2048 6426.94 5195.59 80% 292.52 307.47 105%
2 2048 8221.90 9025.66 109% 283.80 242.25 85%
4 2048 7364.72 8527.79 115% 248.10 198.36 79%
8 2048 6760.63 9161.07 135% 230.53 205.12 88%
1 4096 7247.02 6874.21 94% 276.23 287.68 104%
2 4096 8346.04 8818.65 105% 281.49 254.81 90%
4 4096 6710.00 9354.59 139% 216.41 210.13 97%
8 4096 6265.69 9406.87 150% 206.69 210.92 102%
1 16384 8159.50 8048.79 98% 266.94 283.11 106%
2 16384 8525.66 8552.41 100% 294.36 239.27 81%
4 16384 6042.24 8447.86 139% 200.21 196.40 98%
8 16384 6432.63 9403.49 146% 211.48 206.13 97%

2) 1 VM, 4 vcpus, 1q vs 4q (columns: 1 = 1q, 2 = 4q), no pinning

- Guest to External Host TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 636.93 657.69 103% 23.55 24.42 103%
2 64 1457.46 1268.78 87% 30.97 26.02 84%
4 64 3062.86 2302.43 75% 41.00 29.64 72%
8 64 3107.68 2308.32 74% 41.62 29.07 69%
1 256 1743.50 1750.11 100% 59.00 56.63 95%
2 256 4582.61 2870.31 62% 92.47 51.97 56%
4 256 8440.96 4795.37 56% 135.10 56.39 41%
8 256 9240.31 6654.82 72% 144.76 74.89 51%
1 512 2918.25 2735.26 93% 91.08 86.47 94%
2 512 8978.32 5107.95 56% 200.00 94.97 47%
4 512 8850.39 6864.37 77% 190.32 101.09 53%
8 512 9270.30 8483.01 91% 193.44 118.73 61%
1 1024 4416.10 3679.70 83% 135.54 110.63 81%
2 1024 9085.20 8770.48 96% 242.23 175.59 72%
4 1024 9158.57 9011.56 98% 234.39 159.17 67%
8 1024 9345.89 9067.43 97% 233.35 138.73 59%
1 2048 8455.19 6077.94 71% 338.52 190.16 56%
2 2048 9223.32 8237.73 89% 270.00 198.27 73%
4 2048 9080.75 9257.63 101% 261.30 172.80 66%
8 2048 9177.39 8977.10 97% 256.89 147.50 57%
1 4096 8665.35 8394.78 96% 289.63 289.85 100%
2 4096 7850.73 8857.86 112% 253.33 252.62 99%
4 4096 9332.55 8508.37 91% 289.19 151.29 52%
8 4096 8482.30 9146.80 107% 255.41 156.02 61%
1 16384 8825.72 8778.26 99% 314.60 308.89 98%
2 16384 9283.85 8927.40 96% 316.48 246.98 78%
4 16384 7766.95 8708.06 112% 265.25 155.59 58%
8 16384 8945.55 8940.23 99% 298.45 151.32 50%
- TCP RR
sessions size throughput1 throughput2 %    norm1 norm2 %
50 1 60848.70 81719.39 134% 2196.86 1551.05 70%
100 1 61886.19 81425.02 131% 2215.76 1517.52 68%
250 1 72058.41 162597.84 225% 2441.84 2278.14 93%
50 64 51646.93 74160.10 143% 1861.07 1322.22 71%
100 64 57574.86 83488.26 145% 2076.54 1479.79 71%
250 64 67583.35 138482.15 204% 2314.46 2022.83 87%
50 128 59931.51 71633.03 119% 2244.60 1309.18 58%
100 128 58329.80 73104.90 125% 2202.98 1329.52 60%
250 128 71021.55 161067.73 226% 2469.11 2205.28 89%
50 256 47509.24 64330.24 135% 1915.75 1269.90 66%
100 256 49293.03 68507.94 138% 1939.75 1263.64 65%
250 256 63169.07 138390.68 219% 2255.47 2098.13 93%
- External Host to Guest TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 850.18 854.96 100% 56.94 58.25 102%
2 64 1659.12 1730.25 104% 81.65 67.57 82%
4 64 3254.70 3397.17 104% 118.57 76.21 64%
8 64 6251.97 6389.29 102% 207.68 104.21 50%
1 256 2029.14 2105.18 103% 116.45 119.69 102%
2 256 5412.02 4260.32 78% 240.87 139.73 58%
4 256 7777.28 8743.12 112% 263.20 174.65 66%
8 256 6459.51 9388.93 145% 218.94 158.37 72%
1 512 4566.31 4269.30 93% 274.74 289.83 105%
2 512 7444.52 8240.64 110% 286.24 243.74 85%
4 512 7722.29 9391.16 121% 261.96 180.36 68%
8 512 6228.50 9134.52 146% 209.17 161.00 76%
1 1024 4965.50 4953.68 99% 307.64 280.48 91%
2 1024 8270.08 7733.71 93% 288.32 197.04 68%
4 1024 7551.04 9394.58 124% 268.41 206.62 76%
8 1024 6307.78 9179.03 145% 216.67 159.63 73%
1 2048 5741.12 5948.80 103% 290.34 268.66 92%
2 2048 7932.79 8766.05 110% 262.96 215.90 82%
4 2048 6907.55 9255.97 133% 233.56 203.96 87%
8 2048 6037.22 9399.41 155% 197.14 164.09 83%
1 4096 7131.70 7535.10 105% 279.43 275.12 98%
2 4096 8109.17 9348.04 115% 274.29 211.49 77%
4 4096 6878.92 9319.13 135% 244.21 192.06 78%
8 4096 6265.92 9408.35 150% 211.85 159.26 75%
1 16384 8288.01 8596.39 103% 272.85 290.22 106%
2 16384 8166.29 9280.12 113% 277.04 236.61 85%
4 16384 6446.97 9382.22 145% 222.91 187.24 83%
8 16384 6066.98 9405.51 155% 198.98 157.09 78%

3) 2 VMs, each with 2 vcpus, 1q vs 2q (columns: 1 = 1q, 2 = 2q), vhost/vcpu pinned in the same NUMA node

- 2 Guests to External Hosts TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 1442.07 1475.11 102% 30.82 31.21 101%
2 64 3124.87 2900.93 92% 40.29 35.95 89%
4 64 3166.52 2864.04 90% 40.70 35.47 87%
8 64 3141.45 2848.94 90% 40.38 35.34 87%
1 256 3628.54 3711.73 102% 68.47 70.22 102%
2 256 7806.95 7586.69 97% 111.23 84.38 75%
4 256 8823.65 7612.74 86% 132.92 85.04 63%
8 256 9194.89 9373.41 101% 135.98 119.62 87%
1 512 7106.67 7128.00 100% 124.79 124.30 99%
2 512 9190.22 9397.33 102% 180.84 149.34 82%
4 512 9401.01 9376.67 99% 173.00 140.15 81%
8 512 8572.84 9032.90 105% 150.49 127.58 84%
1 1024 9361.93 9379.24 100% 205.81 202.94 98%
2 1024 9386.69 9389.04 100% 201.78 165.75 82%
4 1024 9403.43 9378.54 99% 195.33 152.06 77%
8 1024 9213.63 9180.64 99% 178.99 141.51 79%
1 2048 9338.95 9384.67 100% 223.22 227.86 102%
2 2048 9389.28 9389.45 100% 202.37 170.08 84%
4 2048 9405.86 9388.71 99% 193.76 161.54 83%
8 2048 9352.40 9384.06 100% 189.16 157.06 83%
1 4096 9380.74 9384.90 100% 239.37 241.56 100%
2 4096 9393.47 9376.74 99% 213.84 195.61 91%
4 4096 9393.85 9381.50 99% 198.06 170.18 85%
8 4096 9400.41 9232.31 98% 192.87 163.56 84%
1 16384 9348.18 9335.55 99% 253.02 254.86 100%
2 16384 9384.97 9359.53 99% 218.56 208.59 95%
4 16384 9326.60 9382.15 100% 206.24 179.72 87%
8 16384 9355.82 9392.85 100% 198.22 172.89 87%
- TCP RR
sessions size throughput1 throughput2 %    norm1 norm2 %
50 1 200340.33 261750.19 130% 2935.27 3018.59 102%
100 1 236141.58 266304.49 112% 3452.16 3071.74 88%
250 1 361574.59 320825.08 88% 4972.98 3705.70 74%
50 64 225748.53 242671.12 107% 3011.48 2869.07 95%
100 64 249885.37 260453.72 104% 3240.21 3063.67 94%
250 64 360341.12 310775.60 86% 4682.42 3657.91 78%
50 128 227995.27 289320.38 126% 2950.92 3479.37 117%
100 128 239491.11 291135.77 121% 3099.55 3508.75 113%
250 128 390390.68 362484.35 92% 5042.30 4368.52 86%
50 256 222604.51 317140.97 142% 3058.08 3839.39 125%
100 256 254770.92 335606.03 131% 3326.16 4046.65 121%
250 256 400584.52 436749.22 109% 5220.79 5278.86 101%
- External Host to 2 Guests
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 1667.99 1684.50 100% 59.66 60.77 101%
2 64 3338.83 3379.97 101% 83.61 64.82 77%
4 64 6613.65 6619.11 100% 131.00 97.19 74%
8 64 6553.07 6418.31 97% 141.35 98.27 69%
1 256 3938.40 4068.52 103% 125.21 123.76 98%
2 256 9215.57 9210.88 99% 185.31 154.27 83%
4 256 9407.29 9008.13 95% 186.72 150.01 80%
8 256 9377.17 9385.57 100% 190.28 137.59 72%
1 512 7360.19 6984.80 94% 214.09 211.66 98%
2 512 9392.91 9401.88 100% 193.92 173.11 89%
4 512 9382.64 9394.34 100% 189.27 145.80 77%
8 512 9308.60 9094.08 97% 189.70 141.26 74%
1 1024 9153.26 9066.06 99% 223.07 219.95 98%
2 1024 9393.38 9398.43 100% 194.02 173.82 89%
4 1024 9395.92 8960.73 95% 192.61 145.82 75%
8 1024 9388.92 9399.08 100% 191.18 143.87 75%
1 2048 9355.32 9240.63 98% 221.50 223.03 100%
2 2048 9395.68 9399.62 100% 193.31 177.21 91%
4 2048 9397.67 9399.56 100% 195.25 157.53 80%
8 2048 9397.89 9401.70 100% 197.57 146.96 74%
1 4096 9375.84 9381.72 100% 223.06 225.06 100%
2 4096 9389.47 9396.00 100% 193.91 197.13 101%
4 4096 9397.45 9400.11 100% 192.33 163.60 85%
8 4096 9105.40 9415.76 103% 192.71 140.41 72%
1 16384 9381.53 9381.40 99% 223.53 225.66 100%
2 16384 9387.90 9395.44 100% 193.34 177.03 91%
4 16384 9397.92 9410.98 100% 195.04 151.14 77%
8 16384 9259.00 9419.48 101% 194.91 153.48 78%

4) Local VM to VM, 2 vcpus, 1q vs 2q (columns: 1 = 1q, 2 = 2q), vcpu/thread pinned in the same NUMA node

- VM to VM TCP STREAM
sessions size throughput1 throughput2 %    norm1 norm2 %
1 64 576.05 576.14 100% 12.25 12.32 100%
2 64 1266.75 1160.04 91% 19.10 16.05 84%
4 64 1267.34 1123.70 88% 19.08 15.51 81%
8 64 1230.88 1174.70 95% 18.53 15.58 84%
1 256 1311.00 1303.02 99% 25.34 25.35 100%
2 256 5400.26 2794.00 51% 75.92 36.43 47%
4 256 5200.67 2818.88 54% 72.81 33.92 46%
8 256 5234.55 2893.74 55% 73.10 34.97 47%
1 512 3244.09 3263.72 100% 56.48 56.65 100%
2 512 8172.16 4661.15 57% 119.05 67.89 57%
4 512 10567.44 7063.25 66% 147.76 77.27 52%
8 512 10477.87 8471.33 80% 145.94 102.91 70%
1 1024 5432.54 5333.99 98% 93.69 92.38 98%
2 1024 12590.24 9259.97 73% 185.37 135.28 72%
4 1024 15600.53 10731.93 68% 222.20 123.60 55%
8 1024 16222.87 10704.85 65% 227.05 113.81 50%
1 2048 6667.61 7484.37 112% 116.75 129.72 111%
2 2048 8180.43 11500.88 140% 137.84 156.64 113%
4 2048 15127.93 14416.16 95% 227.60 154.59 67%
8 2048 16381.79 14794.10 90% 244.29 158.45 64%
1 4096 7375.63 8948.90 121% 131.97 156.57 118%
2 4096 9321.16 14443.21 154% 161.24 163.74 101%
4 4096 13028.45 15984.94 122% 212.78 171.26 80%
8 4096 15611.28 18810.54 120% 245.15 198.65 81%
1 16384 15304.38 14202.08 92% 259.94 244.04 93%
2 16384 15508.97 15913.09 102% 261.30 244.26 93%
4 16384 14859.98 20164.34 135% 248.29 214.26 86%
8 16384 15594.59 19960.99 127% 253.79 211.27 83%
- TCP RR
sessions size throughput1 throughput2 %    norm1 norm2 %
50 1 54972.51 69820.99 127% 1133.58 1063.58 93%
100 1 55847.16 72407.93 129% 1155.73 1024.35 88%
250 1 60066.23 108266.50 180% 1114.30 1323.55 118%
50 64 48727.63 62378.32 128% 1014.29 888.78 87%
100 64 51804.65 69250.51 133% 1077.78 986.97 91%
250 64 61278.68 100015.78 163% 1076.93 1243.18 115%
50 256 51593.29 62046.22 120% 1069.14 871.08 81%
100 256 51647.00 68197.43 132% 1071.66 958.51 89%
250 256 60433.88 99072.59 163% 1072.41 1199.10 111%
50 512 52177.79 66483.77 127% 1082.65 960.82 88%
100 512 50351.67 62537.63 124% 1041.61 876.41 84%
250 512 60510.14 103856.79 171% 1055.21 1245.17 118%


Jason Wang (4):
  virtio_ring: move queue_index to vring_virtqueue
  virtio: introduce an API to set affinity for a virtqueue
  virtio_net: multiqueue support
  virtio_net: support negotiating the number of queues through ctrl vq

Krishna Kumar (1):
  virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE

 drivers/net/virtio_net.c      |  792 +++++++++++++++++++++++++++++------------
 drivers/virtio/virtio_mmio.c  |    5 +-
 drivers/virtio/virtio_pci.c   |   58 +++-
 drivers/virtio/virtio_ring.c  |   17 +
 include/linux/virtio.h        |    4 +
 include/linux/virtio_config.h |   21 ++
 include/linux/virtio_net.h    |   10 +
 7 files changed, 677 insertions(+), 230 deletions(-)


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [net-next RFC V5 1/5] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
@ 2012-07-05 10:29 ` Jason Wang
  2012-07-05 10:29 ` [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue Jason Wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

From: Krishna Kumar <krkumar2@in.ibm.com>

Introduce VIRTIO_NET_F_MULTIQUEUE.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/virtio_net.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 2470f54..1bc7e30 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -51,6 +51,7 @@
 #define VIRTIO_NET_F_CTRL_RX_EXTRA 20	/* Extra RX mode control support */
 #define VIRTIO_NET_F_GUEST_ANNOUNCE 21	/* Guest can announce device on the
 					 * network */
+#define VIRTIO_NET_F_MULTIQUEUE	22	/* Device supports multiple TXQ/RXQ */
 
 #define VIRTIO_NET_S_LINK_UP	1	/* Link is up */
 #define VIRTIO_NET_S_ANNOUNCE	2	/* Announcement is needed */
-- 
1.7.1



* [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
  2012-07-05 10:29 ` [net-next RFC V5 1/5] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE Jason Wang
@ 2012-07-05 10:29 ` Jason Wang
  2012-07-05 11:40   ` Sasha Levin
  2012-07-05 10:29 ` [net-next RFC V5 3/5] virtio: introduce an API to set affinity for a virtqueue Jason Wang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

Instead of storing the queue index in the transport-specific vq_info
structures, this patch moves it to vring_virtqueue and introduces helpers to
set and get the value. This simplifies management and tracing.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_mmio.c |    5 +----
 drivers/virtio/virtio_pci.c  |   12 +++++-------
 drivers/virtio/virtio_ring.c |   17 +++++++++++++++++
 include/linux/virtio.h       |    4 ++++
 4 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index 453db0c..f5432b6 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -131,9 +131,6 @@ struct virtio_mmio_vq_info {
 	/* the number of entries in the queue */
 	unsigned int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -324,7 +321,6 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		err = -ENOMEM;
 		goto error_kmalloc;
 	}
-	info->queue_index = index;
 
 	/* Allocate pages for the queue - start with a queue as big as
 	 * possible (limited by maximum size allowed by device), drop down
@@ -363,6 +359,7 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
 		goto error_new_virtqueue;
 	}
 
+	virtqueue_set_queue_index(vq, index);
 	vq->priv = info;
 	info->vq = vq;
 
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 2e03d41..adb24f2 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -79,9 +79,6 @@ struct virtio_pci_vq_info
 	/* the number of entries in the queue */
 	int num;
 
-	/* the index of the queue */
-	int queue_index;
-
 	/* the virtual address of the ring queue */
 	void *queue;
 
@@ -202,11 +199,11 @@ static void vp_reset(struct virtio_device *vdev)
 static void vp_notify(struct virtqueue *vq)
 {
 	struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev);
-	struct virtio_pci_vq_info *info = vq->priv;
 
 	/* we write the queue's selector into the notification register to
 	 * signal the other end */
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
+	iowrite16(virtqueue_get_queue_index(vq),
+		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
 }
 
 /* Handle a configuration change: Tell driver if it wants to know. */
@@ -402,7 +399,6 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 	if (!info)
 		return ERR_PTR(-ENOMEM);
 
-	info->queue_index = index;
 	info->num = num;
 	info->msix_vector = msix_vec;
 
@@ -425,6 +421,7 @@ static struct virtqueue *setup_vq(struct virtio_device *vdev, unsigned index,
 		goto out_activate_queue;
 	}
 
+	virtqueue_set_queue_index(vq, index);
 	vq->priv = info;
 	info->vq = vq;
 
@@ -467,7 +464,8 @@ static void vp_del_vq(struct virtqueue *vq)
 	list_del(&info->node);
 	spin_unlock_irqrestore(&vp_dev->lock, flags);
 
-	iowrite16(info->queue_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
+	iowrite16(virtqueue_get_queue_index(vq),
+		vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
 
 	if (vp_dev->msix_enabled) {
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..9c5aeea 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -106,6 +106,9 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+	/* Index of the queue */
+	int queue_index;
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -171,6 +174,20 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+void virtqueue_set_queue_index(struct virtqueue *_vq, int queue_index)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	vq->queue_index = queue_index;
+}
+EXPORT_SYMBOL_GPL(virtqueue_set_queue_index);
+
+int virtqueue_get_queue_index(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	return vq->queue_index;
+}
+EXPORT_SYMBOL_GPL(virtqueue_get_queue_index);
+
 /**
  * virtqueue_add_buf - expose buffer to other end
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 8efd28a..0d8ed46 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -50,6 +50,10 @@ void *virtqueue_detach_unused_buf(struct virtqueue *vq);
 
 unsigned int virtqueue_get_vring_size(struct virtqueue *vq);
 
+void virtqueue_set_queue_index(struct virtqueue *vq, int queue_index);
+
+int virtqueue_get_queue_index(struct virtqueue *vq);
+
 /**
  * virtio_device - representation of a device using virtio
  * @index: unique position on the virtio bus
-- 
1.7.1



* [net-next RFC V5 3/5] virtio: introduce an API to set affinity for a virtqueue
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
  2012-07-05 10:29 ` [net-next RFC V5 1/5] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE Jason Wang
  2012-07-05 10:29 ` [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue Jason Wang
@ 2012-07-05 10:29 ` Jason Wang
  2012-07-27 14:38   ` Paolo Bonzini
  2012-08-09 15:13   ` Paolo Bonzini
  2012-07-05 10:29 ` [net-next RFC V5 4/5] virtio_net: multiqueue support Jason Wang
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

Sometimes a virtio device needs to configure an irq affinity hint to maximize
performance. Instead of just exposing the irq of a virtqueue, this patch
introduces an API to set the affinity for a virtqueue.

The API is best-effort: the affinity hint may not be set as expected due to
platform support, irq sharing or irq type. Currently only the pci method is
implemented, and we set the affinity according to:

- if the device uses INTx, we just ignore the request
- if the device has a per-vq vector, we force the affinity hint
- if the virtqueues share an MSI vector, we OR all requested affinities
  together

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_pci.c   |   46 +++++++++++++++++++++++++++++++++++++++++
 include/linux/virtio_config.h |   21 ++++++++++++++++++
 2 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index adb24f2..2ff0451 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -48,6 +48,7 @@ struct virtio_pci_device
 	int msix_enabled;
 	int intx_enabled;
 	struct msix_entry *msix_entries;
+	cpumask_var_t *msix_affinity_masks;
 	/* Name strings for interrupts. This size should be enough,
 	 * and I'm too lazy to allocate each name separately. */
 	char (*msix_names)[256];
@@ -276,6 +277,10 @@ static void vp_free_vectors(struct virtio_device *vdev)
 	for (i = 0; i < vp_dev->msix_used_vectors; ++i)
 		free_irq(vp_dev->msix_entries[i].vector, vp_dev);
 
+	for (i = 0; i < vp_dev->msix_vectors; i++)
+		if (vp_dev->msix_affinity_masks[i])
+			free_cpumask_var(vp_dev->msix_affinity_masks[i]);
+
 	if (vp_dev->msix_enabled) {
 		/* Disable the vector used for configuration */
 		iowrite16(VIRTIO_MSI_NO_VECTOR,
@@ -293,6 +298,8 @@ static void vp_free_vectors(struct virtio_device *vdev)
 	vp_dev->msix_names = NULL;
 	kfree(vp_dev->msix_entries);
 	vp_dev->msix_entries = NULL;
+	kfree(vp_dev->msix_affinity_masks);
+	vp_dev->msix_affinity_masks = NULL;
 }
 
 static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
@@ -311,6 +318,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 				     GFP_KERNEL);
 	if (!vp_dev->msix_names)
 		goto error;
+	vp_dev->msix_affinity_masks
+		= kzalloc(nvectors * sizeof *vp_dev->msix_affinity_masks,
+			  GFP_KERNEL);
+	if (!vp_dev->msix_affinity_masks)
+		goto error;
+	for (i = 0; i < nvectors; ++i)
+		if (!alloc_cpumask_var(&vp_dev->msix_affinity_masks[i],
+					GFP_KERNEL))
+			goto error;
 
 	for (i = 0; i < nvectors; ++i)
 		vp_dev->msix_entries[i].entry = i;
@@ -607,6 +623,35 @@ static const char *vp_bus_name(struct virtio_device *vdev)
 	return pci_name(vp_dev->pci_dev);
 }
 
+/* Setup the affinity for a virtqueue:
+ * - force the affinity for per vq vector
+ * - OR over all affinities for shared MSI
+ * - ignore the affinity request if we're using INTX
+ */
+static int vp_set_vq_affinity(struct virtqueue *vq, int cpu)
+{
+	struct virtio_device *vdev = vq->vdev;
+	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+	struct virtio_pci_vq_info *info = vq->priv;
+	struct cpumask *mask;
+	unsigned int irq;
+
+	if (!vq->callback)
+		return -EINVAL;
+
+	if (vp_dev->msix_enabled) {
+		mask = vp_dev->msix_affinity_masks[info->msix_vector];
+		irq = vp_dev->msix_entries[info->msix_vector].vector;
+		if (cpu == -1)
+			irq_set_affinity_hint(irq, NULL);
+		else {
+			cpumask_set_cpu(cpu, mask);
+			irq_set_affinity_hint(irq, mask);
+		}
+	}
+	return 0;
+}
+
 static struct virtio_config_ops virtio_pci_config_ops = {
 	.get		= vp_get,
 	.set		= vp_set,
@@ -618,6 +663,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
 	.get_features	= vp_get_features,
 	.finalize_features = vp_finalize_features,
 	.bus_name	= vp_bus_name,
+	.set_vq_affinity = vp_set_vq_affinity,
 };
 
 static void virtio_pci_release_dev(struct device *_d)
diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index fc457f4..2c4a989 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -98,6 +98,7 @@
  *	vdev: the virtio_device
  *      This returns a pointer to the bus name a la pci_name from which
  *      the caller can then copy.
+ * @set_vq_affinity: set the affinity for a virtqueue.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -116,6 +117,7 @@ struct virtio_config_ops {
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
 	const char *(*bus_name)(struct virtio_device *vdev);
+	int (*set_vq_affinity)(struct virtqueue *vq, int cpu);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -190,5 +192,24 @@ const char *virtio_bus_name(struct virtio_device *vdev)
 	return vdev->config->bus_name(vdev);
 }
 
+/**
+ * virtqueue_set_affinity - setting affinity for a virtqueue
+ * @vq: the virtqueue
+ * @cpu: the cpu no.
+ *
+ * Note that this function is best-effort: the affinity hint may not be set
+ * due to config support, irq type and sharing.
+ *
+ */
+static inline
+int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
+{
+	struct virtio_device *vdev = vq->vdev;
+	if (vdev->config->set_vq_affinity)
+		return vdev->config->set_vq_affinity(vq, cpu);
+	return 0;
+}
+
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_VIRTIO_CONFIG_H */
-- 
1.7.1



* [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
                   ` (2 preceding siblings ...)
  2012-07-05 10:29 ` [net-next RFC V5 3/5] virtio: introduce an API to set affinity for a virtqueue Jason Wang
@ 2012-07-05 10:29 ` Jason Wang
  2012-07-05 20:02   ` Amos Kong
  2012-07-20 13:40   ` Michael S. Tsirkin
  2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

This patch converts virtio_net to a multiqueue device. After negotiating the
VIRTIO_NET_F_MULTIQUEUE feature, the virtio device exposes multiple tx/rx
queue pairs, and the driver reads their number from the config space.

The driver expects the number of rx/tx queue pairs to equal the number of
vcpus. To maximize performance with these per-cpu rx/tx queue pairs, some
optimizations were introduced:

- Txq selection is based on the processor id, to avoid contending on a lock
  whose owner may have exited to the host.
- Since the rxq/txq pairs are per-cpu, the affinity hint is set to the cpu
  that owns the queue pair.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
 include/linux/virtio_net.h |    2 +
 2 files changed, 452 insertions(+), 195 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1db445b..7410187 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
+#include <linux/interrupt.h>
 
 static int napi_weight = 128;
 module_param(napi_weight, int, 0444);
@@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
+#define MAX_QUEUES 256
+
 struct virtnet_stats {
 	struct u64_stats_sync tx_syncp;
 	struct u64_stats_sync rx_syncp;
@@ -51,43 +54,69 @@ struct virtnet_stats {
 	u64 rx_packets;
 };
 
-struct virtnet_info {
-	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
-	struct net_device *dev;
+/* Internal representation of a send virtqueue */
+struct send_queue {
+	/* Virtqueue associated with this send _queue */
+	struct virtqueue *vq;
+
+	/* TX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+/* Internal representation of a receive virtqueue */
+struct receive_queue {
+	/* Virtqueue associated with this receive_queue */
+	struct virtqueue *vq;
+
+	/* Back pointer to the virtnet_info */
+	struct virtnet_info *vi;
+
 	struct napi_struct napi;
-	unsigned int status;
 
 	/* Number of input buffers, and max we've ever had. */
 	unsigned int num, max;
 
+	/* Work struct for refilling if we run low on memory. */
+	struct delayed_work refill;
+
+	/* Chain pages by the private ptr. */
+	struct page *pages;
+
+	/* RX: fragments + linear part + virtio header */
+	struct scatterlist sg[MAX_SKB_FRAGS + 2];
+};
+
+struct virtnet_info {
+	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
+
+	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
+	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
+	struct virtqueue *cvq;
+
+	struct virtio_device *vdev;
+	struct net_device *dev;
+	unsigned int status;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
 	/* Host will merge rx buffers for big packets (shake it! shake it!) */
 	bool mergeable_rx_bufs;
 
+	/* Has control virtqueue */
+	bool has_cvq;
+
 	/* enable config space updates */
 	bool config_enable;
 
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
-	/* Work struct for refilling if we run low on memory. */
-	struct delayed_work refill;
-
 	/* Work struct for config space updates */
 	struct work_struct config_work;
 
 	/* Lock for config space updates */
 	struct mutex config_lock;
-
-	/* Chain pages by the private ptr. */
-	struct page *pages;
-
-	/* fragments + linear part + virtio header */
-	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
-	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -108,6 +137,22 @@ struct padded_vnet_hdr {
 	char padding[6];
 };
 
+static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	int ret = virtqueue_get_queue_index(vq);
+
+	/* skip ctrl vq */
+	if (vi->has_cvq)
+		return (ret - 1) / 2;
+	else
+		return ret / 2;
+}
+
+static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
+{
+	return virtqueue_get_queue_index(vq) / 2;
+}
+
 static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
 {
 	return (struct skb_vnet_hdr *)skb->cb;
@@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct receive_queue *rq, struct page *page)
 {
 	struct page *end;
 
 	/* Find end of list, sew whole thing into vi->pages. */
 	for (end = page; end->private; end = (struct page *)end->private);
-	end->private = (unsigned long)vi->pages;
-	vi->pages = page;
+	end->private = (unsigned long)rq->pages;
+	rq->pages = page;
 }
 
-static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
+static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 {
-	struct page *p = vi->pages;
+	struct page *p = rq->pages;
 
 	if (p) {
-		vi->pages = (struct page *)p->private;
+		rq->pages = (struct page *)p->private;
 		/* clear private here, it is used to chain pages */
 		p->private = 0;
 	} else
@@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
 	return p;
 }
 
-static void skb_xmit_done(struct virtqueue *svq)
+static void skb_xmit_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = svq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
 
 	/* Suppress further interrupts. */
-	virtqueue_disable_cb(svq);
+	virtqueue_disable_cb(vq);
 
 	/* We were probably waiting for more output buffers. */
-	netif_wake_queue(vi->dev);
+	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
 }
 
 static void set_skb_frag(struct sk_buff *skb, struct page *page,
@@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
 }
 
 /* Called from bottom half context */
-static struct sk_buff *page_to_skb(struct virtnet_info *vi,
+static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int len)
 {
+	struct virtnet_info *vi = rq->vi;
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	unsigned int copy, hdr_len, offset;
@@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	}
 
 	if (page)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return skb;
 }
 
-static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
+static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	struct page *page;
@@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 			skb->dev->stats.rx_length_errors++;
 			return -EINVAL;
 		}
-		page = virtqueue_get_buf(vi->rvq, &len);
+		page = virtqueue_get_buf(rq->vq, &len);
 		if (!page) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 skb->dev->name, hdr->mhdr.num_buffers);
@@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
 
 		set_skb_frag(skb, page, 0, &len);
 
-		--vi->num;
+		--rq->num;
 	}
 	return 0;
 }
 
-static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
+static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 {
+	struct net_device *dev = rq->vi->dev;
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 	struct sk_buff *skb;
@@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
 		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
+			give_pages(rq, buf);
 		else
 			dev_kfree_skb(buf);
 		return;
@@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 		skb_trim(skb, len);
 	} else {
 		page = buf;
-		skb = page_to_skb(vi, page, len);
+		skb = page_to_skb(rq, page, len);
 		if (unlikely(!skb)) {
 			dev->stats.rx_dropped++;
-			give_pages(vi, page);
+			give_pages(rq, page);
 			return;
 		}
 		if (vi->mergeable_rx_bufs)
-			if (receive_mergeable(vi, skb)) {
+			if (receive_mergeable(rq, skb)) {
 				dev_kfree_skb(skb);
 				return;
 			}
@@ -363,90 +410,91 @@ frame_err:
 	dev_kfree_skb(skb);
 }
 
-static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
 	int err;
 
-	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
+	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
-	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
+	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
+
+	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
 
-	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
 	return err;
 }
 
-static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *first, *list = NULL;
 	char *p;
 	int i, err, offset;
 
-	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
+	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
-		first = get_a_page(vi, gfp);
+		first = get_a_page(rq, gfp);
 		if (!first) {
 			if (list)
-				give_pages(vi, list);
+				give_pages(rq, list);
 			return -ENOMEM;
 		}
-		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
+		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
 
 		/* chain new page in list head to match sg */
 		first->private = (unsigned long)list;
 		list = first;
 	}
 
-	first = get_a_page(vi, gfp);
+	first = get_a_page(rq, gfp);
 	if (!first) {
-		give_pages(vi, list);
+		give_pages(rq, list);
 		return -ENOMEM;
 	}
 	p = page_address(first);
 
-	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
-	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
-	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
+	/* rq->sg[0], rq->sg[1] share the same page */
+	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
+	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
 
-	/* vi->rx_sg[1] for data packet, from offset */
+	/* rq->sg[1] for data packet, from offset */
 	offset = sizeof(struct padded_vnet_hdr);
-	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
+	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
 				first, gfp);
 	if (err < 0)
-		give_pages(vi, first);
+		give_pages(rq, first);
 
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	struct page *page;
 	int err;
 
-	page = get_a_page(vi, gfp);
+	page = get_a_page(rq, gfp);
 	if (!page)
 		return -ENOMEM;
 
-	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
+	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
 	if (err < 0)
-		give_pages(vi, page);
+		give_pages(rq, page);
 
 	return err;
 }
@@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
  * before we're receiving packets, or from refill_work which is
  * careful to disable receiving (using napi_disable).
  */
-static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
+static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 {
+	struct virtnet_info *vi = rq->vi;
 	int err;
 	bool oom;
 
 	do {
 		if (vi->mergeable_rx_bufs)
-			err = add_recvbuf_mergeable(vi, gfp);
+			err = add_recvbuf_mergeable(rq, gfp);
 		else if (vi->big_packets)
-			err = add_recvbuf_big(vi, gfp);
+			err = add_recvbuf_big(rq, gfp);
 		else
-			err = add_recvbuf_small(vi, gfp);
+			err = add_recvbuf_small(rq, gfp);
 
 		oom = err == -ENOMEM;
 		if (err < 0)
 			break;
-		++vi->num;
+		++rq->num;
 	} while (err > 0);
-	if (unlikely(vi->num > vi->max))
-		vi->max = vi->num;
-	virtqueue_kick(vi->rvq);
+	if (unlikely(rq->num > rq->max))
+		rq->max = rq->num;
+	virtqueue_kick(rq->vq);
 	return !oom;
 }
 
-static void skb_recv_done(struct virtqueue *rvq)
+static void skb_recv_done(struct virtqueue *vq)
 {
-	struct virtnet_info *vi = rvq->vdev->priv;
+	struct virtnet_info *vi = vq->vdev->priv;
+	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
+
 	/* Schedule NAPI, Suppress further interrupts if successful. */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(rvq);
-		__napi_schedule(&vi->napi);
+	if (napi_schedule_prep(napi)) {
+		virtqueue_disable_cb(vq);
+		__napi_schedule(napi);
 	}
 }
 
-static void virtnet_napi_enable(struct virtnet_info *vi)
+static void virtnet_napi_enable(struct receive_queue *rq)
 {
-	napi_enable(&vi->napi);
+	napi_enable(&rq->napi);
 
 	/* If all buffers were filled by other side before we napi_enabled, we
 	 * won't get another interrupt, so process any outstanding packets
 	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
 	 * We synchronize against interrupts via NAPI_STATE_SCHED */
-	if (napi_schedule_prep(&vi->napi)) {
-		virtqueue_disable_cb(vi->rvq);
+	if (napi_schedule_prep(&rq->napi)) {
+		virtqueue_disable_cb(rq->vq);
 		local_bh_disable();
-		__napi_schedule(&vi->napi);
+		__napi_schedule(&rq->napi);
 		local_bh_enable();
 	}
 }
 
 static void refill_work(struct work_struct *work)
 {
-	struct virtnet_info *vi;
+	struct napi_struct *napi;
+	struct receive_queue *rq;
 	bool still_empty;
 
-	vi = container_of(work, struct virtnet_info, refill.work);
-	napi_disable(&vi->napi);
-	still_empty = !try_fill_recv(vi, GFP_KERNEL);
-	virtnet_napi_enable(vi);
+	rq = container_of(work, struct receive_queue, refill.work);
+	napi = &rq->napi;
+
+	napi_disable(napi);
+	still_empty = !try_fill_recv(rq, GFP_KERNEL);
+	virtnet_napi_enable(rq);
 
 	/* In theory, this can happen: if we don't get any buffers in
 	 * we will *never* try to fill again. */
 	if (still_empty)
-		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
+		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
 }
 
 static int virtnet_poll(struct napi_struct *napi, int budget)
 {
-	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
+	struct receive_queue *rq = container_of(napi, struct receive_queue,
+						napi);
 	void *buf;
 	unsigned int len, received = 0;
 
 again:
 	while (received < budget &&
-	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
-		receive_buf(vi->dev, buf, len);
-		--vi->num;
+	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
+		receive_buf(rq, buf, len);
+		--rq->num;
 		received++;
 	}
 
-	if (vi->num < vi->max / 2) {
-		if (!try_fill_recv(vi, GFP_ATOMIC))
-			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	if (rq->num < rq->max / 2) {
+		if (!try_fill_recv(rq, GFP_ATOMIC))
+			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
 	}
 
 	/* Out of packets? */
 	if (received < budget) {
 		napi_complete(napi);
-		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
+		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
 		    napi_schedule_prep(napi)) {
-			virtqueue_disable_cb(vi->rvq);
+			virtqueue_disable_cb(rq->vq);
 			__napi_schedule(napi);
 			goto again;
 		}
@@ -557,13 +612,14 @@ again:
 	return received;
 }
 
-static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
+static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
+				       struct virtqueue *vq)
 {
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 
-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
 		u64_stats_update_begin(&stats->tx_syncp);
@@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	return tot_sgs;
 }
 
-static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
+static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
+		    struct virtqueue *vq, struct scatterlist *sg)
 {
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
@@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
+	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(vq, sg, hdr->num_sg,
 				 0, skb, GFP_ATOMIC);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int qnum = skb_get_queue_mapping(skb);
+	struct virtqueue *vq = vi->sq[qnum]->vq;
 	int capacity;
 
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
+	free_old_xmit_skbs(vi, vq);
 
 	/* Try to transmit */
-	capacity = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
 
 	/* This can happen with OOM and indirect buffers. */
 	if (unlikely(capacity < 0)) {
 		if (likely(capacity == -ENOMEM)) {
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "TX queue failure: out of memory\n");
+					"TXQ (%d) failure: out of memory\n",
+					qnum);
 		} else {
 			dev->stats.tx_fifo_errors++;
 			if (net_ratelimit())
 				dev_warn(&dev->dev,
-					 "Unexpected TX queue failure: %d\n",
-					 capacity);
+					"Unexpected TXQ (%d) failure: %d\n",
+					qnum, capacity);
 		}
 		dev->stats.tx_dropped++;
 		kfree_skb(skb);
 		return NETDEV_TX_OK;
 	}
-	virtqueue_kick(vi->svq);
+	virtqueue_kick(vq);
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	skb_orphan(skb);
@@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
-		netif_stop_queue(dev);
-		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
+		netif_stop_subqueue(dev, qnum);
+		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
 			/* More just got used, free them then recheck. */
-			capacity += free_old_xmit_skbs(vi);
+			capacity += free_old_xmit_skbs(vi, vq);
 			if (capacity >= 2+MAX_SKB_FRAGS) {
-				netif_start_queue(dev);
-				virtqueue_disable_cb(vi->svq);
+				netif_start_subqueue(dev, qnum);
+				virtqueue_disable_cb(vq);
 			}
 		}
 	}
@@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 	unsigned int start;
 
 	for_each_possible_cpu(cpu) {
-		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
+		struct virtnet_stats __percpu *stats
+			= per_cpu_ptr(vi->stats, cpu);
 		u64 tpackets, tbytes, rpackets, rbytes;
 
 		do {
@@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
 static void virtnet_netpoll(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	napi_schedule(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		napi_schedule(&vi->rq[i]->napi);
 }
 #endif
 
 static int virtnet_open(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
-	/* Make sure we have some buffers: if oom use wq. */
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		/* Make sure we have some buffers: if oom use wq. */
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
+		virtnet_napi_enable(vi->rq[i]);
+	}
 
-	virtnet_napi_enable(vi);
 	return 0;
 }
 
@@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
 
 	/* Make sure refill_work doesn't re-enable napi! */
-	cancel_delayed_work_sync(&vi->refill);
-	napi_disable(&vi->napi);
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
+		napi_disable(&vi->rq[i]->napi);
+	}
 
 	return 0;
 }
@@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
-	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
-	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
+	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
+	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
 	ring->rx_pending = ring->rx_max_pending;
 	ring->tx_pending = ring->tx_max_pending;
-
 }
 
 
@@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* To avoid contending a lock hold by a vcpu who would exit to host, select the
+ * txq based on the processor id.
+ */
+static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
+		  smp_processor_id();
+
+	while (unlikely(txq >= dev->real_num_tx_queues))
+		txq -= dev->real_num_tx_queues;
+	return txq;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
 	.ndo_get_stats64     = virtnet_stats,
 	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
+	.ndo_select_queue     = virtnet_select_queue,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = virtnet_netpoll,
 #endif
@@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (vi->status & VIRTIO_NET_S_LINK_UP) {
 		netif_carrier_on(vi->dev);
-		netif_wake_queue(vi->dev);
+		netif_tx_wake_all_queues(vi->dev);
 	} else {
 		netif_carrier_off(vi->dev);
-		netif_stop_queue(vi->dev);
+		netif_tx_stop_all_queues(vi->dev);
 	}
 done:
 	mutex_unlock(&vi->config_lock);
@@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
 	queue_work(system_nrt_wq, &vi->config_work);
 }
 
-static int init_vqs(struct virtnet_info *vi)
+static void free_receive_bufs(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		while (vi->rq[i]->pages)
+			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
+	}
+}
+
+/* Free memory allocated for send and receive queues */
+static void virtnet_free_queues(struct virtnet_info *vi)
 {
-	struct virtqueue *vqs[3];
-	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
-	const char *names[] = { "input", "output", "control" };
-	int nvqs, err;
+	int i;
 
-	/* We expect two virtqueues, receive then send,
-	 * and optionally control. */
-	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		kfree(vi->rq[i]);
+		vi->rq[i] = NULL;
+		kfree(vi->sq[i]);
+		vi->sq[i] = NULL;
+	}
+}
 
-	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
-	if (err)
-		return err;
+static void free_unused_bufs(struct virtnet_info *vi)
+{
+	void *buf;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->sq[i]->vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
+			dev_kfree_skb(buf);
+	}
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		struct virtqueue *vq = vi->rq[i]->vq;
+
+		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
+			if (vi->mergeable_rx_bufs || vi->big_packets)
+				give_pages(vi->rq[i], buf);
+			else
+				dev_kfree_skb(buf);
+			--vi->rq[i]->num;
+		}
+		BUG_ON(vi->rq[i]->num != 0);
+	}
+}
+
+static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
+{
+	int i;
+
+	if (vi->num_queue_pairs == 1)
+		return;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		int cpu = set ? i : -1;
+		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
+		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
+	}
+	return;
+}
+
+static void virtnet_del_vqs(struct virtnet_info *vi)
+{
+	struct virtio_device *vdev = vi->vdev;
+
+	virtnet_set_affinity(vi, false);
+
+	vdev->config->del_vqs(vdev);
+
+	virtnet_free_queues(vi);
+}
+
+static int virtnet_find_vqs(struct virtnet_info *vi)
+{
+	vq_callback_t **callbacks;
+	struct virtqueue **vqs;
+	int ret = -ENOMEM;
+	int i, total_vqs;
+	char **names;
 
-	vi->rvq = vqs[0];
-	vi->svq = vqs[1];
+	/*
+	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
+	 * possible control virtqueue and followed by the same
+	 * 'vi->num_queue_pairs-1' more times
+	 */
+	total_vqs = vi->num_queue_pairs * 2 +
+		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
+
+	/* Allocate space for find_vqs parameters */
+	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
+	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
+	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
+	if (!vqs || !callbacks || !names)
+		goto err;
+
+	/* Parameters for control virtqueue, if any */
+	if (vi->has_cvq) {
+		callbacks[2] = NULL;
+		names[2] = "control";
+	}
+
+	/* Allocate/initialize parameters for send/receive virtqueues */
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = (i == 0 ? i : i + vi->has_cvq);
+		callbacks[j] = skb_recv_done;
+		callbacks[j + 1] = skb_xmit_done;
+		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
+		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
+	}
 
-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
+					 (const char **)names);
+	if (ret)
+		goto err;
+
+	if (vi->has_cvq)
 		vi->cvq = vqs[2];
 
-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
-			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
+	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
+		int j = i == 0 ? i : i + vi->has_cvq;
+		vi->rq[i / 2]->vq = vqs[j];
+		vi->sq[i / 2]->vq = vqs[j + 1];
 	}
-	return 0;
+
+err:
+	if (ret && names)
+		for (i = 0; i < vi->num_queue_pairs * 2; i++)
+			kfree(names[i]);
+
+	kfree(names);
+	kfree(callbacks);
+	kfree(vqs);
+
+	return ret;
+}
+
+static int virtnet_alloc_queues(struct virtnet_info *vi)
+{
+	int ret = -ENOMEM;
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
+		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
+		if (!vi->rq[i] || !vi->sq[i])
+			goto err;
+	}
+
+	ret = 0;
+
+	/* setup initial receive and send queue parameters */
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		vi->rq[i]->vi = vi;
+		vi->rq[i]->pages = NULL;
+		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
+		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
+			       napi_weight);
+
+		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
+		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
+	}
+
+err:
+	if (ret)
+		virtnet_free_queues(vi);
+
+	return ret;
+}
+
+static int virtnet_setup_vqs(struct virtnet_info *vi)
+{
+	int ret;
+
+	/* Allocate send & receive queues */
+	ret = virtnet_alloc_queues(vi);
+	if (!ret) {
+		ret = virtnet_find_vqs(vi);
+		if (ret)
+			virtnet_free_queues(vi);
+		else
+			virtnet_set_affinity(vi, true);
+	}
+
+	return ret;
 }
 
 static int virtnet_probe(struct virtio_device *vdev)
 {
-	int err;
+	int i, err;
 	struct net_device *dev;
 	struct virtnet_info *vi;
+	u16 num_queues, num_queue_pairs;
+
+	/* Find if host supports multiqueue virtio_net device */
+	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
+				offsetof(struct virtio_net_config,
+				num_queues), &num_queues);
+
+	/* We need at least 2 queues */
+	if (err || num_queues < 2)
+		num_queues = 2;
+	if (num_queues > MAX_QUEUES * 2)
+		num_queues = MAX_QUEUES;
+
+	num_queue_pairs = num_queues / 2;
 
 	/* Allocate ourselves a network device with room for our info */
-	dev = alloc_etherdev(sizeof(struct virtnet_info));
+	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
 	if (!dev)
 		return -ENOMEM;
 
@@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
 
 	/* Set up our device-specific information */
 	vi = netdev_priv(dev);
-	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
 	vi->dev = dev;
 	vi->vdev = vdev;
 	vdev->priv = vi;
-	vi->pages = NULL;
 	vi->stats = alloc_percpu(struct virtnet_stats);
 	err = -ENOMEM;
 	if (vi->stats == NULL)
-		goto free;
+		goto free_netdev;
 
-	INIT_DELAYED_WORK(&vi->refill, refill_work);
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
-	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
+	vi->num_queue_pairs = num_queue_pairs;
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
 		vi->mergeable_rx_bufs = true;
 
-	err = init_vqs(vi);
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
+		vi->has_cvq = true;
+
+	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
+	err = virtnet_setup_vqs(vi);
 	if (err)
-		goto free_stats;
+		goto free_netdev;
+
+	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
+	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
+		dev->features |= NETIF_F_HW_VLAN_FILTER;
 
 	err = register_netdev(dev);
 	if (err) {
@@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	try_fill_recv(vi, GFP_KERNEL);
-
-	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->num == 0) {
-		err = -ENOMEM;
-		goto unregister;
+	for (i = 0; i < num_queue_pairs; i++) {
+		try_fill_recv(vi->rq[i], GFP_KERNEL);
+
+		/* If we didn't even get one input buffer, we're useless. */
+		if (vi->rq[i]->num == 0) {
+			free_unused_bufs(vi);
+			err = -ENOMEM;
+			goto free_recv_bufs;
+		}
 	}
 
 	/* Assume link up if device can't report link status,
@@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
 		netif_carrier_on(dev);
 	}
 
-	pr_debug("virtnet: registered device %s\n", dev->name);
+	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
+		 dev->name, num_queue_pairs);
+
 	return 0;
 
-unregister:
+free_recv_bufs:
+	free_receive_bufs(vi);
 	unregister_netdev(dev);
+
 free_vqs:
-	vdev->config->del_vqs(vdev);
-free_stats:
-	free_percpu(vi->stats);
-free:
+	for (i = 0; i < num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
+	virtnet_del_vqs(vi);
+
+free_netdev:
 	free_netdev(dev);
 	return err;
 }
 
-static void free_unused_bufs(struct virtnet_info *vi)
-{
-	void *buf;
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->svq);
-		if (!buf)
-			break;
-		dev_kfree_skb(buf);
-	}
-	while (1) {
-		buf = virtqueue_detach_unused_buf(vi->rvq);
-		if (!buf)
-			break;
-		if (vi->mergeable_rx_bufs || vi->big_packets)
-			give_pages(vi, buf);
-		else
-			dev_kfree_skb(buf);
-		--vi->num;
-	}
-	BUG_ON(vi->num != 0);
-}
-
 static void remove_vq_common(struct virtnet_info *vi)
 {
 	vi->vdev->config->reset(vi->vdev);
@@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
 	/* Free unused buffers in both send and recv, if any. */
 	free_unused_bufs(vi);
 
-	vi->vdev->config->del_vqs(vi->vdev);
+	free_receive_bufs(vi);
 
-	while (vi->pages)
-		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+	virtnet_del_vqs(vi);
 }
 
 static void __devexit virtnet_remove(struct virtio_device *vdev)
@@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 static int virtnet_freeze(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
+	int i;
 
 	/* Prevent config work handler from accessing the device */
 	mutex_lock(&vi->config_lock);
@@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	mutex_unlock(&vi->config_lock);
 
 	netif_device_detach(vi->dev);
-	cancel_delayed_work_sync(&vi->refill);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		cancel_delayed_work_sync(&vi->rq[i]->refill);
 
 	if (netif_running(vi->dev))
-		napi_disable(&vi->napi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			napi_disable(&vi->rq[i]->napi);
+
 
 	remove_vq_common(vi);
 
@@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
 static int virtnet_restore(struct virtio_device *vdev)
 {
 	struct virtnet_info *vi = vdev->priv;
-	int err;
+	int err, i;
 
-	err = init_vqs(vi);
+	err = virtnet_setup_vqs(vi);
 	if (err)
 		return err;
 
 	if (netif_running(vi->dev))
-		virtnet_napi_enable(vi);
+		for (i = 0; i < vi->num_queue_pairs; i++)
+			virtnet_napi_enable(vi->rq[i]);
 
 	netif_device_attach(vi->dev);
 
-	if (!try_fill_recv(vi, GFP_KERNEL))
-		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
+	for (i = 0; i < vi->num_queue_pairs; i++)
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
 
 	mutex_lock(&vi->config_lock);
 	vi->config_enable = true;
@@ -1287,7 +1542,7 @@ static unsigned int features[] = {
 	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
 	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
-	VIRTIO_NET_F_GUEST_ANNOUNCE,
+	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
 };
 
 static struct virtio_driver virtio_net_driver = {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1bc7e30..60f09ff 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -61,6 +61,8 @@ struct virtio_net_config {
 	__u8 mac[6];
 	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
 	__u16 status;
+	/* Total number of RX/TX queues */
+	__u16 num_queues;
 } __attribute__((packed));
 
 /* This is the first element of the scatter-gather list.  If you don't
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
                   ` (3 preceding siblings ...)
  2012-07-05 10:29 ` [net-next RFC V5 4/5] virtio_net: multiqueue support Jason Wang
@ 2012-07-05 10:29 ` Jason Wang
  2012-07-05 12:51   ` Sasha Levin
                     ` (2 more replies)
  2012-07-05 17:45 ` [net-next RFC V5 0/5] Multiqueue virtio-net Rick Jones
  2012-07-08  8:19 ` Ronen Hod
  6 siblings, 3 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-05 10:29 UTC (permalink / raw)
  To: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
  Cc: akong, kvm, sri, Jason Wang

This patch lets the virtio_net driver negotiate the number of queues it
wishes to use through the control virtqueue, and exports an ethtool
interface to let the user tweak it.

As the current multiqueue virtio-net implementation has optimizations for
per-cpu virtqueues, only two modes are supported:

- single queue pair mode
- multiple queue pairs mode, where the number of queue pairs matches the
  number of vcpus

Single queue mode is currently used by default due to a regression of
multiqueue mode in some tests (especially in stream tests).

Since the virtio core does not support partially deleting virtqueues,
during mode switching all of the virtqueues are deleted and the driver
re-creates the virtqueues it will use.

Note: the queue number negotiation is deferred to .ndo_open(), because
only after feature negotiation can we send commands on the control
virtqueue (as it may also use event index).

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
 include/linux/virtio_net.h |    7 ++
 2 files changed, 142 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7410187..3339eeb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -88,6 +88,7 @@ struct receive_queue {
 
 struct virtnet_info {
 	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
+	u16 total_queue_pairs;
 
 	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
 	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
@@ -137,6 +138,8 @@ struct padded_vnet_hdr {
 	char padding[6];
 };
 
+static const struct ethtool_ops virtnet_ethtool_ops;
+
 static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
 {
 	int ret = virtqueue_get_queue_index(vq);
@@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
 }
 #endif
 
-static int virtnet_open(struct net_device *dev)
-{
-	struct virtnet_info *vi = netdev_priv(dev);
-	int i;
-
-	for (i = 0; i < vi->num_queue_pairs; i++) {
-		/* Make sure we have some buffers: if oom use wq. */
-		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
-			queue_delayed_work(system_nrt_wq,
-					   &vi->rq[i]->refill, 0);
-		virtnet_napi_enable(vi->rq[i]);
-	}
-
-	return 0;
-}
-
 /*
  * Send command via the control virtqueue and check status.  Commands
  * supported by the hypervisor, as indicated by feature bits, should
@@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
 	rtnl_unlock();
 }
 
+static int virtnet_set_queues(struct virtnet_info *vi)
+{
+	struct scatterlist sg;
+	struct net_device *dev = vi->dev;
+	sg_init_one(&sg, &vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
+
+	if (!vi->has_cvq)
+		return -EINVAL;
+
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
+				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM, &sg, 1, 0)){
+		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
+			 " %d\n", vi->num_queue_pairs);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int virtnet_open(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < vi->num_queue_pairs; i++) {
+		/* Make sure we have some buffers: if oom use wq. */
+		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
+			queue_delayed_work(system_nrt_wq,
+					   &vi->rq[i]->refill, 0);
+		virtnet_napi_enable(vi->rq[i]);
+	}
+
+	virtnet_set_queues(vi);
+
+	return 0;
+}
+
 static int virtnet_close(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
 
 }
 
-static const struct ethtool_ops virtnet_ethtool_ops = {
-	.get_drvinfo = virtnet_get_drvinfo,
-	.get_link = ethtool_op_get_link,
-	.get_ringparam = virtnet_get_ringparam,
-};
-
 #define MIN_MTU 68
 #define MAX_MTU 65535
 
@@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 
 err:
 	if (ret && names)
-		for (i = 0; i < vi->num_queue_pairs * 2; i++)
+		for (i = 0; i < total_vqs * 2; i++)
 			kfree(names[i]);
 
 	kfree(names);
@@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	mutex_init(&vi->config_lock);
 	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
-	vi->num_queue_pairs = num_queue_pairs;
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
@@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
 		vi->has_cvq = true;
 
+	/* Use single tx/rx queue pair as default */
+	vi->num_queue_pairs = 1;
+	vi->total_queue_pairs = num_queue_pairs;
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = virtnet_setup_vqs(vi);
 	if (err)
@@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
 	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
 		dev->features |= NETIF_F_HW_VLAN_FILTER;
 
+	netif_set_real_num_tx_queues(dev, 1);
+	netif_set_real_num_rx_queues(dev, 1);
+
 	err = register_netdev(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
@@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	/* Last of all, set up some receive buffers. */
-	for (i = 0; i < num_queue_pairs; i++) {
+	for (i = 0; i < vi->num_queue_pairs; i++) {
 		try_fill_recv(vi->rq[i], GFP_KERNEL);
 
 		/* If we didn't even get one input buffer, we're useless. */
@@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
 	free_netdev(vi->dev);
 }
 
-#ifdef CONFIG_PM
-static int virtnet_freeze(struct virtio_device *vdev)
+static void virtnet_stop(struct virtnet_info *vi)
 {
-	struct virtnet_info *vi = vdev->priv;
 	int i;
 
 	/* Prevent config work handler from accessing the device */
@@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 		for (i = 0; i < vi->num_queue_pairs; i++)
 			napi_disable(&vi->rq[i]->napi);
 
-
-	remove_vq_common(vi);
-
-	flush_work(&vi->config_work);
-
-	return 0;
 }
 
-static int virtnet_restore(struct virtio_device *vdev)
+static int virtnet_start(struct virtnet_info *vi)
 {
-	struct virtnet_info *vi = vdev->priv;
 	int err, i;
 
 	err = virtnet_setup_vqs(vi);
@@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
 
 	return 0;
 }
+
+#ifdef CONFIG_PM
+static int virtnet_freeze(struct virtio_device *vdev)
+{
+	struct virtnet_info *vi = vdev->priv;
+
+	virtnet_stop(vi);
+
+	remove_vq_common(vi);
+
+	flush_work(&vi->config_work);
+
+	return 0;
+}
+
+static int virtnet_restore(struct virtio_device *vdev)
+{
+	struct virtnet_info *vi = vdev->priv;
+
+	virtnet_start(vi);
+
+	return 0;
+}
 #endif
 
 static struct virtio_device_id id_table[] = {
@@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
 #endif
 };
 
+static int virtnet_set_channels(struct net_device *dev,
+				struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	u16 queues = channels->rx_count;
+	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+
+	if (channels->rx_count != channels->tx_count)
+		return -EINVAL;
+	/* Only two modes were support currently */
+	if (queues != vi->total_queue_pairs && queues != 1)
+		return -EINVAL;
+	if (!vi->has_cvq)
+		return -EINVAL;
+
+	virtnet_stop(vi);
+
+	netif_set_real_num_tx_queues(dev, queues);
+	netif_set_real_num_rx_queues(dev, queues);
+
+	remove_vq_common(vi);
+	flush_work(&vi->config_work);
+
+	vi->num_queue_pairs = queues;
+	virtnet_start(vi);
+
+	vi->vdev->config->finalize_features(vi->vdev);
+
+	if (virtnet_set_queues(vi))
+		status |= VIRTIO_CONFIG_S_FAILED;
+	else
+		status |= VIRTIO_CONFIG_S_DRIVER_OK;
+
+	vi->vdev->config->set_status(vi->vdev, status);
+
+	return 0;
+}
+
+static void virtnet_get_channels(struct net_device *dev,
+				 struct ethtool_channels *channels)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	channels->max_rx = vi->total_queue_pairs;
+	channels->max_tx = vi->total_queue_pairs;
+	channels->max_other = 0;
+	channels->max_combined = 0;
+	channels->rx_count = vi->num_queue_pairs;
+	channels->tx_count = vi->num_queue_pairs;
+	channels->other_count = 0;
+	channels->combined_count = 0;
+}
+
+static const struct ethtool_ops virtnet_ethtool_ops = {
+	.get_drvinfo = virtnet_get_drvinfo,
+	.get_link = ethtool_op_get_link,
+	.get_ringparam = virtnet_get_ringparam,
+	.set_channels = virtnet_set_channels,
+	.get_channels = virtnet_get_channels,
+};
+
 static int __init init(void)
 {
 	return register_virtio_driver(&virtio_net_driver);
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 60f09ff..0d21e08 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
 #define VIRTIO_NET_CTRL_ANNOUNCE       3
  #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
 
+/*
+ * Control multiqueue
+ *
+ */
+#define VIRTIO_NET_CTRL_MULTIQUEUE       4
+ #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
+
 #endif /* _LINUX_VIRTIO_NET_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
  2012-07-05 10:29 ` [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue Jason Wang
@ 2012-07-05 11:40   ` Sasha Levin
  2012-07-06  3:17     ` Jason Wang
  2012-07-26  8:20     ` Paolo Bonzini
  0 siblings, 2 replies; 46+ messages in thread
From: Sasha Levin @ 2012-07-05 11:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> Instead of storing the queue index in virtio infos, this patch moves them to
> vring_virtqueue and introduces helpers to set and get the value. This would
> simplify the management and tracing.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

This patch actually fails to compile:

drivers/virtio/virtio_mmio.c: In function ‘vm_notify’:
drivers/virtio/virtio_mmio.c:229:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
drivers/virtio/virtio_mmio.c: In function ‘vm_del_vq’:
drivers/virtio/virtio_mmio.c:278:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
make[2]: *** [drivers/virtio/virtio_mmio.o] Error 1

It probably misses the following hunks:

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index f5432b6..12b6180 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -222,11 +222,10 @@ static void vm_reset(struct virtio_device *vdev)
 static void vm_notify(struct virtqueue *vq)
 {
        struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
-       struct virtio_mmio_vq_info *info = vq->priv;
 
        /* We write the queue's selector into the notification register to
         * signal the other end */
-       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
+       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
 }
 
 /* Notify all virtqueues on an interrupt. */
@@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
        vring_del_virtqueue(vq);
 
        /* Select and deactivate the queue */
-       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
+       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
        writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
        size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
@ 2012-07-05 12:51   ` Sasha Levin
  2012-07-05 20:07     ` Amos Kong
  2012-07-06  3:20     ` Jason Wang
  2012-07-09 20:13   ` Ben Hutchings
  2012-07-20 12:33   ` Michael S. Tsirkin
  2 siblings, 2 replies; 46+ messages in thread
From: Sasha Levin @ 2012-07-05 12:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>         if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>                 vi->has_cvq = true;
>  
> +       /* Use single tx/rx queue pair as default */
> +       vi->num_queue_pairs = 1;
> +       vi->total_queue_pairs = num_queue_pairs; 

The code is using this "default" even if the amount of queue pairs it
wants was specified during initialization. This basically limits any
device to use 1 pair when starting up.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
                   ` (4 preceding siblings ...)
  2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
@ 2012-07-05 17:45 ` Rick Jones
  2012-07-06  7:42   ` Jason Wang
  2012-07-08  8:19 ` Ronen Hod
  6 siblings, 1 reply; 46+ messages in thread
From: Rick Jones @ 2012-07-05 17:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/05/2012 03:29 AM, Jason Wang wrote:

>
> Test result:
>
> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 650.55 655.61 100% 24.88 24.86 99%
> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
> 8 64 1450.89 1270.82 87% 30.83 25.95 84%

Was the -D test-specific option used to set TCP_NODELAY?  I'm guessing 
from your description of how packet sizes were smaller with multiqueue 
and your need to hack tcp_write_xmit() it wasn't but since we don't have 
the specific netperf command lines (hint hint :) I wanted to make certain.

Instead of calling them throughput1 and throughput2, it might be more 
clear in future to identify them as singlequeue and multiqueue.

Also, how are you combining the concurrent netperf results?  Are you 
taking sums of what netperf reports, or are you gathering statistics 
outside of netperf?

> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%

A single instance TCP_RR test would help confirm/refute any non-trivial 
change in (effective) path length between the two cases.
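
For reference, hedged examples of the kind of netperf command lines being
asked about; the netserver address and test duration are assumptions, and
the actual command lines used in these tests were not posted:

```shell
# Single-stream TCP_STREAM with 64-byte sends and TCP_NODELAY set (-D):
netperf -H 192.168.0.2 -t TCP_STREAM -l 60 -- -m 64 -D
# Single-instance TCP_RR with a 1-byte request/response:
netperf -H 192.168.0.2 -t TCP_RR -l 60 -- -r 1,1
```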

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-05 10:29 ` [net-next RFC V5 4/5] virtio_net: multiqueue support Jason Wang
@ 2012-07-05 20:02   ` Amos Kong
  2012-07-06  7:45     ` Jason Wang
  2012-07-20 13:40   ` Michael S. Tsirkin
  1 sibling, 1 reply; 46+ messages in thread
From: Amos Kong @ 2012-07-05 20:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, kvm, sri

On 07/05/2012 06:29 PM, Jason Wang wrote:
> This patch converts virtio_net to a multi queue device. After negotiated
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
> and driver could read the number from config space.
> 
> The driver expects the number of rx/tx queue paris is equal to the number of
> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
> optimization were introduced:
> 
> - Txq selection is based on the processor id in order to avoid contending a lock
>   whose owner may exits to host.
> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>   the queue pairs.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---

...

>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */


s/atleast/at least/


> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;

                num_queues = MAX_QUEUES * 2;

MAX_QUEUES is the limitation of RX or TX.

> +
> +	num_queue_pairs = num_queues / 2;

...

-- 
			Amos.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 12:51   ` Sasha Levin
@ 2012-07-05 20:07     ` Amos Kong
  2012-07-06  7:46       ` Jason Wang
  2012-07-06  3:20     ` Jason Wang
  1 sibling, 1 reply; 46+ messages in thread
From: Amos Kong @ 2012-07-05 20:07 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jason Wang, mst, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, kvm,
	sri

On 07/05/2012 08:51 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>         if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>                 vi->has_cvq = true;
>>  


>> +       /* Use single tx/rx queue pair as default */
>> +       vi->num_queue_pairs = 1;
>> +       vi->total_queue_pairs = num_queue_pairs; 

vi->total_queue_pairs also should be set to 1

           vi->total_queue_pairs = 1;

> 
> The code is using this "default" even if the amount of queue pairs it
> wants was specified during initialization. This basically limits any
> device to use 1 pair when starting up.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
			Amos.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
  2012-07-05 11:40   ` Sasha Levin
@ 2012-07-06  3:17     ` Jason Wang
  2012-07-26  8:20     ` Paolo Bonzini
  1 sibling, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-06  3:17 UTC (permalink / raw)
  To: Sasha Levin
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/05/2012 07:40 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> Instead of storing the queue index in virtio infos, this patch moves them to
>> vring_virtqueue and introduces helpers to set and get the value. This would
>> simplify the management and tracing.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> This patch actually fails to compile:
>
> drivers/virtio/virtio_mmio.c: In function ‘vm_notify’:
> drivers/virtio/virtio_mmio.c:229:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
> drivers/virtio/virtio_mmio.c: In function ‘vm_del_vq’:
> drivers/virtio/virtio_mmio.c:278:13: error: ‘struct virtio_mmio_vq_info’ has no member named ‘queue_index’
> make[2]: *** [drivers/virtio/virtio_mmio.o] Error 1
>
> It probably misses the following hunks:
>
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index f5432b6..12b6180 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -222,11 +222,10 @@ static void vm_reset(struct virtio_device *vdev)
>   static void vm_notify(struct virtqueue *vq)
>   {
>          struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev);
> -       struct virtio_mmio_vq_info *info = vq->priv;
>
>          /* We write the queue's selector into the notification register to
>           * signal the other end */
> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NOTIFY);
>   }
>
>   /* Notify all virtqueues on an interrupt. */
> @@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
>          vring_del_virtqueue(vq);
>
>          /* Select and deactivate the queue */
> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>          writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>
>          size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN));
>
Oops, I missed the virtio mmio part, thanks for pointing this out.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 12:51   ` Sasha Levin
  2012-07-05 20:07     ` Amos Kong
@ 2012-07-06  3:20     ` Jason Wang
  2012-07-06  6:38       ` Stephen Hemminger
  2012-07-06  8:10       ` Sasha Levin
  1 sibling, 2 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-06  3:20 UTC (permalink / raw)
  To: Sasha Levin
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/05/2012 08:51 PM, Sasha Levin wrote:
> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>                  vi->has_cvq = true;
>>
>> +       /* Use single tx/rx queue pair as default */
>> +       vi->num_queue_pairs = 1;
>> +       vi->total_queue_pairs = num_queue_pairs;
> The code is using this "default" even if the amount of queue pairs it
> wants was specified during initialization. This basically limits any
> device to use 1 pair when starting up.
>

Yes, currently the virtio-net driver uses a single txq/rxq pair by 
default, since multiqueue may not outperform single queue in all kinds of 
workloads. So it's better to keep single queue as the default and let the 
user enable multiqueue via ethtool -L.
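
A minimal usage sketch of the ethtool interface this series adds, assuming
a 4-vcpu guest and an interface named eth0 (both hypothetical):

```shell
# Show the current and maximum channel configuration:
ethtool -l eth0
# Switch to multiqueue mode (one queue pair per vcpu):
ethtool -L eth0 rx 4 tx 4
# Fall back to the single queue default:
ethtool -L eth0 rx 1 tx 1
```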


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-06  3:20     ` Jason Wang
@ 2012-07-06  6:38       ` Stephen Hemminger
  2012-07-06  9:26         ` Jason Wang
  2012-07-06  8:10       ` Sasha Levin
  1 sibling, 1 reply; 46+ messages in thread
From: Stephen Hemminger @ 2012-07-06  6:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Sasha Levin, krkumar2, habanero, mashirle, kvm, mst, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri

On Fri, 06 Jul 2012 11:20:06 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 07/05/2012 08:51 PM, Sasha Levin wrote:
> > On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> >> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >>                  vi->has_cvq = true;
> >>
> >> +       /* Use single tx/rx queue pair as default */
> >> +       vi->num_queue_pairs = 1;
> >> +       vi->total_queue_pairs = num_queue_pairs;
> > The code is using this "default" even if the amount of queue pairs it
> > wants was specified during initialization. This basically limits any
> > device to use 1 pair when starting up.
> >
> 
> Yes, currently the virtio-net driver would use 1 txq/txq by default 
> since multiqueue may not outperform in all kinds of workload. So it's 
> better to keep it as default and let user enable multiqueue by ethtool -L.
> 

I would prefer that the driver sized the number of queues based on the
number of online CPUs. That is what real hardware does. What kind of
workload are you doing? If it is some DBMS benchmark then maybe the issue
is that some CPUs need to be reserved.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-05 17:45 ` [net-next RFC V5 0/5] Multiqueue virtio-net Rick Jones
@ 2012-07-06  7:42   ` Jason Wang
  2012-07-06 16:23     ` Rick Jones
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2012-07-06  7:42 UTC (permalink / raw)
  To: Rick Jones
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/06/2012 01:45 AM, Rick Jones wrote:
> On 07/05/2012 03:29 AM, Jason Wang wrote:
>
>>
>> Test result:
>>
>> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2   norm1 norm2
>> 1 64 650.55 655.61 100% 24.88 24.86 99%
>> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
>> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
>> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
>
> Was the -D test-specific option used to set TCP_NODELAY?  I'm guessing 
> from your description of how packet sizes were smaller with multiqueue 
> and your need to hack tcp_write_xmit() it wasn't but since we don't 
> have the specific netperf command lines (hint hint :) I wanted to make 
> certain.
Hi Rick:

I didn't specify -D for disabling Nagle. I also collected packet counts 
and average packet sizes:

Guest to External Host ( 2vcpu 1q vs 2q )
sessions size tput-sq tput-mq % norm-sq norm-mq % #tx-pkts-sq #tx-pkts-mq % avg-sz-sq avg-sz-mq %
1 64 668.85 671.13 100% 25.80 26.86 104% 629038 627126 99% 1395 1403 100%
2 64 1421.29 1345.40 94% 32.06 27.57 85% 1318498 1246721 94% 1413 1414 100%
4 64 1469.96 1365.42 92% 32.44 27.04 83% 1362542 1277848 93% 1414 1401 99%
8 64 1131.00 1361.58 120% 24.81 26.76 107% 1223700 1280970 104% 1395 1394 99%
1 256 1883.98 1649.87 87% 60.67 58.48 96% 1542775 1465836 95% 1592 1472 92%
2 256 4847.09 3539.74 73% 98.35 64.05 65% 2683346 3074046 114% 2323 1505 64%
4 256 5197.33 3283.48 63% 109.14 62.39 57% 1819814 2929486 160% 3636 1467 40%
8 256 5953.53 3359.22 56% 122.75 64.21 52% 906071 2924148 322% 8282 1502 18%
1 512 3019.70 2646.07 87% 93.89 86.78 92% 2003780 2256077 112% 1949 1532 78%
2 512 7455.83 5861.03 78% 173.79 104.43 60% 1200322 3577142 298% 7831 2114 26%
4 512 8962.28 7062.20 78% 213.08 127.82 59% 468142 2594812 554% 24030 3468 14%
8 512 7849.82 8523.85 108% 175.41 154.19 87% 304923 1662023 545% 38640 6479 16%

When multiqueue is enabled, it does achieve a higher packets-per-second 
rate, but with a much smaller packet size. It looks to me that multiqueue 
is faster and the guest tcp has less opportunity to build larger skbs to 
send, so many more small packets are needed, which leads to many more 
#exits and more vhost work. One interesting thing is, if I run tcpdump 
in the host where the guest runs, I get an obvious throughput increase. 
To verify the assumption, I hacked tcp_write_xmit() with the following 
patch and set tcp_tso_win_divisor=1; then multiqueue can outperform or at 
least match singlequeue throughput, though it could introduce latency, 
which I haven't measured yet.

I'm no expert on tcp, but the changes look reasonable to me:
- we can do the full-sized TSO check in tcp_tso_should_defer() only for 
westwood, according to tcp westwood
- run tcp_tso_should_defer() for tso_segs = 1 when tso is enabled.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c465d3e..166a888 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1567,7 +1567,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
 	in_flight = tcp_packets_in_flight(tp);
 
-	BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight));
+	BUG_ON(tp->snd_cwnd <= in_flight);
 
 	send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
 
@@ -1576,9 +1576,11 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
 	limit = min(send_win, cong_win);
 
+#if 0
 	/* If a full-sized TSO skb can be sent, do it. */
 	if (limit >= sk->sk_gso_max_size)
 		goto send_now;
+#endif
 
 	/* Middle in queue won't get any more data, full sendable already? */
 	if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
@@ -1795,10 +1797,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 						     (tcp_skb_is_last(sk, skb) ?
 						      nonagle : TCP_NAGLE_PUSH))))
 				break;
-		} else {
-			if (!push_one && tcp_tso_should_defer(sk, skb))
-				break;
 		}
+		if (!push_one && tcp_tso_should_defer(sk, skb))
+			break;
 
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))




>
> Instead of calling them throughput1 and throughput2, it might be more 
> clear in future to identify them as singlequeue and multiqueue.
>

Sure.
> Also, how are you combining the concurrent netperf results?  Are you 
> taking sums of what netperf reports, or are you gathering statistics 
> outside of netperf?
>

The throughput were just sumed from netperf result like what netperf 
manual suggests. The cpu utilization were measured by mpstat.
>> - TCP RR
>> sessions size throughput1 throughput2   norm1 norm2
>> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
>
> A single instance TCP_RR test would help confirm/refute any 
> non-trivial change in (effective) path length between the two cases.
>

Yes, I would test this thanks.
> happy benchmarking,
>
> rick jones
> -- 
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-05 20:02   ` Amos Kong
@ 2012-07-06  7:45     ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-06  7:45 UTC (permalink / raw)
  To: Amos Kong
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, kvm, sri

On 07/06/2012 04:02 AM, Amos Kong wrote:
> On 07/05/2012 06:29 PM, Jason Wang wrote:
>> This patch converts virtio_net to a multiqueue device. After negotiating the
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has multiple tx/rx queue pairs,
>> and the driver can read the number from config space.
>>
>> The driver expects the number of rx/tx queue pairs to equal the number of
>> vcpus. To maximize performance with these per-cpu rx/tx queue pairs, some
>> optimizations were introduced:
>>
>> - Txq selection is based on the processor id in order to avoid contending a lock
>>    whose owner may exit to the host.
>> - Since the txq/rxq pairs are per-cpu, the affinity hint is set to the cpu that owns
>>    the queue pair.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
> ...
>
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues),&num_queues);
>> +
>> +	/* We need atleast 2 queue's */
>
> s/atleast/at least/
>
>
>> +	if (err || num_queues<  2)
>> +		num_queues = 2;
>> +	if (num_queues>  MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>                  num_queues = MAX_QUEUES * 2;
>
> MAX_QUEUES is the limitation of RX or TX.

Right, it's a typo, thanks.
>> +
>> +	num_queue_pairs = num_queues / 2;
> ...
>



* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 20:07     ` Amos Kong
@ 2012-07-06  7:46       ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-06  7:46 UTC (permalink / raw)
  To: Amos Kong
  Cc: Sasha Levin, mst, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, kvm,
	sri

On 07/06/2012 04:07 AM, Amos Kong wrote:
> On 07/05/2012 08:51 PM, Sasha Levin wrote:
>> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>                  vi->has_cvq = true;
>>>
>
>>> +       /* Use single tx/rx queue pair as default */
>>> +       vi->num_queue_pairs = 1;
>>> +       vi->total_queue_pairs = num_queue_pairs;
> vi->total_queue_pairs also should be set to 1
>
>             vi->total_queue_pairs = 1;

Hi Amos:

total_queue_pairs is the maximum number of queue pairs that the device can 
provide, so it's OK here.
>> The code is using this "default" even if the amount of queue pairs it
>> wants was specified during initialization. This basically limits any
>> device to use 1 pair when starting up.
>>
>



* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-06  3:20     ` Jason Wang
  2012-07-06  6:38       ` Stephen Hemminger
@ 2012-07-06  8:10       ` Sasha Levin
  1 sibling, 0 replies; 46+ messages in thread
From: Sasha Levin @ 2012-07-06  8:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Fri, 2012-07-06 at 11:20 +0800, Jason Wang wrote:
> On 07/05/2012 08:51 PM, Sasha Levin wrote:
> > On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> >> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>          if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >>                  vi->has_cvq = true;
> >>
> >> +       /* Use single tx/rx queue pair as default */
> >> +       vi->num_queue_pairs = 1;
> >> +       vi->total_queue_pairs = num_queue_pairs;
> > The code is using this "default" even if the amount of queue pairs it
> > wants was specified during initialization. This basically limits any
> > device to use 1 pair when starting up.
> >
> 
> Yes, currently the virtio-net driver uses 1 txq/rxq pair by default 
> since multiqueue may not outperform single queue in all kinds of workloads. So it's 
> better to keep that as the default and let the user enable multiqueue with ethtool -L.

I think it makes sense to set it to 1 if the number of initial queue
pairs wasn't specified.

On the other hand, if a virtio-net driver was probed to provide
VIRTIO_NET_F_MULTIQUEUE and has set something reasonable in
virtio_net_config.num_queues, then that setting shouldn't be quietly
ignored and reset back to 1.

What I'm basically saying is that I agree that the *default* should be 1
- but if the user has explicitly asked for something else during
initialization, then the default should be overridden.



* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-06  6:38       ` Stephen Hemminger
@ 2012-07-06  9:26         ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-06  9:26 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Sasha Levin, krkumar2, habanero, mashirle, kvm, mst, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri

On 07/06/2012 02:38 PM, Stephen Hemminger wrote:
> On Fri, 06 Jul 2012 11:20:06 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 07/05/2012 08:51 PM, Sasha Levin wrote:
>>> On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
>>>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>>>           if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>                   vi->has_cvq = true;
>>>>
>>>> +       /* Use single tx/rx queue pair as default */
>>>> +       vi->num_queue_pairs = 1;
>>>> +       vi->total_queue_pairs = num_queue_pairs;
>>> The code is using this "default" even if the amount of queue pairs it
>>> wants was specified during initialization. This basically limits any
>>> device to use 1 pair when starting up.
>>>
>> Yes, currently the virtio-net driver uses 1 txq/rxq pair by default
>> since multiqueue may not outperform single queue in all kinds of workloads. So it's
>> better to keep that as the default and let the user enable multiqueue with ethtool -L.
>>
> I would prefer that the driver sized the number of queues based on the number
> of online CPUs. That is what real hardware does. What kind of workload
> are you doing? If it is some DBMS benchmark then maybe the issue is that
> some CPUs need to be reserved.

I ran the rr and stream tests of netperf; multiqueue shows an improvement 
in the rr test and a regression in small packet transmission in the stream 
test. For small packet transmission, multiqueue tends to send many more 
small packets, which also increases the cpu utilization. I suspect 
multiqueue is faster and tcp does not merge big enough packets to send, 
but this needs more thought.


* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-06  7:42   ` Jason Wang
@ 2012-07-06 16:23     ` Rick Jones
  2012-07-09  3:23       ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Rick Jones @ 2012-07-06 16:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/06/2012 12:42 AM, Jason Wang wrote:
> I'm no expert on tcp, but the changes look reasonable:
> - we can do the full-sized TSO check in tcp_tso_should_defer() only for
> westwood, according to tcp westwood
> - run tcp_tso_should_defer() for tso_segs = 1 when tso is enabled.

I'm sure Eric and David will weigh in on the TCP change.  My initial 
inclination would have been to say "well, if multiqueue is draining 
faster, that means ACKs come back faster, which means the 'race' between 
more data being queued by netperf and the ACKs will go more to the ACKs, 
which means the segments being sent will be smaller - as TCP_NODELAY is 
not set, the Nagle algorithm is in force, which means once there is data 
outstanding on the connection, no more will be sent until either the 
outstanding data is ACKed, or there is an accumulation of > MSS worth of 
data to send."

>> Also, how are you combining the concurrent netperf results?  Are you
>> taking sums of what netperf reports, or are you gathering statistics
>> outside of netperf?
>>
>
> The throughput was just summed from the netperf results, as the netperf
> manual suggests. The cpu utilization was measured by mpstat.

Which mechanism did you use to address skew error?  The netperf manual 
describes more than one:

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

Personally, my preference these days is to use the "demo mode" method of 
aggregate results as it can be rather faster than (ab)using the 
confidence intervals mechanism, which I suspect may not really scale all 
that well to large numbers of concurrent netperfs.

I also tend to use the --enable-burst configure option to allow me to 
minimize the number of concurrent netperfs in the first place.  Set 
TCP_NODELAY (the test-specific -D option) and then have several 
transactions outstanding at one time (test-specific -b option with a 
number of additional in-flight transactions).

This is expressed in the runemomniaggdemo.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh

which uses the find_max_burst.sh script:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh

to pick the burst size to use in the concurrent netperfs, the results of 
which can be post-processed with:

http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py

The nice feature of using the "demo mode" mechanism is that when it is 
coupled with systems with reasonably synchronized clocks (eg NTP) it can 
be used for many-to-many testing in addition to one-to-many testing 
(which cannot be dealt with by the confidence-interval method of dealing 
with skew error).

>> A single instance TCP_RR test would help confirm/refute any
>> non-trivial change in (effective) path length between the two cases.
>>
>
> Yes, I would test this thanks.

Excellent.

happy benchmarking,

rick jones



* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
                   ` (5 preceding siblings ...)
  2012-07-05 17:45 ` [net-next RFC V5 0/5] Multiqueue virtio-net Rick Jones
@ 2012-07-08  8:19 ` Ronen Hod
  2012-07-09  5:35   ` Jason Wang
  6 siblings, 1 reply; 46+ messages in thread
From: Ronen Hod @ 2012-07-08  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/05/2012 01:29 PM, Jason Wang wrote:
> Hello All:
>
> This series is an updated version of the multiqueue virtio-net driver based on
> Krishna Kumar's work, letting virtio-net use multiple rx/tx queues for
> packet reception and transmission. Please review and comment.
>
> Test Environment:
> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
> - Two directed connected 82599
>
> Test Summary:
>
> - Highlights: huge improvements on TCP_RR test

Hi Jason,

It might be that the good TCP_RR results are due to the large number of sessions (50-250). Can you also test it with a small number of sessions?

> - Lowlights: regression on small packet transmission, higher cpu utilization
>               than single queue, need further optimization
>
> Analysis of the performance result:
>
> - I counted the number of packets sent/received during the test, and
>    multiqueue shows much higher capability in terms of packets per second.
>
> - For the tx regression, multiqueue sends about 1-2 times more packets
>    than single queue, and the packet sizes are much smaller than with single
>    queue. I suspect tcp does less batching with multiqueue, so I hacked
>    tcp_write_xmit() to force more batching; with that, multiqueue works as well as
>    single queue for both small-packet transmission and throughput.

Could it be that since the CPUs are not busy, they are available for immediate handling of the packets (little batching)? In such a scenario the CPU utilization is not really interesting. What will happen on a busy machine?

Ronen.

>
> - I didn't pack the accelerated RFS support for virtio-net in this series as it still
>    needs further shaping; anyone interested in it please see:
>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>
> Changes from V4:
> - Add ability to negotiate the number of queues through control virtqueue
> - Ethtool -{L|l} support and default the tx/rx queue number to 1
> - Expose the API to set irq affinity instead of irq itself
>
> Changes from V3:
>
> - Rebase to the net-next
> - Let queue 2 to be the control virtqueue to obey the spec
> - Provides irq affinity
> - Choose txq based on processor id
>
> References:
>
> - V4: https://lkml.org/lkml/2012/6/25/120
> - V3: http://lwn.net/Articles/467283/
>
> Test result:
>
> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 650.55 655.61 100% 24.88 24.86 99%
> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
> 1 256 1699.45 1779.58 104% 56.75 59.08 104%
> 2 256 4902.71 3446.59 70% 98.53 62.78 63%
> 4 256 4803.76 2980.76 62% 97.44 54.68 56%
> 8 256 5128.88 3158.74 61% 104.68 58.61 55%
> 1 512 2837.98 2838.42 100% 89.76 90.41 100%
> 2 512 6742.59 5495.83 81% 155.03 99.07 63%
> 4 512 9193.70 5900.17 64% 202.84 106.44 52%
> 8 512 9287.51 7107.79 76% 202.18 129.08 63%
> 1 1024 4166.42 4224.98 101% 128.55 129.86 101%
> 2 1024 6196.94 7823.08 126% 181.80 168.81 92%
> 4 1024 9113.62 9219.49 101% 235.15 190.93 81%
> 8 1024 9324.25 9402.66 100% 239.10 179.99 75%
> 1 2048 7441.63 6534.04 87% 248.01 215.63 86%
> 2 2048 7024.61 7414.90 105% 225.79 219.62 97%
> 4 2048 8971.49 9269.00 103% 278.94 220.84 79%
> 8 2048 9314.20 9359.96 100% 268.36 192.23 71%
> 1 4096 8282.60 8990.08 108% 277.45 320.05 115%
> 2 4096 9194.80 9293.78 101% 317.02 248.76 78%
> 4 4096 9340.73 9313.19 99% 300.34 230.35 76%
> 8 4096 9148.23 9347.95 102% 279.49 199.43 71%
> 1 16384 8787.89 8766.31 99% 312.38 316.53 101%
> 2 16384 9306.35 9156.14 98% 319.53 279.83 87%
> 4 16384 9177.81 9307.50 101% 312.69 230.07 73%
> 8 16384 9035.82 9188.00 101% 298.32 199.17 66%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
> 100 1 60141.88 88598.94 147% 2157.90 2000.45 92%
> 250 1 74763.56 135584.22 181% 2541.94 2628.59 103%
> 50 64 51628.38 82867.50 160% 1872.55 1812.16 96%
> 100 64 60367.73 84080.60 139% 2215.69 1867.69 84%
> 250 64 68502.70 124910.59 182% 2321.43 2495.76 107%
> 50 128 53477.08 77625.07 145% 1905.10 1870.99 98%
> 100 128 59697.56 74902.37 125% 2230.66 1751.03 78%
> 250 128 71248.74 133963.55 188% 2453.12 2711.72 110%
> 50 256 47663.86 67742.63 142% 1880.45 1735.30 92%
> 100 256 54051.84 68738.57 127% 2123.03 1778.59 83%
> 250 256 68250.06 124487.90 182% 2321.89 2598.60 111%
> - External Host to Guest TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 847.71 864.83 102% 57.99 57.93 99%
> 2 64 1690.82 1544.94 91% 80.13 55.09 68%
> 4 64 3434.98 3455.53 100% 127.17 89.00 69%
> 8 64 5890.19 6557.35 111% 194.70 146.52 75%
> 1 256 2094.04 2109.14 100% 130.73 127.14 97%
> 2 256 5218.13 3731.97 71% 219.15 114.02 52%
> 4 256 6734.51 9213.47 136% 227.87 208.31 91%
> 8 256 6452.86 9402.78 145% 224.83 207.77 92%
> 1 512 3945.07 4203.68 106% 279.72 273.30 97%
> 2 512 7878.96 8122.55 103% 278.25 231.71 83%
> 4 512 7645.89 9402.13 122% 252.10 217.42 86%
> 8 512 6657.06 9403.71 141% 239.81 214.89 89%
> 1 1024 5729.06 5111.21 89% 289.38 303.09 104%
> 2 1024 8097.27 8159.67 100% 269.29 242.97 90%
> 4 1024 7778.93 8919.02 114% 261.28 205.50 78%
> 8 1024 6458.02 9360.02 144% 221.26 208.09 94%
> 1 2048 6426.94 5195.59 80% 292.52 307.47 105%
> 2 2048 8221.90 9025.66 109% 283.80 242.25 85%
> 4 2048 7364.72 8527.79 115% 248.10 198.36 79%
> 8 2048 6760.63 9161.07 135% 230.53 205.12 88%
> 1 4096 7247.02 6874.21 94% 276.23 287.68 104%
> 2 4096 8346.04 8818.65 105% 281.49 254.81 90%
> 4 4096 6710.00 9354.59 139% 216.41 210.13 97%
> 8 4096 6265.69 9406.87 150% 206.69 210.92 102%
> 1 16384 8159.50 8048.79 98% 266.94 283.11 106%
> 2 16384 8525.66 8552.41 100% 294.36 239.27 81%
> 4 16384 6042.24 8447.86 139% 200.21 196.40 98%
> 8 16384 6432.63 9403.49 146% 211.48 206.13 97%
>
> 2) 1 vm 4 vcpu 1q vs 4q, 1 - 1q, 2 - 4q, no pinning
>
> - Guest to External Host TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 636.93 657.69 103% 23.55 24.42 103%
> 2 64 1457.46 1268.78 87% 30.97 26.02 84%
> 4 64 3062.86 2302.43 75% 41.00 29.64 72%
> 8 64 3107.68 2308.32 74% 41.62 29.07 69%
> 1 256 1743.50 1750.11 100% 59.00 56.63 95%
> 2 256 4582.61 2870.31 62% 92.47 51.97 56%
> 4 256 8440.96 4795.37 56% 135.10 56.39 41%
> 8 256 9240.31 6654.82 72% 144.76 74.89 51%
> 1 512 2918.25 2735.26 93% 91.08 86.47 94%
> 2 512 8978.32 5107.95 56% 200.00 94.97 47%
> 4 512 8850.39 6864.37 77% 190.32 101.09 53%
> 8 512 9270.30 8483.01 91% 193.44 118.73 61%
> 1 1024 4416.10 3679.70 83% 135.54 110.63 81%
> 2 1024 9085.20 8770.48 96% 242.23 175.59 72%
> 4 1024 9158.57 9011.56 98% 234.39 159.17 67%
> 8 1024 9345.89 9067.43 97% 233.35 138.73 59%
> 1 2048 8455.19 6077.94 71% 338.52 190.16 56%
> 2 2048 9223.32 8237.73 89% 270.00 198.27 73%
> 4 2048 9080.75 9257.63 101% 261.30 172.80 66%
> 8 2048 9177.39 8977.10 97% 256.89 147.50 57%
> 1 4096 8665.35 8394.78 96% 289.63 289.85 100%
> 2 4096 7850.73 8857.86 112% 253.33 252.62 99%
> 4 4096 9332.55 8508.37 91% 289.19 151.29 52%
> 8 4096 8482.30 9146.80 107% 255.41 156.02 61%
> 1 16384 8825.72 8778.26 99% 314.60 308.89 98%
> 2 16384 9283.85 8927.40 96% 316.48 246.98 78%
> 4 16384 7766.95 8708.06 112% 265.25 155.59 58%
> 8 16384 8945.55 8940.23 99% 298.45 151.32 50%
> - TCP_RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 60848.70 81719.39 134% 2196.86 1551.05 70%
> 100 1 61886.19 81425.02 131% 2215.76 1517.52 68%
> 250 1 72058.41 162597.84 225% 2441.84 2278.14 93%
> 50 64 51646.93 74160.10 143% 1861.07 1322.22 71%
> 100 64 57574.86 83488.26 145% 2076.54 1479.79 71%
> 250 64 67583.35 138482.15 204% 2314.46 2022.83 87%
> 50 128 59931.51 71633.03 119% 2244.60 1309.18 58%
> 100 128 58329.80 73104.90 125% 2202.98 1329.52 60%
> 250 128 71021.55 161067.73 226% 2469.11 2205.28 89%
> 50 256 47509.24 64330.24 135% 1915.75 1269.90 66%
> 100 256 49293.03 68507.94 138% 1939.75 1263.64 65%
> 250 256 63169.07 138390.68 219% 2255.47 2098.13 93%
> - External Host to Guest TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 850.18 854.96 100% 56.94 58.25 102%
> 2 64 1659.12 1730.25 104% 81.65 67.57 82%
> 4 64 3254.70 3397.17 104% 118.57 76.21 64%
> 8 64 6251.97 6389.29 102% 207.68 104.21 50%
> 1 256 2029.14 2105.18 103% 116.45 119.69 102%
> 2 256 5412.02 4260.32 78% 240.87 139.73 58%
> 4 256 7777.28 8743.12 112% 263.20 174.65 66%
> 8 256 6459.51 9388.93 145% 218.94 158.37 72%
> 1 512 4566.31 4269.30 93% 274.74 289.83 105%
> 2 512 7444.52 8240.64 110% 286.24 243.74 85%
> 4 512 7722.29 9391.16 121% 261.96 180.36 68%
> 8 512 6228.50 9134.52 146% 209.17 161.00 76%
> 1 1024 4965.50 4953.68 99% 307.64 280.48 91%
> 2 1024 8270.08 7733.71 93% 288.32 197.04 68%
> 4 1024 7551.04 9394.58 124% 268.41 206.62 76%
> 8 1024 6307.78 9179.03 145% 216.67 159.63 73%
> 1 2048 5741.12 5948.80 103% 290.34 268.66 92%
> 2 2048 7932.79 8766.05 110% 262.96 215.90 82%
> 4 2048 6907.55 9255.97 133% 233.56 203.96 87%
> 8 2048 6037.22 9399.41 155% 197.14 164.09 83%
> 1 4096 7131.70 7535.10 105% 279.43 275.12 98%
> 2 4096 8109.17 9348.04 115% 274.29 211.49 77%
> 4 4096 6878.92 9319.13 135% 244.21 192.06 78%
> 8 4096 6265.92 9408.35 150% 211.85 159.26 75%
> 1 16384 8288.01 8596.39 103% 272.85 290.22 106%
> 2 16384 8166.29 9280.12 113% 277.04 236.61 85%
> 4 16384 6446.97 9382.22 145% 222.91 187.24 83%
> 8 16384 6066.98 9405.51 155% 198.98 157.09 78%
>
> 3) 2 vms each with 2 vcpus, 1q vs 2q - pin vhost/vcpu in the same node
>
> - 2 Guests to External Hosts TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 1442.07 1475.11 102% 30.82 31.21 101%
> 2 64 3124.87 2900.93 92% 40.29 35.95 89%
> 4 64 3166.52 2864.04 90% 40.70 35.47 87%
> 8 64 3141.45 2848.94 90% 40.38 35.34 87%
> 1 256 3628.54 3711.73 102% 68.47 70.22 102%
> 2 256 7806.95 7586.69 97% 111.23 84.38 75%
> 4 256 8823.65 7612.74 86% 132.92 85.04 63%
> 8 256 9194.89 9373.41 101% 135.98 119.62 87%
> 1 512 7106.67 7128.00 100% 124.79 124.30 99%
> 2 512 9190.22 9397.33 102% 180.84 149.34 82%
> 4 512 9401.01 9376.67 99% 173.00 140.15 81%
> 8 512 8572.84 9032.90 105% 150.49 127.58 84%
> 1 1024 9361.93 9379.24 100% 205.81 202.94 98%
> 2 1024 9386.69 9389.04 100% 201.78 165.75 82%
> 4 1024 9403.43 9378.54 99% 195.33 152.06 77%
> 8 1024 9213.63 9180.64 99% 178.99 141.51 79%
> 1 2048 9338.95 9384.67 100% 223.22 227.86 102%
> 2 2048 9389.28 9389.45 100% 202.37 170.08 84%
> 4 2048 9405.86 9388.71 99% 193.76 161.54 83%
> 8 2048 9352.40 9384.06 100% 189.16 157.06 83%
> 1 4096 9380.74 9384.90 100% 239.37 241.56 100%
> 2 4096 9393.47 9376.74 99% 213.84 195.61 91%
> 4 4096 9393.85 9381.50 99% 198.06 170.18 85%
> 8 4096 9400.41 9232.31 98% 192.87 163.56 84%
> 1 16384 9348.18 9335.55 99% 253.02 254.86 100%
> 2 16384 9384.97 9359.53 99% 218.56 208.59 95%
> 4 16384 9326.60 9382.15 100% 206.24 179.72 87%
> 8 16384 9355.82 9392.85 100% 198.22 172.89 87%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 200340.33 261750.19 130% 2935.27 3018.59 102%
> 100 1 236141.58 266304.49 112% 3452.16 3071.74 88%
> 250 1 361574.59 320825.08 88% 4972.98 3705.70 74%
> 50 64 225748.53 242671.12 107% 3011.48 2869.07 95%
> 100 64 249885.37 260453.72 104% 3240.21 3063.67 94%
> 250 64 360341.12 310775.60 86% 4682.42 3657.91 78%
> 50 128 227995.27 289320.38 126% 2950.92 3479.37 117%
> 100 128 239491.11 291135.77 121% 3099.55 3508.75 113%
> 250 128 390390.68 362484.35 92% 5042.30 4368.52 86%
> 50 256 222604.51 317140.97 142% 3058.08 3839.39 125%
> 100 256 254770.92 335606.03 131% 3326.16 4046.65 121%
> 250 256 400584.52 436749.22 109% 5220.79 5278.86 101%
> - External Host to 2 Guests
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 1667.99 1684.50 100% 59.66 60.77 101%
> 2 64 3338.83 3379.97 101% 83.61 64.82 77%
> 4 64 6613.65 6619.11 100% 131.00 97.19 74%
> 8 64 6553.07 6418.31 97% 141.35 98.27 69%
> 1 256 3938.40 4068.52 103% 125.21 123.76 98%
> 2 256 9215.57 9210.88 99% 185.31 154.27 83%
> 4 256 9407.29 9008.13 95% 186.72 150.01 80%
> 8 256 9377.17 9385.57 100% 190.28 137.59 72%
> 1 512 7360.19 6984.80 94% 214.09 211.66 98%
> 2 512 9392.91 9401.88 100% 193.92 173.11 89%
> 4 512 9382.64 9394.34 100% 189.27 145.80 77%
> 8 512 9308.60 9094.08 97% 189.70 141.26 74%
> 1 1024 9153.26 9066.06 99% 223.07 219.95 98%
> 2 1024 9393.38 9398.43 100% 194.02 173.82 89%
> 4 1024 9395.92 8960.73 95% 192.61 145.82 75%
> 8 1024 9388.92 9399.08 100% 191.18 143.87 75%
> 1 2048 9355.32 9240.63 98% 221.50 223.03 100%
> 2 2048 9395.68 9399.62 100% 193.31 177.21 91%
> 4 2048 9397.67 9399.56 100% 195.25 157.53 80%
> 8 2048 9397.89 9401.70 100% 197.57 146.96 74%
> 1 4096 9375.84 9381.72 100% 223.06 225.06 100%
> 2 4096 9389.47 9396.00 100% 193.91 197.13 101%
> 4 4096 9397.45 9400.11 100% 192.33 163.60 85%
> 8 4096 9105.40 9415.76 103% 192.71 140.41 72%
> 1 16384 9381.53 9381.40 99% 223.53 225.66 100%
> 2 16384 9387.90 9395.44 100% 193.34 177.03 91%
> 4 16384 9397.92 9410.98 100% 195.04 151.14 77%
> 8 16384 9259.00 9419.48 101% 194.91 153.48 78%
>
> 4) Local vm to vm 2 vcpu 1q vs 2q - pin vcpu/thread in the same numa node
>
> - VM to VM TCP STREAM
> sessions size throughput1 throughput2   norm1 norm2
> 1 64 576.05 576.14 100% 12.25 12.32 100%
> 2 64 1266.75 1160.04 91% 19.10 16.05 84%
> 4 64 1267.34 1123.70 88% 19.08 15.51 81%
> 8 64 1230.88 1174.70 95% 18.53 15.58 84%
> 1 256 1311.00 1303.02 99% 25.34 25.35 100%
> 2 256 5400.26 2794.00 51% 75.92 36.43 47%
> 4 256 5200.67 2818.88 54% 72.81 33.92 46%
> 8 256 5234.55 2893.74 55% 73.10 34.97 47%
> 1 512 3244.09 3263.72 100% 56.48 56.65 100%
> 2 512 8172.16 4661.15 57% 119.05 67.89 57%
> 4 512 10567.44 7063.25 66% 147.76 77.27 52%
> 8 512 10477.87 8471.33 80% 145.94 102.91 70%
> 1 1024 5432.54 5333.99 98% 93.69 92.38 98%
> 2 1024 12590.24 9259.97 73% 185.37 135.28 72%
> 4 1024 15600.53 10731.93 68% 222.20 123.60 55%
> 8 1024 16222.87 10704.85 65% 227.05 113.81 50%
> 1 2048 6667.61 7484.37 112% 116.75 129.72 111%
> 2 2048 8180.43 11500.88 140% 137.84 156.64 113%
> 4 2048 15127.93 14416.16 95% 227.60 154.59 67%
> 8 2048 16381.79 14794.10 90% 244.29 158.45 64%
> 1 4096 7375.63 8948.90 121% 131.97 156.57 118%
> 2 4096 9321.16 14443.21 154% 161.24 163.74 101%
> 4 4096 13028.45 15984.94 122% 212.78 171.26 80%
> 8 4096 15611.28 18810.54 120% 245.15 198.65 81%
> 1 16384 15304.38 14202.08 92% 259.94 244.04 93%
> 2 16384 15508.97 15913.09 102% 261.30 244.26 93%
> 4 16384 14859.98 20164.34 135% 248.29 214.26 86%
> 8 16384 15594.59 19960.99 127% 253.79 211.27 83%
> - TCP RR
> sessions size throughput1 throughput2   norm1 norm2
> 50 1 54972.51 69820.99 127% 1133.58 1063.58 93%
> 100 1 55847.16 72407.93 129% 1155.73 1024.35 88%
> 250 1 60066.23 108266.50 180% 1114.30 1323.55 118%
> 50 64 48727.63 62378.32 128% 1014.29 888.78 87%
> 100 64 51804.65 69250.51 133% 1077.78 986.97 91%
> 250 64 61278.68 100015.78 163% 1076.93 1243.18 115%
> 50 256 51593.29 62046.22 120% 1069.14 871.08 81%
> 100 256 51647.00 68197.43 132% 1071.66 958.51 89%
> 250 256 60433.88 99072.59 163% 1072.41 1199.10 111%
> 50 512 52177.79 66483.77 127% 1082.65 960.82 88%
> 100 512 50351.67 62537.63 124% 1041.61 876.41 84%
> 250 512 60510.14 103856.79 171% 1055.21 1245.17 118%
>
>
> Jason Wang (4):
>    virtio_ring: move queue_index to vring_virtqueue
>    virtio: introduce an API to set affinity for a virtqueue
>    virtio_net: multiqueue support
>    virtio_net: support negotiating the number of queues through ctrl vq
>
> Krishna Kumar (1):
>    virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
>
>   drivers/net/virtio_net.c      |  792 +++++++++++++++++++++++++++++------------
>   drivers/virtio/virtio_mmio.c  |    5 +-
>   drivers/virtio/virtio_pci.c   |   58 +++-
>   drivers/virtio/virtio_ring.c  |   17 +
>   include/linux/virtio.h        |    4 +
>   include/linux/virtio_config.h |   21 ++
>   include/linux/virtio_net.h    |   10 +
>   7 files changed, 677 insertions(+), 230 deletions(-)
>




* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-06 16:23     ` Rick Jones
@ 2012-07-09  3:23       ` Jason Wang
  2012-07-09 16:46         ` Rick Jones
  0 siblings, 1 reply; 46+ messages in thread
From: Jason Wang @ 2012-07-09  3:23 UTC (permalink / raw)
  To: Rick Jones
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/07/2012 12:23 AM, Rick Jones wrote:
> On 07/06/2012 12:42 AM, Jason Wang wrote:
>> I'm no expert on tcp, but the changes look reasonable:
>> - we can do the full-sized TSO check in tcp_tso_should_defer() only for
>> westwood, according to tcp westwood
>> - run tcp_tso_should_defer() for tso_segs = 1 when tso is enabled.
>
> I'm sure Eric and David will weigh-in on the TCP change.  My initial 
> inclination would have been to say "well, if multiqueue is draining 
> faster, that means ACKs come-back faster, which means the "race" 
> between more data being queued by netperf and ACKs will go more to the 
> ACKs which means the segments being sent will be smaller - as 
> TCP_NODELAY is not set, the Nagle algorithm is in force, which means 
> once there is data outstanding on the connection, no more will be sent 
> until either the outstanding data is ACKed, or there is an 
> accumulation of > MSS worth of data to send.
>
>>> Also, how are you combining the concurrent netperf results?  Are you
>>> taking sums of what netperf reports, or are you gathering statistics
>>> outside of netperf?
>>>
>>
>> The throughput was just summed from the netperf results, as the netperf
>> manual suggests. The cpu utilization was measured by mpstat.
>
> Which mechanism to address skew error?  The netperf manual describes 
> more than one:

This mechanism was missing from my tests; I will add it to my test scripts.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance 
>
>
> Personally, my preference these days is to use the "demo mode" method 
> of aggregate results as it can be rather faster than (ab)using the 
> confidence intervals mechanism, which I suspect may not really scale 
> all that well to large numbers of concurrent netperfs.

During my tests, the confidence interval was hard to achieve even in the 
RR test when I pinned vhost/vcpus to processors, so I didn't use it.
>
> I also tend to use the --enable-burst configure option to allow me to 
> minimize the number of concurrent netperfs in the first place.  Set 
> TCP_NODELAY (the test-specific -D option) and then have several 
> transactions outstanding at one time (test-specific -b option with a 
> number of additional in-flight transactions).
>
> This is expressed in the runemomniaggdemo.sh script:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniaggdemo.sh 
>
>
> which uses the find_max_burst.sh script:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/find_max_burst.sh
>
> to pick the burst size to use in the concurrent netperfs, the results 
> of which can be post-processed with:
>
> http://www.netperf.org/svn/netperf2/trunk/doc/examples/post_proc.py
>
> The nice feature of using the "demo mode" mechanism is when it is 
> coupled with systems with reasonably synchronized clocks (eg NTP) it 
> can be used for many-to-many testing in addition to one-to-many 
> testing (which cannot be dealt with by the confidence interval method 
> of dealing with skew error)
>

Yes, it looks like "demo mode" would be helpful. I will have a look at 
these scripts, thanks.
>>> A single instance TCP_RR test would help confirm/refute any
>>> non-trivial change in (effective) path length between the two cases.
>>>
>>
>> Yes, I would test this thanks.
>
> Excellent.
>
> happy benchmarking,
>
> rick jones
>



* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-08  8:19 ` Ronen Hod
@ 2012-07-09  5:35   ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-09  5:35 UTC (permalink / raw)
  To: Ronen Hod
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/08/2012 04:19 PM, Ronen Hod wrote:
> On 07/05/2012 01:29 PM, Jason Wang wrote:
>> Hello All:
>>
>> This series is an updated version of the multiqueue virtio-net driver
>> based on Krishna Kumar's work, letting virtio-net use multiple rx/tx
>> queues for packet reception and transmission. Please review and comment.
>>
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directed connected 82599
>>
>> Test Summary:
>>
>> - Highlights: huge improvements on TCP_RR test
>
> Hi Jason,
>
> It might be that the good TCP_RR results are due to the large number 
> of sessions (50-250). Can you test it also with a small number of 
> sessions?

Sure, I will test that.
>
>> - Lowlights: regression on small packet transmission, higher cpu 
>> utilization
>>               than single queue, need further optimization
>>
>> Analysis of the performance result:
>>
>> - I counted the number of packets sent/received during the test, and
>>    multiqueue shows much more capability in terms of packets per second.
>>
>> - For the tx regression, multiqueue sends about 1-2 times more packets
>>    than single queue does, and the packets were much smaller. I suspect
>>    tcp does less batching with multiqueue, so I hacked tcp_write_xmit()
>>    to force more batching; with that, multiqueue works as well as
>>    single queue for both small-packet transmission and throughput.
>
> Could it be that since the CPUs are not busy they are available for 
> immediate handling of the packets (little batching)? In such a 
> scenario the CPU utilization is not really interesting. What will 
> happen on a busy machine?
>

The regression happens when testing guest transmission in the stream 
test; the cpu utilization is 100% in that situation.
> Ronen.
>
>>
>> - I didn't include the accelerated RFS support for virtio-net in this
>>    series as it still needs further shaping; for those interested,
>>    please see:
>>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>>
>> Changes from V4:
>> - Add ability to negotiate the number of queues through control 
>> virtqueue
>> - Ethtool -{L|l} support and default the tx/rx queue number to 1
>> - Expose the API to set irq affinity instead of irq itself
>>
>> Changes from V3:
>>
>> - Rebase to the net-next
>> - Let queue 2 be the control virtqueue to obey the spec
>> - Provides irq affinity
>> - Choose txq based on processor id
>>
>> References:
>>
>> - V4: https://lkml.org/lkml/2012/6/25/120
>> - V3: http://lwn.net/Articles/467283/
>>
>> Test result:
>>
>> 1) 1 vm 2 vcpu 1q vs 2q, 1 - 1q, 2 - 2q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 650.55 655.61 100% 24.88 24.86 99%
>> 2 64 1446.81 1309.44 90% 30.49 27.16 89%
>> 4 64 1430.52 1305.59 91% 30.78 26.80 87%
>> 8 64 1450.89 1270.82 87% 30.83 25.95 84%
>> 1 256 1699.45 1779.58 104% 56.75 59.08 104%
>> 2 256 4902.71 3446.59 70% 98.53 62.78 63%
>> 4 256 4803.76 2980.76 62% 97.44 54.68 56%
>> 8 256 5128.88 3158.74 61% 104.68 58.61 55%
>> 1 512 2837.98 2838.42 100% 89.76 90.41 100%
>> 2 512 6742.59 5495.83 81% 155.03 99.07 63%
>> 4 512 9193.70 5900.17 64% 202.84 106.44 52%
>> 8 512 9287.51 7107.79 76% 202.18 129.08 63%
>> 1 1024 4166.42 4224.98 101% 128.55 129.86 101%
>> 2 1024 6196.94 7823.08 126% 181.80 168.81 92%
>> 4 1024 9113.62 9219.49 101% 235.15 190.93 81%
>> 8 1024 9324.25 9402.66 100% 239.10 179.99 75%
>> 1 2048 7441.63 6534.04 87% 248.01 215.63 86%
>> 2 2048 7024.61 7414.90 105% 225.79 219.62 97%
>> 4 2048 8971.49 9269.00 103% 278.94 220.84 79%
>> 8 2048 9314.20 9359.96 100% 268.36 192.23 71%
>> 1 4096 8282.60 8990.08 108% 277.45 320.05 115%
>> 2 4096 9194.80 9293.78 101% 317.02 248.76 78%
>> 4 4096 9340.73 9313.19 99% 300.34 230.35 76%
>> 8 4096 9148.23 9347.95 102% 279.49 199.43 71%
>> 1 16384 8787.89 8766.31 99% 312.38 316.53 101%
>> 2 16384 9306.35 9156.14 98% 319.53 279.83 87%
>> 4 16384 9177.81 9307.50 101% 312.69 230.07 73%
>> 8 16384 9035.82 9188.00 101% 298.32 199.17 66%
>> - TCP RR
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 50 1 54695.41 84164.98 153% 1957.33 1901.31 97%
>> 100 1 60141.88 88598.94 147% 2157.90 2000.45 92%
>> 250 1 74763.56 135584.22 181% 2541.94 2628.59 103%
>> 50 64 51628.38 82867.50 160% 1872.55 1812.16 96%
>> 100 64 60367.73 84080.60 139% 2215.69 1867.69 84%
>> 250 64 68502.70 124910.59 182% 2321.43 2495.76 107%
>> 50 128 53477.08 77625.07 145% 1905.10 1870.99 98%
>> 100 128 59697.56 74902.37 125% 2230.66 1751.03 78%
>> 250 128 71248.74 133963.55 188% 2453.12 2711.72 110%
>> 50 256 47663.86 67742.63 142% 1880.45 1735.30 92%
>> 100 256 54051.84 68738.57 127% 2123.03 1778.59 83%
>> 250 256 68250.06 124487.90 182% 2321.89 2598.60 111%
>> - External Host to Guest TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 847.71 864.83 102% 57.99 57.93 99%
>> 2 64 1690.82 1544.94 91% 80.13 55.09 68%
>> 4 64 3434.98 3455.53 100% 127.17 89.00 69%
>> 8 64 5890.19 6557.35 111% 194.70 146.52 75%
>> 1 256 2094.04 2109.14 100% 130.73 127.14 97%
>> 2 256 5218.13 3731.97 71% 219.15 114.02 52%
>> 4 256 6734.51 9213.47 136% 227.87 208.31 91%
>> 8 256 6452.86 9402.78 145% 224.83 207.77 92%
>> 1 512 3945.07 4203.68 106% 279.72 273.30 97%
>> 2 512 7878.96 8122.55 103% 278.25 231.71 83%
>> 4 512 7645.89 9402.13 122% 252.10 217.42 86%
>> 8 512 6657.06 9403.71 141% 239.81 214.89 89%
>> 1 1024 5729.06 5111.21 89% 289.38 303.09 104%
>> 2 1024 8097.27 8159.67 100% 269.29 242.97 90%
>> 4 1024 7778.93 8919.02 114% 261.28 205.50 78%
>> 8 1024 6458.02 9360.02 144% 221.26 208.09 94%
>> 1 2048 6426.94 5195.59 80% 292.52 307.47 105%
>> 2 2048 8221.90 9025.66 109% 283.80 242.25 85%
>> 4 2048 7364.72 8527.79 115% 248.10 198.36 79%
>> 8 2048 6760.63 9161.07 135% 230.53 205.12 88%
>> 1 4096 7247.02 6874.21 94% 276.23 287.68 104%
>> 2 4096 8346.04 8818.65 105% 281.49 254.81 90%
>> 4 4096 6710.00 9354.59 139% 216.41 210.13 97%
>> 8 4096 6265.69 9406.87 150% 206.69 210.92 102%
>> 1 16384 8159.50 8048.79 98% 266.94 283.11 106%
>> 2 16384 8525.66 8552.41 100% 294.36 239.27 81%
>> 4 16384 6042.24 8447.86 139% 200.21 196.40 98%
>> 8 16384 6432.63 9403.49 146% 211.48 206.13 97%
>>
>> 2) 1 vm 4 vcpu 1q vs 4q, 1 - 1q, 2 - 4q, no pinning
>>
>> - Guest to External Host TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 636.93 657.69 103% 23.55 24.42 103%
>> 2 64 1457.46 1268.78 87% 30.97 26.02 84%
>> 4 64 3062.86 2302.43 75% 41.00 29.64 72%
>> 8 64 3107.68 2308.32 74% 41.62 29.07 69%
>> 1 256 1743.50 1750.11 100% 59.00 56.63 95%
>> 2 256 4582.61 2870.31 62% 92.47 51.97 56%
>> 4 256 8440.96 4795.37 56% 135.10 56.39 41%
>> 8 256 9240.31 6654.82 72% 144.76 74.89 51%
>> 1 512 2918.25 2735.26 93% 91.08 86.47 94%
>> 2 512 8978.32 5107.95 56% 200.00 94.97 47%
>> 4 512 8850.39 6864.37 77% 190.32 101.09 53%
>> 8 512 9270.30 8483.01 91% 193.44 118.73 61%
>> 1 1024 4416.10 3679.70 83% 135.54 110.63 81%
>> 2 1024 9085.20 8770.48 96% 242.23 175.59 72%
>> 4 1024 9158.57 9011.56 98% 234.39 159.17 67%
>> 8 1024 9345.89 9067.43 97% 233.35 138.73 59%
>> 1 2048 8455.19 6077.94 71% 338.52 190.16 56%
>> 2 2048 9223.32 8237.73 89% 270.00 198.27 73%
>> 4 2048 9080.75 9257.63 101% 261.30 172.80 66%
>> 8 2048 9177.39 8977.10 97% 256.89 147.50 57%
>> 1 4096 8665.35 8394.78 96% 289.63 289.85 100%
>> 2 4096 7850.73 8857.86 112% 253.33 252.62 99%
>> 4 4096 9332.55 8508.37 91% 289.19 151.29 52%
>> 8 4096 8482.30 9146.80 107% 255.41 156.02 61%
>> 1 16384 8825.72 8778.26 99% 314.60 308.89 98%
>> 2 16384 9283.85 8927.40 96% 316.48 246.98 78%
>> 4 16384 7766.95 8708.06 112% 265.25 155.59 58%
>> 8 16384 8945.55 8940.23 99% 298.45 151.32 50%
>> - TCP_RR
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 50 1 60848.70 81719.39 134% 2196.86 1551.05 70%
>> 100 1 61886.19 81425.02 131% 2215.76 1517.52 68%
>> 250 1 72058.41 162597.84 225% 2441.84 2278.14 93%
>> 50 64 51646.93 74160.10 143% 1861.07 1322.22 71%
>> 100 64 57574.86 83488.26 145% 2076.54 1479.79 71%
>> 250 64 67583.35 138482.15 204% 2314.46 2022.83 87%
>> 50 128 59931.51 71633.03 119% 2244.60 1309.18 58%
>> 100 128 58329.80 73104.90 125% 2202.98 1329.52 60%
>> 250 128 71021.55 161067.73 226% 2469.11 2205.28 89%
>> 50 256 47509.24 64330.24 135% 1915.75 1269.90 66%
>> 100 256 49293.03 68507.94 138% 1939.75 1263.64 65%
>> 250 256 63169.07 138390.68 219% 2255.47 2098.13 93%
>> - External Host to Guest TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 850.18 854.96 100% 56.94 58.25 102%
>> 2 64 1659.12 1730.25 104% 81.65 67.57 82%
>> 4 64 3254.70 3397.17 104% 118.57 76.21 64%
>> 8 64 6251.97 6389.29 102% 207.68 104.21 50%
>> 1 256 2029.14 2105.18 103% 116.45 119.69 102%
>> 2 256 5412.02 4260.32 78% 240.87 139.73 58%
>> 4 256 7777.28 8743.12 112% 263.20 174.65 66%
>> 8 256 6459.51 9388.93 145% 218.94 158.37 72%
>> 1 512 4566.31 4269.30 93% 274.74 289.83 105%
>> 2 512 7444.52 8240.64 110% 286.24 243.74 85%
>> 4 512 7722.29 9391.16 121% 261.96 180.36 68%
>> 8 512 6228.50 9134.52 146% 209.17 161.00 76%
>> 1 1024 4965.50 4953.68 99% 307.64 280.48 91%
>> 2 1024 8270.08 7733.71 93% 288.32 197.04 68%
>> 4 1024 7551.04 9394.58 124% 268.41 206.62 76%
>> 8 1024 6307.78 9179.03 145% 216.67 159.63 73%
>> 1 2048 5741.12 5948.80 103% 290.34 268.66 92%
>> 2 2048 7932.79 8766.05 110% 262.96 215.90 82%
>> 4 2048 6907.55 9255.97 133% 233.56 203.96 87%
>> 8 2048 6037.22 9399.41 155% 197.14 164.09 83%
>> 1 4096 7131.70 7535.10 105% 279.43 275.12 98%
>> 2 4096 8109.17 9348.04 115% 274.29 211.49 77%
>> 4 4096 6878.92 9319.13 135% 244.21 192.06 78%
>> 8 4096 6265.92 9408.35 150% 211.85 159.26 75%
>> 1 16384 8288.01 8596.39 103% 272.85 290.22 106%
>> 2 16384 8166.29 9280.12 113% 277.04 236.61 85%
>> 4 16384 6446.97 9382.22 145% 222.91 187.24 83%
>> 8 16384 6066.98 9405.51 155% 198.98 157.09 78%
>>
>> 3) 2 vms each with 2 vcpus, 1q vs 2q - pin vhost/vcpu in the same node
>>
>> - 2 Guests to External Hosts TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 1442.07 1475.11 102% 30.82 31.21 101%
>> 2 64 3124.87 2900.93 92% 40.29 35.95 89%
>> 4 64 3166.52 2864.04 90% 40.70 35.47 87%
>> 8 64 3141.45 2848.94 90% 40.38 35.34 87%
>> 1 256 3628.54 3711.73 102% 68.47 70.22 102%
>> 2 256 7806.95 7586.69 97% 111.23 84.38 75%
>> 4 256 8823.65 7612.74 86% 132.92 85.04 63%
>> 8 256 9194.89 9373.41 101% 135.98 119.62 87%
>> 1 512 7106.67 7128.00 100% 124.79 124.30 99%
>> 2 512 9190.22 9397.33 102% 180.84 149.34 82%
>> 4 512 9401.01 9376.67 99% 173.00 140.15 81%
>> 8 512 8572.84 9032.90 105% 150.49 127.58 84%
>> 1 1024 9361.93 9379.24 100% 205.81 202.94 98%
>> 2 1024 9386.69 9389.04 100% 201.78 165.75 82%
>> 4 1024 9403.43 9378.54 99% 195.33 152.06 77%
>> 8 1024 9213.63 9180.64 99% 178.99 141.51 79%
>> 1 2048 9338.95 9384.67 100% 223.22 227.86 102%
>> 2 2048 9389.28 9389.45 100% 202.37 170.08 84%
>> 4 2048 9405.86 9388.71 99% 193.76 161.54 83%
>> 8 2048 9352.40 9384.06 100% 189.16 157.06 83%
>> 1 4096 9380.74 9384.90 100% 239.37 241.56 100%
>> 2 4096 9393.47 9376.74 99% 213.84 195.61 91%
>> 4 4096 9393.85 9381.50 99% 198.06 170.18 85%
>> 8 4096 9400.41 9232.31 98% 192.87 163.56 84%
>> 1 16384 9348.18 9335.55 99% 253.02 254.86 100%
>> 2 16384 9384.97 9359.53 99% 218.56 208.59 95%
>> 4 16384 9326.60 9382.15 100% 206.24 179.72 87%
>> 8 16384 9355.82 9392.85 100% 198.22 172.89 87%
>> - TCP RR
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 50 1 200340.33 261750.19 130% 2935.27 3018.59 102%
>> 100 1 236141.58 266304.49 112% 3452.16 3071.74 88%
>> 250 1 361574.59 320825.08 88% 4972.98 3705.70 74%
>> 50 64 225748.53 242671.12 107% 3011.48 2869.07 95%
>> 100 64 249885.37 260453.72 104% 3240.21 3063.67 94%
>> 250 64 360341.12 310775.60 86% 4682.42 3657.91 78%
>> 50 128 227995.27 289320.38 126% 2950.92 3479.37 117%
>> 100 128 239491.11 291135.77 121% 3099.55 3508.75 113%
>> 250 128 390390.68 362484.35 92% 5042.30 4368.52 86%
>> 50 256 222604.51 317140.97 142% 3058.08 3839.39 125%
>> 100 256 254770.92 335606.03 131% 3326.16 4046.65 121%
>> 250 256 400584.52 436749.22 109% 5220.79 5278.86 101%
>> - External Host to 2 Guests
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 1667.99 1684.50 100% 59.66 60.77 101%
>> 2 64 3338.83 3379.97 101% 83.61 64.82 77%
>> 4 64 6613.65 6619.11 100% 131.00 97.19 74%
>> 8 64 6553.07 6418.31 97% 141.35 98.27 69%
>> 1 256 3938.40 4068.52 103% 125.21 123.76 98%
>> 2 256 9215.57 9210.88 99% 185.31 154.27 83%
>> 4 256 9407.29 9008.13 95% 186.72 150.01 80%
>> 8 256 9377.17 9385.57 100% 190.28 137.59 72%
>> 1 512 7360.19 6984.80 94% 214.09 211.66 98%
>> 2 512 9392.91 9401.88 100% 193.92 173.11 89%
>> 4 512 9382.64 9394.34 100% 189.27 145.80 77%
>> 8 512 9308.60 9094.08 97% 189.70 141.26 74%
>> 1 1024 9153.26 9066.06 99% 223.07 219.95 98%
>> 2 1024 9393.38 9398.43 100% 194.02 173.82 89%
>> 4 1024 9395.92 8960.73 95% 192.61 145.82 75%
>> 8 1024 9388.92 9399.08 100% 191.18 143.87 75%
>> 1 2048 9355.32 9240.63 98% 221.50 223.03 100%
>> 2 2048 9395.68 9399.62 100% 193.31 177.21 91%
>> 4 2048 9397.67 9399.56 100% 195.25 157.53 80%
>> 8 2048 9397.89 9401.70 100% 197.57 146.96 74%
>> 1 4096 9375.84 9381.72 100% 223.06 225.06 100%
>> 2 4096 9389.47 9396.00 100% 193.91 197.13 101%
>> 4 4096 9397.45 9400.11 100% 192.33 163.60 85%
>> 8 4096 9105.40 9415.76 103% 192.71 140.41 72%
>> 1 16384 9381.53 9381.40 99% 223.53 225.66 100%
>> 2 16384 9387.90 9395.44 100% 193.34 177.03 91%
>> 4 16384 9397.92 9410.98 100% 195.04 151.14 77%
>> 8 16384 9259.00 9419.48 101% 194.91 153.48 78%
>>
>> 4) Local vm to vm 2 vcpu 1q vs 2q - pin vcpu/thread in the same numa 
>> node
>>
>> - VM to VM TCP STREAM
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 1 64 576.05 576.14 100% 12.25 12.32 100%
>> 2 64 1266.75 1160.04 91% 19.10 16.05 84%
>> 4 64 1267.34 1123.70 88% 19.08 15.51 81%
>> 8 64 1230.88 1174.70 95% 18.53 15.58 84%
>> 1 256 1311.00 1303.02 99% 25.34 25.35 100%
>> 2 256 5400.26 2794.00 51% 75.92 36.43 47%
>> 4 256 5200.67 2818.88 54% 72.81 33.92 46%
>> 8 256 5234.55 2893.74 55% 73.10 34.97 47%
>> 1 512 3244.09 3263.72 100% 56.48 56.65 100%
>> 2 512 8172.16 4661.15 57% 119.05 67.89 57%
>> 4 512 10567.44 7063.25 66% 147.76 77.27 52%
>> 8 512 10477.87 8471.33 80% 145.94 102.91 70%
>> 1 1024 5432.54 5333.99 98% 93.69 92.38 98%
>> 2 1024 12590.24 9259.97 73% 185.37 135.28 72%
>> 4 1024 15600.53 10731.93 68% 222.20 123.60 55%
>> 8 1024 16222.87 10704.85 65% 227.05 113.81 50%
>> 1 2048 6667.61 7484.37 112% 116.75 129.72 111%
>> 2 2048 8180.43 11500.88 140% 137.84 156.64 113%
>> 4 2048 15127.93 14416.16 95% 227.60 154.59 67%
>> 8 2048 16381.79 14794.10 90% 244.29 158.45 64%
>> 1 4096 7375.63 8948.90 121% 131.97 156.57 118%
>> 2 4096 9321.16 14443.21 154% 161.24 163.74 101%
>> 4 4096 13028.45 15984.94 122% 212.78 171.26 80%
>> 8 4096 15611.28 18810.54 120% 245.15 198.65 81%
>> 1 16384 15304.38 14202.08 92% 259.94 244.04 93%
>> 2 16384 15508.97 15913.09 102% 261.30 244.26 93%
>> 4 16384 14859.98 20164.34 135% 248.29 214.26 86%
>> 8 16384 15594.59 19960.99 127% 253.79 211.27 83%
>> - TCP RR
>> sessions size throughput1 throughput2 % norm1 norm2 %
>> 50 1 54972.51 69820.99 127% 1133.58 1063.58 93%
>> 100 1 55847.16 72407.93 129% 1155.73 1024.35 88%
>> 250 1 60066.23 108266.50 180% 1114.30 1323.55 118%
>> 50 64 48727.63 62378.32 128% 1014.29 888.78 87%
>> 100 64 51804.65 69250.51 133% 1077.78 986.97 91%
>> 250 64 61278.68 100015.78 163% 1076.93 1243.18 115%
>> 50 256 51593.29 62046.22 120% 1069.14 871.08 81%
>> 100 256 51647.00 68197.43 132% 1071.66 958.51 89%
>> 250 256 60433.88 99072.59 163% 1072.41 1199.10 111%
>> 50 512 52177.79 66483.77 127% 1082.65 960.82 88%
>> 100 512 50351.67 62537.63 124% 1041.61 876.41 84%
>> 250 512 60510.14 103856.79 171% 1055.21 1245.17 118%
>>
>>
>> Jason Wang (4):
>>    virtio_ring: move queue_index to vring_virtqueue
>>    virtio: introduce an API to set affinity for a virtqueue
>>    virtio_net: multiqueue support
>>    virtio_net: support negotiating the number of queues through ctrl vq
>>
>> Krishna Kumar (1):
>>    virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE
>>
>>   drivers/net/virtio_net.c      |  792 
>> +++++++++++++++++++++++++++++------------
>>   drivers/virtio/virtio_mmio.c  |    5 +-
>>   drivers/virtio/virtio_pci.c   |   58 +++-
>>   drivers/virtio/virtio_ring.c  |   17 +
>>   include/linux/virtio.h        |    4 +
>>   include/linux/virtio_config.h |   21 ++
>>   include/linux/virtio_net.h    |   10 +
>>   7 files changed, 677 insertions(+), 230 deletions(-)
>>
>
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 0/5] Multiqueue virtio-net
  2012-07-09  3:23       ` Jason Wang
@ 2012-07-09 16:46         ` Rick Jones
  0 siblings, 0 replies; 46+ messages in thread
From: Rick Jones @ 2012-07-09 16:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/08/2012 08:23 PM, Jason Wang wrote:
> On 07/07/2012 12:23 AM, Rick Jones wrote:
>> On 07/06/2012 12:42 AM, Jason Wang wrote:
>> Which mechanism to address skew error?  The netperf manual describes
>> more than one:
>
> These mechanisms were missing from my test; I will add them to my test 
> scripts.
>>
>> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>>
>>
>> Personally, my preference these days is to use the "demo mode" method
>> of aggregate results as it can be rather faster than (ab)using the
>> confidence intervals mechanism, which I suspect may not really scale
>> all that well to large numbers of concurrent netperfs.
>
> During my test, the confidence interval was hard to achieve even in the
> RR test when I pinned vhost/vcpus to processors, so I didn't use it.

When running aggregate netperfs, *something* has to be done to address 
the prospect of skew error.  Otherwise the results are suspect.

happy benchmarking,

rick jones

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
  2012-07-05 12:51   ` Sasha Levin
@ 2012-07-09 20:13   ` Ben Hutchings
  2012-07-20 12:33   ` Michael S. Tsirkin
  2 siblings, 0 replies; 46+ messages in thread
From: Ben Hutchings @ 2012-07-09 20:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
> This patch lets the virtio_net driver negotiate the number of queues it
> wishes to use through the control virtqueue, and exports an ethtool
> interface to let users tweak it.
> 
> As the current multiqueue virtio-net implementation has optimizations
> for per-cpu virtqueues, only two modes are supported:
> 
> - single queue pair mode
> - multiple queue pairs mode, where the number of queue pairs matches the
>   number of vcpus
> 
> Single queue mode is currently used by default due to a regression of
> multiqueue mode in some tests (especially the stream test).
> 
> Since the virtio core does not support partially deleting virtqueues,
> during mode switching all the virtqueues are deleted and the driver
> re-creates the virtqueues it will use.
> 
> btw. The queue number negotiation is deferred to .ndo_open() because
> only after feature negotiation can we send commands on the control
> virtqueue (as it may also use event index).
[...]
> +static int virtnet_set_channels(struct net_device *dev,
> +				struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	u16 queues = channels->rx_count;
> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +
> +	if (channels->rx_count != channels->tx_count)
> +		return -EINVAL;
[...]
> +static void virtnet_get_channels(struct net_device *dev,
> +				 struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	channels->max_rx = vi->total_queue_pairs;
> +	channels->max_tx = vi->total_queue_pairs;
> +	channels->max_other = 0;
> +	channels->max_combined = 0;
> +	channels->rx_count = vi->num_queue_pairs;
> +	channels->tx_count = vi->num_queue_pairs;
> +	channels->other_count = 0;
> +	channels->combined_count = 0;
> +}
[...]

It looks like the queue-pairs should be treated as 'combined channels',
not separate RX and TX channels.  Also you don't need to clear the other
members; you can assume that the ethtool core will zero-initialise
structures for 'get' operations.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
  2012-07-05 12:51   ` Sasha Levin
  2012-07-09 20:13   ` Ben Hutchings
@ 2012-07-20 12:33   ` Michael S. Tsirkin
  2012-07-23  5:32     ` Jason Wang
  2 siblings, 1 reply; 46+ messages in thread
From: Michael S. Tsirkin @ 2012-07-20 12:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote:
> This patch lets the virtio_net driver negotiate the number of queues it
> wishes to use through the control virtqueue, and exports an ethtool
> interface to let users tweak it.
> 
> As the current multiqueue virtio-net implementation has optimizations
> for per-cpu virtqueues, only two modes are supported:
> 
> - single queue pair mode
> - multiple queue pairs mode, where the number of queue pairs matches the
>   number of vcpus
> 
> Single queue mode is currently used by default due to a regression of
> multiqueue mode in some tests (especially the stream test).
> 
> Since the virtio core does not support partially deleting virtqueues,
> during mode switching all the virtqueues are deleted and the driver
> re-creates the virtqueues it will use.
> 
> btw. The queue number negotiation is deferred to .ndo_open() because
> only after feature negotiation can we send commands on the control
> virtqueue (as it may also use event index).
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
>  include/linux/virtio_net.h |    7 ++
>  2 files changed, 142 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7410187..3339eeb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -88,6 +88,7 @@ struct receive_queue {
>  
>  struct virtnet_info {
>  	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +	u16 total_queue_pairs;
>  
>  	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>  	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> @@ -137,6 +138,8 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static const struct ethtool_ops virtnet_ethtool_ops;
> +
>  static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>  {
>  	int ret = virtqueue_get_queue_index(vq);
> @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
>  }
>  #endif
>  
> -static int virtnet_open(struct net_device *dev)
> -{
> -	struct virtnet_info *vi = netdev_priv(dev);
> -	int i;
> -
> -	for (i = 0; i < vi->num_queue_pairs; i++) {
> -		/* Make sure we have some buffers: if oom use wq. */
> -		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> -			queue_delayed_work(system_nrt_wq,
> -					   &vi->rq[i]->refill, 0);
> -		virtnet_napi_enable(vi->rq[i]);
> -	}
> -
> -	return 0;
> -}
> -
>  /*
>   * Send command via the control virtqueue and check status.  Commands
>   * supported by the hypervisor, as indicated by feature bits, should
> @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  	rtnl_unlock();
>  }
>  
> +static int virtnet_set_queues(struct virtnet_info *vi)
> +{
> +	struct scatterlist sg;
> +	struct net_device *dev = vi->dev;
> +	sg_init_one(&sg, &vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
> +
> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
> +				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM, &sg, 1, 0)){
> +		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
> +			 " %d\n", vi->num_queue_pairs);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_open(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
> +
> +	virtnet_set_queues(vi);
> +
> +	return 0;
> +}
> +
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>  
>  }
>  
> -static const struct ethtool_ops virtnet_ethtool_ops = {
> -	.get_drvinfo = virtnet_get_drvinfo,
> -	.get_link = ethtool_op_get_link,
> -	.get_ringparam = virtnet_get_ringparam,
> -};
> -
>  #define MIN_MTU 68
>  #define MAX_MTU 65535
>  
> @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  
>  err:
>  	if (ret && names)
> -		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +		for (i = 0; i < total_vqs * 2; i++)
>  			kfree(names[i]);
>  
>  	kfree(names);
> @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>  		vi->has_cvq = true;
>  
> +	/* Use single tx/rx queue pair as default */
> +	vi->num_queue_pairs = 1;
> +	vi->total_queue_pairs = num_queue_pairs;
> +
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = virtnet_setup_vqs(vi);
>  	if (err)
> @@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>  		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
> +	netif_set_real_num_tx_queues(dev, 1);
> +	netif_set_real_num_rx_queues(dev, 1);
> +
>  	err = register_netdev(dev);
>  	if (err) {
>  		pr_debug("virtio_net: registering device failed\n");
> @@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	for (i = 0; i < num_queue_pairs; i++) {
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>  		try_fill_recv(vi->rq[i], GFP_KERNEL);
>  
>  		/* If we didn't even get one input buffer, we're useless. */
> @@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  	free_netdev(vi->dev);
>  }
>  
> -#ifdef CONFIG_PM
> -static int virtnet_freeze(struct virtio_device *vdev)
> +static void virtnet_stop(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int i;
>  
>  	/* Prevent config work handler from accessing the device */
> @@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  		for (i = 0; i < vi->num_queue_pairs; i++)
>  			napi_disable(&vi->rq[i]->napi);
>  
> -
> -	remove_vq_common(vi);
> -
> -	flush_work(&vi->config_work);
> -
> -	return 0;
>  }
>  
> -static int virtnet_restore(struct virtio_device *vdev)
> +static int virtnet_start(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int err, i;
>  
>  	err = virtnet_setup_vqs(vi);
> @@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
>  
>  	return 0;
>  }
> +
> +#ifdef CONFIG_PM
> +static int virtnet_freeze(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_stop(vi);
> +
> +	remove_vq_common(vi);
> +
> +	flush_work(&vi->config_work);
> +
> +	return 0;
> +}
> +
> +static int virtnet_restore(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_start(vi);
> +
> +	return 0;
> +}
>  #endif
>  
>  static struct virtio_device_id id_table[] = {
> @@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> +static int virtnet_set_channels(struct net_device *dev,
> +				struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	u16 queues = channels->rx_count;
> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +
> +	if (channels->rx_count != channels->tx_count)
> +		return -EINVAL;
> +	/* Only two modes were support currently */

s/were/are/ ?

> +	if (queues != vi->total_queue_pairs && queues != 1)
> +		return -EINVAL;

So userspace has to get the queue number exactly right. How does it know
what the valid values are?

> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	virtnet_stop(vi);
> +
> +	netif_set_real_num_tx_queues(dev, queues);
> +	netif_set_real_num_rx_queues(dev, queues);
> +
> +	remove_vq_common(vi);
> +	flush_work(&vi->config_work);
> +
> +	vi->num_queue_pairs = queues;
> +	virtnet_start(vi);
> +
> +	vi->vdev->config->finalize_features(vi->vdev);
> +
> +	if (virtnet_set_queues(vi))
> +		status |= VIRTIO_CONFIG_S_FAILED;
> +	else
> +		status |= VIRTIO_CONFIG_S_DRIVER_OK;
> +
> +	vi->vdev->config->set_status(vi->vdev, status);
> +

Why do we need to tweak status like that?
Can we maybe just roll changes back on error?

> +	return 0;
> +}
> +
> +static void virtnet_get_channels(struct net_device *dev,
> +				 struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	channels->max_rx = vi->total_queue_pairs;
> +	channels->max_tx = vi->total_queue_pairs;
> +	channels->max_other = 0;
> +	channels->max_combined = 0;
> +	channels->rx_count = vi->num_queue_pairs;
> +	channels->tx_count = vi->num_queue_pairs;
> +	channels->other_count = 0;
> +	channels->combined_count = 0;
> +}
> +
> +static const struct ethtool_ops virtnet_ethtool_ops = {
> +	.get_drvinfo = virtnet_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_ringparam = virtnet_get_ringparam,
> +	.set_channels = virtnet_set_channels,
> +	.get_channels = virtnet_get_channels,
> +};
> +
>  static int __init init(void)
>  {
>  	return register_virtio_driver(&virtio_net_driver);
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 60f09ff..0d21e08 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
>  #define VIRTIO_NET_CTRL_ANNOUNCE       3
>   #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
>  
> +/*
> + * Control multiqueue
> + *
> + */
> +#define VIRTIO_NET_CTRL_MULTIQUEUE       4
> + #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
> +
>  #endif /* _LINUX_VIRTIO_NET_H */
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-05 10:29 ` [net-next RFC V5 4/5] virtio_net: multiqueue support Jason Wang
  2012-07-05 20:02   ` Amos Kong
@ 2012-07-20 13:40   ` Michael S. Tsirkin
  2012-07-21 12:02     ` Sasha Levin
  2012-07-23  5:48     ` Jason Wang
  1 sibling, 2 replies; 46+ messages in thread
From: Michael S. Tsirkin @ 2012-07-20 13:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
> This patch converts virtio_net to a multiqueue device. After negotiating the
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has multiple tx/rx queue
> pairs, and the driver can read the number from config space.
> 
> The driver expects the number of rx/tx queue pairs to equal the number of
> vcpus. To maximize performance with these per-cpu rx/tx queue pairs, some
> optimizations were introduced:
> 
> - Txq selection is based on the processor id, to avoid contending a lock
>   whose owner may exit to the host.
> - Since the rxq/txq are per-cpu, affinity hints are set to the cpu that owns
>   the queue pair.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Overall fine. I think it is best to squash the following patch
into this one, so that the default behavior does not
jump to multiqueue and then back. Some comments below: mostly nits, plus one minor bug.

If you are worried the patch is too big, it can be split
differently
	- rework to use send_queue/receive_queue structures, no
	  functional changes.
	- add multiqueue

but this is not a must.

> ---
>  drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>  include/linux/virtio_net.h |    2 +
>  2 files changed, 452 insertions(+), 195 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 1db445b..7410187 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
> +#include <linux/interrupt.h>
>  
>  static int napi_weight = 128;
>  module_param(napi_weight, int, 0444);
> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>  #define VIRTNET_SEND_COMMAND_SG_MAX    2
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
> +#define MAX_QUEUES 256
> +
>  struct virtnet_stats {
>  	struct u64_stats_sync tx_syncp;
>  	struct u64_stats_sync rx_syncp;

It would be a bit better not to have an artificial limit like this.
Maybe allocate the arrays at probe time, so we can
take whatever the device gives us?

> @@ -51,43 +54,69 @@ struct virtnet_stats {
>  	u64 rx_packets;
>  };
>  
> -struct virtnet_info {
> -	struct virtio_device *vdev;
> -	struct virtqueue *rvq, *svq, *cvq;
> -	struct net_device *dev;
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> +	/* Virtqueue associated with this send_queue */
> +	struct virtqueue *vq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> +	/* Virtqueue associated with this receive_queue */
> +	struct virtqueue *vq;
> +
> +	/* Back pointer to the virtnet_info */
> +	struct virtnet_info *vi;
> +
>  	struct napi_struct napi;
> -	unsigned int status;
>  
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
>  
> +	/* Work struct for refilling if we run low on memory. */
> +	struct delayed_work refill;
> +
> +	/* Chain pages by the private ptr. */
> +	struct page *pages;
> +
> +	/* RX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +struct virtnet_info {
> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +
> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;

The assumption is that a tx/rx pair is handled on the same cpu, yes?
If so, maybe make it a single array of pairs to improve cache locality
a bit?
	struct queue_pair {
		struct send_queue sq;
		struct receive_queue rq;
	};

> +	struct virtqueue *cvq;
> +
> +	struct virtio_device *vdev;
> +	struct net_device *dev;
> +	unsigned int status;
> +
>  	/* I like... big packets and I cannot lie! */
>  	bool big_packets;
>  
>  	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>  	bool mergeable_rx_bufs;
>  
> +	/* Has control virtqueue */
> +	bool has_cvq;
> +

Won't checking (cvq != NULL) be enough?

>  	/* enable config space updates */
>  	bool config_enable;
>  
>  	/* Active statistics */
>  	struct virtnet_stats __percpu *stats;
>  
> -	/* Work struct for refilling if we run low on memory. */
> -	struct delayed_work refill;
> -
>  	/* Work struct for config space updates */
>  	struct work_struct config_work;
>  
>  	/* Lock for config space updates */
>  	struct mutex config_lock;
> -
> -	/* Chain pages by the private ptr. */
> -	struct page *pages;
> -
> -	/* fragments + linear part + virtio header */
> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>  };
>  
>  struct skb_vnet_hdr {
> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	int ret = virtqueue_get_queue_index(vq);
> +
> +	/* skip ctrl vq */
> +	if (vi->has_cvq)
> +		return (ret - 1) / 2;
> +	else
> +		return ret / 2;
> +}
> +
> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	return virtqueue_get_queue_index(vq) / 2;
> +}
> +
>  static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>  {
>  	return (struct skb_vnet_hdr *)skb->cb;
> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>   * private is used to chain pages for big packets, put the whole
>   * most recent used list in the beginning for reuse
>   */
> -static void give_pages(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct receive_queue *rq, struct page *page)
>  {
>  	struct page *end;
>  
>  	/* Find end of list, sew whole thing into vi->pages. */
>  	for (end = page; end->private; end = (struct page *)end->private);
> -	end->private = (unsigned long)vi->pages;
> -	vi->pages = page;
> +	end->private = (unsigned long)rq->pages;
> +	rq->pages = page;
>  }
>  
> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  {
> -	struct page *p = vi->pages;
> +	struct page *p = rq->pages;
>  
>  	if (p) {
> -		vi->pages = (struct page *)p->private;
> +		rq->pages = (struct page *)p->private;
>  		/* clear private here, it is used to chain pages */
>  		p->private = 0;
>  	} else
> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>  	return p;
>  }
>  
> -static void skb_xmit_done(struct virtqueue *svq)
> +static void skb_xmit_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = svq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
>  
>  	/* Suppress further interrupts. */
> -	virtqueue_disable_cb(svq);
> +	virtqueue_disable_cb(vq);
>  
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>  }
>  
>  static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>  }
>  
>  /* Called from bottom half context */
> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>  				   struct page *page, unsigned int len)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	unsigned int copy, hdr_len, offset;
> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	}
>  
>  	if (page)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return skb;
>  }
>  
> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	struct page *page;
> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  			skb->dev->stats.rx_length_errors++;
>  			return -EINVAL;
>  		}
> -		page = virtqueue_get_buf(vi->rvq, &len);
> +		page = virtqueue_get_buf(rq->vq, &len);
>  		if (!page) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 skb->dev->name, hdr->mhdr.num_buffers);
> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  		set_skb_frag(skb, page, 0, &len);
>  
> -		--vi->num;
> +		--rq->num;
>  	}
>  	return 0;
>  }
>  
> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  {
> +	struct net_device *dev = rq->vi->dev;
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	struct sk_buff *skb;
> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
>  		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> +			give_pages(rq, buf);
>  		else
>  			dev_kfree_skb(buf);
>  		return;
> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		skb_trim(skb, len);
>  	} else {
>  		page = buf;
> -		skb = page_to_skb(vi, page, len);
> +		skb = page_to_skb(rq, page, len);
>  		if (unlikely(!skb)) {
>  			dev->stats.rx_dropped++;
> -			give_pages(vi, page);
> +			give_pages(rq, page);
>  			return;
>  		}
>  		if (vi->mergeable_rx_bufs)
> -			if (receive_mergeable(vi, skb)) {
> +			if (receive_mergeable(rq, skb)) {
>  				dev_kfree_skb(skb);
>  				return;
>  			}
> @@ -363,90 +410,91 @@ frame_err:
>  	dev_kfree_skb(skb);
>  }
>  
> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	int err;
>  
> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>  	if (unlikely(!skb))
>  		return -ENOMEM;
>  
>  	skb_put(skb, MAX_PACKET_LEN);
>  
>  	hdr = skb_vnet_hdr(skb);
> -	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
> +	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
> +
> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>  
> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>  	if (err < 0)
>  		dev_kfree_skb(skb);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *first, *list = NULL;
>  	char *p;
>  	int i, err, offset;
>  
> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>  	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
> -		first = get_a_page(vi, gfp);
> +		first = get_a_page(rq, gfp);
>  		if (!first) {
>  			if (list)
> -				give_pages(vi, list);
> +				give_pages(rq, list);
>  			return -ENOMEM;
>  		}
> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>  
>  		/* chain new page in list head to match sg */
>  		first->private = (unsigned long)list;
>  		list = first;
>  	}
>  
> -	first = get_a_page(vi, gfp);
> +	first = get_a_page(rq, gfp);
>  	if (!first) {
> -		give_pages(vi, list);
> +		give_pages(rq, list);
>  		return -ENOMEM;
>  	}
>  	p = page_address(first);
>  
> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
> +	/* rq->sg[0], rq->sg[1] share the same page */
> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>  
> -	/* vi->rx_sg[1] for data packet, from offset */
> +	/* rq->sg[1] for data packet, from offset */
>  	offset = sizeof(struct padded_vnet_hdr);
> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>  
>  	/* chain first in list head */
>  	first->private = (unsigned long)list;
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>  				first, gfp);
>  	if (err < 0)
> -		give_pages(vi, first);
> +		give_pages(rq, first);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *page;
>  	int err;
>  
> -	page = get_a_page(vi, gfp);
> +	page = get_a_page(rq, gfp);
>  	if (!page)
>  		return -ENOMEM;
>  
> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>  	if (err < 0)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return err;
>  }
> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>   * before we're receiving packets, or from refill_work which is
>   * careful to disable receiving (using napi_disable).
>   */
> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	int err;
>  	bool oom;
>  
>  	do {
>  		if (vi->mergeable_rx_bufs)
> -			err = add_recvbuf_mergeable(vi, gfp);
> +			err = add_recvbuf_mergeable(rq, gfp);
>  		else if (vi->big_packets)
> -			err = add_recvbuf_big(vi, gfp);
> +			err = add_recvbuf_big(rq, gfp);
>  		else
> -			err = add_recvbuf_small(vi, gfp);
> +			err = add_recvbuf_small(rq, gfp);
>  
>  		oom = err == -ENOMEM;
>  		if (err < 0)
>  			break;
> -		++vi->num;
> +		++rq->num;
>  	} while (err > 0);
> -	if (unlikely(vi->num > vi->max))
> -		vi->max = vi->num;
> -	virtqueue_kick(vi->rvq);
> +	if (unlikely(rq->num > rq->max))
> +		rq->max = rq->num;
> +	virtqueue_kick(rq->vq);
>  	return !oom;
>  }
>  
> -static void skb_recv_done(struct virtqueue *rvq)
> +static void skb_recv_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = rvq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
> +	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
> +
>  	/* Schedule NAPI, Suppress further interrupts if successful. */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(rvq);
> -		__napi_schedule(&vi->napi);
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
>  	}
>  }
>  
> -static void virtnet_napi_enable(struct virtnet_info *vi)
> +static void virtnet_napi_enable(struct receive_queue *rq)
>  {
> -	napi_enable(&vi->napi);
> +	napi_enable(&rq->napi);
>  
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
>  	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>  	 * We synchronize against interrupts via NAPI_STATE_SCHED */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(vi->rvq);
> +	if (napi_schedule_prep(&rq->napi)) {
> +		virtqueue_disable_cb(rq->vq);
>  		local_bh_disable();
> -		__napi_schedule(&vi->napi);
> +		__napi_schedule(&rq->napi);
>  		local_bh_enable();
>  	}
>  }
>  
>  static void refill_work(struct work_struct *work)
>  {
> -	struct virtnet_info *vi;
> +	struct napi_struct *napi;
> +	struct receive_queue *rq;
>  	bool still_empty;
>  
> -	vi = container_of(work, struct virtnet_info, refill.work);
> -	napi_disable(&vi->napi);
> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
> -	virtnet_napi_enable(vi);
> +	rq = container_of(work, struct receive_queue, refill.work);
> +	napi = &rq->napi;
> +
> +	napi_disable(napi);
> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
> +	virtnet_napi_enable(rq);
>  
>  	/* In theory, this can happen: if we don't get any buffers in
>  	 * we will *never* try to fill again. */
>  	if (still_empty)
> -		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
> +		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
>  }
>  
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
> +						napi);
>  	void *buf;
>  	unsigned int len, received = 0;
>  
>  again:
>  	while (received < budget &&
> -	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
> -		receive_buf(vi->dev, buf, len);
> -		--vi->num;
> +	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
> +		receive_buf(rq, buf, len);
> +		--rq->num;
>  		received++;
>  	}
>  
> -	if (vi->num < vi->max / 2) {
> -		if (!try_fill_recv(vi, GFP_ATOMIC))
> -			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	if (rq->num < rq->max / 2) {
> +		if (!try_fill_recv(rq, GFP_ATOMIC))
> +			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
>  	}
>  
>  	/* Out of packets? */
>  	if (received < budget) {
>  		napi_complete(napi);
> -		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
> +		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
>  		    napi_schedule_prep(napi)) {
> -			virtqueue_disable_cb(vi->rvq);
> +			virtqueue_disable_cb(rq->vq);
>  			__napi_schedule(napi);
>  			goto again;
>  		}
> @@ -557,13 +612,14 @@ again:
>  	return received;
>  }
>  
> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
> +				       struct virtqueue *vq)
>  {
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  
>  		u64_stats_update_begin(&stats->tx_syncp);
> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	return tot_sgs;
>  }
>  
> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
> +		    struct virtqueue *vq, struct scatterlist *sg)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  	/* Encode metadata header at front. */
>  	if (vi->mergeable_rx_bufs)
> -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> +		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
>  	else
> -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
> +		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
>  
> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>  				 0, skb, GFP_ATOMIC);
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int qnum = skb_get_queue_mapping(skb);
> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>  	int capacity;
>  
>  	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> +	free_old_xmit_skbs(vi, vq);
>  
>  	/* Try to transmit */
> -	capacity = xmit_skb(vi, skb);
> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>  
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
>  		if (likely(capacity == -ENOMEM)) {
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "TX queue failure: out of memory\n");
> +					"TXQ (%d) failure: out of memory\n",
> +					qnum);
>  		} else {
>  			dev->stats.tx_fifo_errors++;
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "Unexpected TX queue failure: %d\n",
> -					 capacity);
> +					"Unexpected TXQ (%d) failure: %d\n",
> +					qnum, capacity);
>  		}
>  		dev->stats.tx_dropped++;
>  		kfree_skb(skb);
>  		return NETDEV_TX_OK;
>  	}
> -	virtqueue_kick(vi->svq);
> +	virtqueue_kick(vq);
>  
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> +		netif_stop_subqueue(dev, qnum);
> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>  			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> +			capacity += free_old_xmit_skbs(vi, vq);
>  			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				virtqueue_disable_cb(vi->svq);
> +				netif_start_subqueue(dev, qnum);
> +				virtqueue_disable_cb(vq);
>  			}
>  		}
>  	}
> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  	unsigned int start;
>  
>  	for_each_possible_cpu(cpu) {
> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
> +		struct virtnet_stats __percpu *stats
> +			= per_cpu_ptr(vi->stats, cpu);
>  		u64 tpackets, tbytes, rpackets, rbytes;
>  
>  		do {
> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  static void virtnet_netpoll(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	napi_schedule(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		napi_schedule(&vi->rq[i]->napi);
>  }
>  #endif
>  
>  static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	/* Make sure we have some buffers: if oom use wq. */
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
>  
> -	virtnet_napi_enable(vi);
>  	return 0;
>  }
>  
> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
>  	/* Make sure refill_work doesn't re-enable napi! */
> -	cancel_delayed_work_sync(&vi->refill);
> -	napi_disable(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +		napi_disable(&vi->rq[i]->napi);
> +	}
>  
>  	return 0;
>  }
> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>  	ring->rx_pending = ring->rx_max_pending;
>  	ring->tx_pending = ring->tx_max_pending;
> -
>  }
>  
>  
> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>  	return 0;
>  }
>  
> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
> + * txq based on the processor id.
> + */
> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
> +		  smp_processor_id();
> +
> +	while (unlikely(txq >= dev->real_num_tx_queues))
> +		txq -= dev->real_num_tx_queues;
> +	return txq;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>  	.ndo_open            = virtnet_open,
>  	.ndo_stop   	     = virtnet_close,
> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>  	.ndo_get_stats64     = virtnet_stats,
>  	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>  	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> +	.ndo_select_queue     = virtnet_select_queue,
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  	.ndo_poll_controller = virtnet_netpoll,
>  #endif
> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>  
>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>  		netif_carrier_on(vi->dev);
> -		netif_wake_queue(vi->dev);
> +		netif_tx_wake_all_queues(vi->dev);
>  	} else {
>  		netif_carrier_off(vi->dev);
> -		netif_stop_queue(vi->dev);
> +		netif_tx_stop_all_queues(vi->dev);
>  	}
>  done:
>  	mutex_unlock(&vi->config_lock);
> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>  	queue_work(system_nrt_wq, &vi->config_work);
>  }
>  
> -static int init_vqs(struct virtnet_info *vi)
> +static void free_receive_bufs(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		while (vi->rq[i]->pages)
> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
> +	}
> +}
> +
> +/* Free memory allocated for send and receive queues */
> +static void virtnet_free_queues(struct virtnet_info *vi)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
> -	const char *names[] = { "input", "output", "control" };
> -	int nvqs, err;
> +	int i;
>  
> -	/* We expect two virtqueues, receive then send,
> -	 * and optionally control. */
> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		kfree(vi->rq[i]);
> +		vi->rq[i] = NULL;
> +		kfree(vi->sq[i]);
> +		vi->sq[i] = NULL;
> +	}
> +}
>  
> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
> -	if (err)
> -		return err;
> +static void free_unused_bufs(struct virtnet_info *vi)
> +{
> +	void *buf;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->sq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
> +			dev_kfree_skb(buf);
> +	}
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->rq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> +			if (vi->mergeable_rx_bufs || vi->big_packets)
> +				give_pages(vi->rq[i], buf);
> +			else
> +				dev_kfree_skb(buf);
> +			--vi->rq[i]->num;
> +		}
> +		BUG_ON(vi->rq[i]->num != 0);
> +	}
> +}
> +
> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
> +{
> +	int i;
> +
> +	if (vi->num_queue_pairs == 1)
> +		return;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		int cpu = set ? i : -1;
> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
> +	}
> +	return;
> +}
> +
> +static void virtnet_del_vqs(struct virtnet_info *vi)
> +{
> +	struct virtio_device *vdev = vi->vdev;
> +
> +	virtnet_set_affinity(vi, false);
> +
> +	vdev->config->del_vqs(vdev);
> +
> +	virtnet_free_queues(vi);
> +}
> +
> +static int virtnet_find_vqs(struct virtnet_info *vi)
> +{
> +	vq_callback_t **callbacks;
> +	struct virtqueue **vqs;
> +	int ret = -ENOMEM;
> +	int i, total_vqs;
> +	char **names;
>  
> -	vi->rvq = vqs[0];
> -	vi->svq = vqs[1];
> +	/*
> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
> +	 * a possible control virtqueue, followed by the same
> +	 * 'vi->num_queue_pairs-1' more times
> +	 */
> +	total_vqs = vi->num_queue_pairs * 2 +
> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);

So this needs to be kzalloc; otherwise, on error, the cleanup path will
see uninitialized data and crash?

> +	if (!vqs || !callbacks || !names)
> +		goto err;
> +
> +	/* Parameters for control virtqueue, if any */
> +	if (vi->has_cvq) {
> +		callbacks[2] = NULL;
> +		names[2] = "control";
> +	}
> +
> +	/* Allocate/initialize parameters for send/receive virtqueues */
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = (i == 0 ? i : i + vi->has_cvq);
> +		callbacks[j] = skb_recv_done;
> +		callbacks[j + 1] = skb_xmit_done;
> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);

This needs wrappers, e.g. virtnet_rx_vq(int queue_pair) and
virtnet_tx_vq(int queue_pair). Then you would just scan from 0 to
num_queue_pairs, with i as the queue pair number.

> +	}
>  
> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
> +					 (const char **)names);
> +	if (ret)
> +		goto err;
> +
> +	if (vi->has_cvq)
>  		vi->cvq = vqs[2];
>  
> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = i == 0 ? i : i + vi->has_cvq;
> +		vi->rq[i / 2]->vq = vqs[j];
> +		vi->sq[i / 2]->vq = vqs[j + 1];

Same wrappers would apply here.

>  	}
> -	return 0;
> +
> +err:
> +	if (ret && names)

If we get here, ret != 0 already. For names, just add another label;
don't complicate the cleanup.

> +		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +			kfree(names[i]);
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
> +
> +	return ret;
> +}
> +
> +static int virtnet_alloc_queues(struct virtnet_info *vi)
> +{
> +	int ret = -ENOMEM;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> +		if (!vi->rq[i] || !vi->sq[i])
> +			goto err;
> +	}
> +
> +	ret = 0;
> +
> +	/* setup initial receive and send queue parameters */
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i]->vi = vi;
> +		vi->rq[i]->pages = NULL;
> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
> +		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
> +			       napi_weight);
> +
> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
> +	}
> +

Add a return 0 here; then the ret = 0 above and the if (ret) below are
not needed.


> +err:
> +	if (ret)
> +		virtnet_free_queues(vi);
> +
> +	return ret;
> +}
> +
> +static int virtnet_setup_vqs(struct virtnet_info *vi)
> +{
> +	int ret;
> +
> +	/* Allocate send & receive queues */
> +	ret = virtnet_alloc_queues(vi);
> +	if (!ret) {
> +		ret = virtnet_find_vqs(vi);
> +		if (ret)
> +			virtnet_free_queues(vi);
> +		else
> +			virtnet_set_affinity(vi, true);
> +	}
> +
> +	return ret;

Add some labels for error handling; this if nesting is messy.

>  }
>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */

typo

> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;
> +
> +	num_queue_pairs = num_queues / 2;
>  
>  	/* Allocate ourselves a network device with room for our info */
> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>  	if (!dev)
>  		return -ENOMEM;
>  
> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
> -	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;
> -	vi->pages = NULL;
>  	vi->stats = alloc_percpu(struct virtnet_stats);
>  	err = -ENOMEM;
>  	if (vi->stats == NULL)
> -		goto free;
> +		goto free_netdev;
>  
> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
> +	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>  		vi->mergeable_rx_bufs = true;
>  
> -	err = init_vqs(vi);
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> +		vi->has_cvq = true;
> +

How about we disable multiqueue if there's no cvq?
Will make logic a bit simpler, won't it?

> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_netdev;
> +
> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
>  	err = register_netdev(dev);
>  	if (err) {
> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	try_fill_recv(vi, GFP_KERNEL);
> -
> -	/* If we didn't even get one input buffer, we're useless. */
> -	if (vi->num == 0) {
> -		err = -ENOMEM;
> -		goto unregister;
> +	for (i = 0; i < num_queue_pairs; i++) {
> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
> +
> +		/* If we didn't even get one input buffer, we're useless. */
> +		if (vi->rq[i]->num == 0) {
> +			free_unused_bufs(vi);
> +			err = -ENOMEM;
> +			goto free_recv_bufs;
> +		}
>  	}
>  
>  	/* Assume link up if device can't report link status,
> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		netif_carrier_on(dev);
>  	}
>  
> -	pr_debug("virtnet: registered device %s\n", dev->name);
> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
> +		 dev->name, num_queue_pairs);
> +
>  	return 0;
>  
> -unregister:
> +free_recv_bufs:
> +	free_receive_bufs(vi);
>  	unregister_netdev(dev);
> +
>  free_vqs:
> -	vdev->config->del_vqs(vdev);
> -free_stats:
> -	free_percpu(vi->stats);
> -free:
> +	for (i = 0; i < num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +	virtnet_del_vqs(vi);
> +
> +free_netdev:
>  	free_netdev(dev);
>  	return err;
>  }
>  
> -static void free_unused_bufs(struct virtnet_info *vi)
> -{
> -	void *buf;
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->svq);
> -		if (!buf)
> -			break;
> -		dev_kfree_skb(buf);
> -	}
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->rvq);
> -		if (!buf)
> -			break;
> -		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> -		else
> -			dev_kfree_skb(buf);
> -		--vi->num;
> -	}
> -	BUG_ON(vi->num != 0);
> -}
> -
>  static void remove_vq_common(struct virtnet_info *vi)
>  {
>  	vi->vdev->config->reset(vi->vdev);
> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>  	/* Free unused buffers in both send and recv, if any. */
>  	free_unused_bufs(vi);
>  
> -	vi->vdev->config->del_vqs(vi->vdev);
> +	free_receive_bufs(vi);
>  
> -	while (vi->pages)
> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
> +	virtnet_del_vqs(vi);
>  }
>  
>  static void __devexit virtnet_remove(struct virtio_device *vdev)
> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  static int virtnet_freeze(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> +	int i;
>  
>  	/* Prevent config work handler from accessing the device */
>  	mutex_lock(&vi->config_lock);
> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  	mutex_unlock(&vi->config_lock);
>  
>  	netif_device_detach(vi->dev);
> -	cancel_delayed_work_sync(&vi->refill);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>  
>  	if (netif_running(vi->dev))
> -		napi_disable(&vi->napi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			napi_disable(&vi->rq[i]->napi);
> +
>  
>  	remove_vq_common(vi);
>  
> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  static int virtnet_restore(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> -	int err;
> +	int err, i;
>  
> -	err = init_vqs(vi);
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
>  		return err;
>  
>  	if (netif_running(vi->dev))
> -		virtnet_napi_enable(vi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			virtnet_napi_enable(vi->rq[i]);
>  
>  	netif_device_attach(vi->dev);
>  
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
>  
>  	mutex_lock(&vi->config_lock);
>  	vi->config_enable = true;
> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>  	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>  	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>  };
>  
>  static struct virtio_driver virtio_net_driver = {
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 1bc7e30..60f09ff 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -61,6 +61,8 @@ struct virtio_net_config {
>  	__u8 mac[6];
>  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>  	__u16 status;
> +	/* Total number of RX/TX queues */
> +	__u16 num_queues;
>  } __attribute__((packed));
>  
>  /* This is the first element of the scatter-gather list.  If you don't
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-20 13:40   ` Michael S. Tsirkin
@ 2012-07-21 12:02     ` Sasha Levin
  2012-07-23  5:54       ` Jason Wang
  2012-07-29  9:44       ` Michael S. Tsirkin
  2012-07-23  5:48     ` Jason Wang
  1 sibling, 2 replies; 46+ messages in thread
From: Sasha Levin @ 2012-07-21 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>> -	err = init_vqs(vi);
>> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> > +		vi->has_cvq = true;
>> > +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

Multiqueue doesn't really depend on cvq. Does this added complexity really justify adding an artificial limit?




* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
  2012-07-20 12:33   ` Michael S. Tsirkin
@ 2012-07-23  5:32     ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-23  5:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/20/2012 08:33 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote:
>> This patch lets the virtio_net driver negotiate the number of queues it
>> wishes to use through the control virtqueue, and exports an ethtool
>> interface to let users tweak it.
>>
>> As the current multiqueue virtio-net implementation has optimizations for
>> per-cpu virtqueues, only two modes are supported:
>>
>> - single queue pair mode
>> - multiple queue pairs mode, where the number of queue pairs matches the
>>   number of vcpus
>>
>> Single queue mode is used by default currently due to regressions of
>> multiqueue mode in some tests (especially stream tests).
>>
>> Since the virtio core does not support partially deleting virtqueues,
>> during a mode switch all virtqueues are deleted and the driver re-creates
>> the ones it will use.
>>
>> btw. Queue number negotiation is deferred to .ndo_open(), because only
>> after feature negotiation can we send the command on the control virtqueue
>> (as it may also use the event index).
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
>>   include/linux/virtio_net.h |    7 ++
>>   2 files changed, 142 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 7410187..3339eeb 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -88,6 +88,7 @@ struct receive_queue {
>>
>>   struct virtnet_info {
>>   	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
>> +	u16 total_queue_pairs;
>>
>>   	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>>   	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>> @@ -137,6 +138,8 @@ struct padded_vnet_hdr {
>>   	char padding[6];
>>   };
>>
>> +static const struct ethtool_ops virtnet_ethtool_ops;
>> +
>>   static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>>   {
>>   	int ret = virtqueue_get_queue_index(vq);
>> @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
>>   }
>>   #endif
>>
>> -static int virtnet_open(struct net_device *dev)
>> -{
>> -	struct virtnet_info *vi = netdev_priv(dev);
>> -	int i;
>> -
>> -	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> -		/* Make sure we have some buffers: if oom use wq. */
>> -		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> -			queue_delayed_work(system_nrt_wq,
>> -					&vi->rq[i]->refill, 0);
>> -		virtnet_napi_enable(vi->rq[i]);
>> -	}
>> -
>> -	return 0;
>> -}
>> -
>>   /*
>>    * Send command via the control virtqueue and check status.  Commands
>>    * supported by the hypervisor, as indicated by feature bits, should
>> @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>>   	rtnl_unlock();
>>   }
>>
>> +static int virtnet_set_queues(struct virtnet_info *vi)
>> +{
>> +	struct scatterlist sg;
>> +	struct net_device *dev = vi->dev;
>> +	sg_init_one(&sg,&vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
>> +
>> +	if (!vi->has_cvq)
>> +		return -EINVAL;
>> +
>> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
>> +				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM,&sg, 1, 0)){
>> +		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
>> +			 " %d\n", vi->num_queue_pairs);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int virtnet_open(struct net_device *dev)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>> +
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>> +		/* Make sure we have some buffers: if oom use wq. */
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>> +		virtnet_napi_enable(vi->rq[i]);
>> +	}
>> +
>> +	virtnet_set_queues(vi);
>> +
>> +	return 0;
>> +}
>> +
>>   static int virtnet_close(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>>
>>   }
>>
>> -static const struct ethtool_ops virtnet_ethtool_ops = {
>> -	.get_drvinfo = virtnet_get_drvinfo,
>> -	.get_link = ethtool_op_get_link,
>> -	.get_ringparam = virtnet_get_ringparam,
>> -};
>> -
>>   #define MIN_MTU 68
>>   #define MAX_MTU 65535
>>
>> @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>>
>>   err:
>>   	if (ret&&  names)
>> -		for (i = 0; i<  vi->num_queue_pairs * 2; i++)
>> +		for (i = 0; i<  total_vqs * 2; i++)
>>   			kfree(names[i]);
>>
>>   	kfree(names);
>> @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	mutex_init(&vi->config_lock);
>>   	vi->config_enable = true;
>>   	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>> -	vi->num_queue_pairs = num_queue_pairs;
>>
>>   	/* If we can receive ANY GSO packets, we must allocate large ones. */
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>   		vi->has_cvq = true;
>>
>> +	/* Use single tx/rx queue pair as default */
>> +	vi->num_queue_pairs = 1;
>> +	vi->total_queue_pairs = num_queue_pairs;
>> +
>>   	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>>   	err = virtnet_setup_vqs(vi);
>>   	if (err)
>> @@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>>   		dev->features |= NETIF_F_HW_VLAN_FILTER;
>>
>> +	netif_set_real_num_tx_queues(dev, 1);
>> +	netif_set_real_num_rx_queues(dev, 1);
>> +
>>   	err = register_netdev(dev);
>>   	if (err) {
>>   		pr_debug("virtio_net: registering device failed\n");
>> @@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	}
>>
>>   	/* Last of all, set up some receive buffers. */
>> -	for (i = 0; i<  num_queue_pairs; i++) {
>> +	for (i = 0; i<  vi->num_queue_pairs; i++) {
>>   		try_fill_recv(vi->rq[i], GFP_KERNEL);
>>
>>   		/* If we didn't even get one input buffer, we're useless. */
>> @@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>>   	free_netdev(vi->dev);
>>   }
>>
>> -#ifdef CONFIG_PM
>> -static int virtnet_freeze(struct virtio_device *vdev)
>> +static void virtnet_stop(struct virtnet_info *vi)
>>   {
>> -	struct virtnet_info *vi = vdev->priv;
>>   	int i;
>>
>>   	/* Prevent config work handler from accessing the device */
>> @@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   		for (i = 0; i<  vi->num_queue_pairs; i++)
>>   			napi_disable(&vi->rq[i]->napi);
>>
>> -
>> -	remove_vq_common(vi);
>> -
>> -	flush_work(&vi->config_work);
>> -
>> -	return 0;
>>   }
>>
>> -static int virtnet_restore(struct virtio_device *vdev)
>> +static int virtnet_start(struct virtnet_info *vi)
>>   {
>> -	struct virtnet_info *vi = vdev->priv;
>>   	int err, i;
>>
>>   	err = virtnet_setup_vqs(vi);
>> @@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
>>
>>   	return 0;
>>   }
>> +
>> +#ifdef CONFIG_PM
>> +static int virtnet_freeze(struct virtio_device *vdev)
>> +{
>> +	struct virtnet_info *vi = vdev->priv;
>> +
>> +	virtnet_stop(vi);
>> +
>> +	remove_vq_common(vi);
>> +
>> +	flush_work(&vi->config_work);
>> +
>> +	return 0;
>> +}
>> +
>> +static int virtnet_restore(struct virtio_device *vdev)
>> +{
>> +	struct virtnet_info *vi = vdev->priv;
>> +
>> +	virtnet_start(vi);
>> +
>> +	return 0;
>> +}
>>   #endif
>>
>>   static struct virtio_device_id id_table[] = {
>> @@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
>>   #endif
>>   };
>>
>> +static int virtnet_set_channels(struct net_device *dev,
>> +				struct ethtool_channels *channels)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +	u16 queues = channels->rx_count;
>> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
>> +
>> +	if (channels->rx_count != channels->tx_count)
>> +		return -EINVAL;
>> +	/* Only two modes were support currently */
> s/were/are/ ?

Ok.
>
>> +	if (queues != vi->total_queue_pairs&&  queues != 1)
>> +		return -EINVAL;
> So userspace has to get the queue number right. How does it know
> what the valid values are?

Userspace can query the number through ethtool -l (virtnet_get_channels()).
>
>> +	if (!vi->has_cvq)
>> +		return -EINVAL;
>> +
>> +	virtnet_stop(vi);
>> +
>> +	netif_set_real_num_tx_queues(dev, queues);
>> +	netif_set_real_num_rx_queues(dev, queues);
>> +
>> +	remove_vq_common(vi);
>> +	flush_work(&vi->config_work);
>> +
>> +	vi->num_queue_pairs = queues;
>> +	virtnet_start(vi);
>> +
>> +	vi->vdev->config->finalize_features(vi->vdev);
>> +
>> +	if (virtnet_set_queues(vi))
>> +		status |= VIRTIO_CONFIG_S_FAILED;
>> +	else
>> +		status |= VIRTIO_CONFIG_S_DRIVER_OK;
>> +
>> +	vi->vdev->config->set_status(vi->vdev, status);
>> +
> Why do we need to tweak status like that?

Because remove_vq_common() resets the device. Since the virtio core API 
does not support removing a specific number of virtqueues, this is the 
only method when we change the number of queues.
> Can we maybe just roll changes back on error?

Not easy: we reset the device, destroy the previous virtqueues and create new ones.
>> +	return 0;
>> +}
>> +
>> +static void virtnet_get_channels(struct net_device *dev,
>> +				 struct ethtool_channels *channels)
>> +{
>> +	struct virtnet_info *vi = netdev_priv(dev);
>> +
>> +	channels->max_rx = vi->total_queue_pairs;
>> +	channels->max_tx = vi->total_queue_pairs;
>> +	channels->max_other = 0;
>> +	channels->max_combined = 0;
>> +	channels->rx_count = vi->num_queue_pairs;
>> +	channels->tx_count = vi->num_queue_pairs;
>> +	channels->other_count = 0;
>> +	channels->combined_count = 0;
>> +}
>> +
>> +static const struct ethtool_ops virtnet_ethtool_ops = {
>> +	.get_drvinfo = virtnet_get_drvinfo,
>> +	.get_link = ethtool_op_get_link,
>> +	.get_ringparam = virtnet_get_ringparam,
>> +	.set_channels = virtnet_set_channels,
>> +	.get_channels = virtnet_get_channels,
>> +};
>> +
>>   static int __init init(void)
>>   {
>>   	return register_virtio_driver(&virtio_net_driver);
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 60f09ff..0d21e08 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
>>   #define VIRTIO_NET_CTRL_ANNOUNCE       3
>>    #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
>>
>> +/*
>> + * Control multiqueue
>> + *
>> + */
>> +#define VIRTIO_NET_CTRL_MULTIQUEUE       4
>> + #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
>> +
>>   #endif /* _LINUX_VIRTIO_NET_H */
>> -- 
>> 1.7.1



* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-20 13:40   ` Michael S. Tsirkin
  2012-07-21 12:02     ` Sasha Levin
@ 2012-07-23  5:48     ` Jason Wang
  2012-07-29  9:50       ` Michael S. Tsirkin
  1 sibling, 1 reply; 46+ messages in thread
From: Jason Wang @ 2012-07-23  5:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/20/2012 09:40 PM, Michael S. Tsirkin wrote:
> On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
>> This patch converts virtio_net to a multiqueue device. After negotiating the
>> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has multiple tx/rx queue
>> pairs, and the driver can read the number from config space.
>>
>> The driver expects the number of rx/tx queue pairs to equal the number of
>> vcpus. To maximize performance with these per-cpu rx/tx queue pairs, some
>> optimizations were introduced:
>>
>> - Txq selection is based on the processor id, in order to avoid contending a
>>    lock whose owner may exit to the host.
>> - Since the rxq/txq pairs are per-cpu, affinity hints are set to the cpu that
>>    owns the queue pair.
>>
>> Signed-off-by: Krishna Kumar<krkumar2@in.ibm.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Overall fine. I think it is best to smash the following patch
> into this one, so that the default behavior does not
> jump to mq and then back. Some comments below: mostly nits, and a minor bug.

Sure, thanks for reviewing.
>
> If you are worried the patch is too big, it can be split
> differently
> 	- rework to use send_queue/receive_queue structures, no
> 	  functional changes.
> 	- add multiqueue
>
> but this is not a must.
>
>> ---
>>   drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>>   include/linux/virtio_net.h |    2 +
>>   2 files changed, 452 insertions(+), 195 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 1db445b..7410187 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -26,6 +26,7 @@
>>   #include<linux/scatterlist.h>
>>   #include<linux/if_vlan.h>
>>   #include<linux/slab.h>
>> +#include<linux/interrupt.h>
>>
>>   static int napi_weight = 128;
>>   module_param(napi_weight, int, 0444);
>> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>>   #define VIRTNET_SEND_COMMAND_SG_MAX    2
>>   #define VIRTNET_DRIVER_VERSION "1.0.0"
>>
>> +#define MAX_QUEUES 256
>> +
>>   struct virtnet_stats {
>>   	struct u64_stats_sync tx_syncp;
>>   	struct u64_stats_sync rx_syncp;
> Would be a bit better not to have artificial limits like that.
> Maybe allocate arrays at probe time, then we can
> take whatever the device gives us?

Sure.
>> @@ -51,43 +54,69 @@ struct virtnet_stats {
>>   	u64 rx_packets;
>>   };
>>
>> -struct virtnet_info {
>> -	struct virtio_device *vdev;
>> -	struct virtqueue *rvq, *svq, *cvq;
>> -	struct net_device *dev;
>> +/* Internal representation of a send virtqueue */
>> +struct send_queue {
>> +	/* Virtqueue associated with this send _queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* TX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +/* Internal representation of a receive virtqueue */
>> +struct receive_queue {
>> +	/* Virtqueue associated with this receive_queue */
>> +	struct virtqueue *vq;
>> +
>> +	/* Back pointer to the virtnet_info */
>> +	struct virtnet_info *vi;
>> +
>>   	struct napi_struct napi;
>> -	unsigned int status;
>>
>>   	/* Number of input buffers, and max we've ever had. */
>>   	unsigned int num, max;
>>
>> +	/* Work struct for refilling if we run low on memory. */
>> +	struct delayed_work refill;
>> +
>> +	/* Chain pages by the private ptr. */
>> +	struct page *pages;
>> +
>> +	/* RX: fragments + linear part + virtio header */
>> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> +};
>> +
>> +struct virtnet_info {
>> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
>> +
>> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> The assumption is that a tx/rx pair is handled on the same cpu, yes?
> If so, maybe make it a single array to improve cache locality
> a bit?
> 	struct queue_pair {
> 		struct send_queue sq;
> 		struct receive_queue rq;
> 	};

Ok.
>> +	struct virtqueue *cvq;
>> +
>> +	struct virtio_device *vdev;
>> +	struct net_device *dev;
>> +	unsigned int status;
>> +
>>   	/* I like... big packets and I cannot lie! */
>>   	bool big_packets;
>>
>>   	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>>   	bool mergeable_rx_bufs;
>>
>> +	/* Has control virtqueue */
>> +	bool has_cvq;
>> +
> Won't checking (cvq != NULL) be enough?

Enough, so has_cvq duplicates vi->cvq. I will remove it in the next 
version.
>
>>   	/* enable config space updates */
>>   	bool config_enable;
>>
>>   	/* Active statistics */
>>   	struct virtnet_stats __percpu *stats;
>>
>> -	/* Work struct for refilling if we run low on memory. */
>> -	struct delayed_work refill;
>> -
>>   	/* Work struct for config space updates */
>>   	struct work_struct config_work;
>>
>>   	/* Lock for config space updates */
>>   	struct mutex config_lock;
>> -
>> -	/* Chain pages by the private ptr. */
>> -	struct page *pages;
>> -
>> -	/* fragments + linear part + virtio header */
>> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
>> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>>   };
>>
>>   struct skb_vnet_hdr {
>> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>>   	char padding[6];
>>   };
>>
>> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	int ret = virtqueue_get_queue_index(vq);
>> +
>> +	/* skip ctrl vq */
>> +	if (vi->has_cvq)
>> +		return (ret - 1) / 2;
>> +	else
>> +		return ret / 2;
>> +}
>> +
>> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>> +{
>> +	return virtqueue_get_queue_index(vq) / 2;
>> +}
>> +
>>   static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>   {
>>   	return (struct skb_vnet_hdr *)skb->cb;
>> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>>    * private is used to chain pages for big packets, put the whole
>>    * most recent used list in the beginning for reuse
>>    */
>> -static void give_pages(struct virtnet_info *vi, struct page *page)
>> +static void give_pages(struct receive_queue *rq, struct page *page)
>>   {
>>   	struct page *end;
>>
>>   	/* Find end of list, sew whole thing into vi->pages. */
>>   	for (end = page; end->private; end = (struct page *)end->private);
>> -	end->private = (unsigned long)vi->pages;
>> -	vi->pages = page;
>> +	end->private = (unsigned long)rq->pages;
>> +	rq->pages = page;
>>   }
>>
>> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>   {
>> -	struct page *p = vi->pages;
>> +	struct page *p = rq->pages;
>>
>>   	if (p) {
>> -		vi->pages = (struct page *)p->private;
>> +		rq->pages = (struct page *)p->private;
>>   		/* clear private here, it is used to chain pages */
>>   		p->private = 0;
>>   	} else
>> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>>   	return p;
>>   }
>>
>> -static void skb_xmit_done(struct virtqueue *svq)
>> +static void skb_xmit_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = svq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>>
>>   	/* Suppress further interrupts. */
>> -	virtqueue_disable_cb(svq);
>> +	virtqueue_disable_cb(vq);
>>
>>   	/* We were probably waiting for more output buffers. */
>> -	netif_wake_queue(vi->dev);
>> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>>   }
>>
>>   static void set_skb_frag(struct sk_buff *skb, struct page *page,
>> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>>   }
>>
>>   /* Called from bottom half context */
>> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>>   				   struct page *page, unsigned int len)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	unsigned int copy, hdr_len, offset;
>> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>   	}
>>
>>   	if (page)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return skb;
>>   }
>>
>> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	struct page *page;
>> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>   			skb->dev->stats.rx_length_errors++;
>>   			return -EINVAL;
>>   		}
>> -		page = virtqueue_get_buf(vi->rvq,&len);
>> +		page = virtqueue_get_buf(rq->vq,&len);
>>   		if (!page) {
>>   			pr_debug("%s: rx error: %d buffers missing\n",
>>   				 skb->dev->name, hdr->mhdr.num_buffers);
>> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   		set_skb_frag(skb, page, 0,&len);
>>
>> -		--vi->num;
>> +		--rq->num;
>>   	}
>>   	return 0;
>>   }
>>
>> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>>   {
>> +	struct net_device *dev = rq->vi->dev;
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>   	struct sk_buff *skb;
>> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		pr_debug("%s: short packet %i\n", dev->name, len);
>>   		dev->stats.rx_length_errors++;
>>   		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> +			give_pages(rq, buf);
>>   		else
>>   			dev_kfree_skb(buf);
>>   		return;
>> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>>   		skb_trim(skb, len);
>>   	} else {
>>   		page = buf;
>> -		skb = page_to_skb(vi, page, len);
>> +		skb = page_to_skb(rq, page, len);
>>   		if (unlikely(!skb)) {
>>   			dev->stats.rx_dropped++;
>> -			give_pages(vi, page);
>> +			give_pages(rq, page);
>>   			return;
>>   		}
>>   		if (vi->mergeable_rx_bufs)
>> -			if (receive_mergeable(vi, skb)) {
>> +			if (receive_mergeable(rq, skb)) {
>>   				dev_kfree_skb(skb);
>>   				return;
>>   			}
>> @@ -363,90 +410,91 @@ frame_err:
>>   	dev_kfree_skb(skb);
>>   }
>>
>> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct sk_buff *skb;
>>   	struct skb_vnet_hdr *hdr;
>>   	int err;
>>
>> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
>> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>>   	if (unlikely(!skb))
>>   		return -ENOMEM;
>>
>>   	skb_put(skb, MAX_PACKET_LEN);
>>
>>   	hdr = skb_vnet_hdr(skb);
>> -	sg_set_buf(vi->rx_sg,&hdr->hdr, sizeof hdr->hdr);
>> +	sg_set_buf(rq->sg,&hdr->hdr, sizeof hdr->hdr);
>> +
>> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>>
>> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>>   	if (err<  0)
>>   		dev_kfree_skb(skb);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *first, *list = NULL;
>>   	char *p;
>>   	int i, err, offset;
>>
>> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
>> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>>   	for (i = MAX_SKB_FRAGS + 1; i>  1; --i) {
>> -		first = get_a_page(vi, gfp);
>> +		first = get_a_page(rq, gfp);
>>   		if (!first) {
>>   			if (list)
>> -				give_pages(vi, list);
>> +				give_pages(rq, list);
>>   			return -ENOMEM;
>>   		}
>> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
>> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>>
>>   		/* chain new page in list head to match sg */
>>   		first->private = (unsigned long)list;
>>   		list = first;
>>   	}
>>
>> -	first = get_a_page(vi, gfp);
>> +	first = get_a_page(rq, gfp);
>>   	if (!first) {
>> -		give_pages(vi, list);
>> +		give_pages(rq, list);
>>   		return -ENOMEM;
>>   	}
>>   	p = page_address(first);
>>
>> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
>> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
>> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
>> +	/* rq->sg[0], rq->sg[1] share the same page */
>> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
>> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>>
>> -	/* vi->rx_sg[1] for data packet, from offset */
>> +	/* rq->sg[1] for data packet, from offset */
>>   	offset = sizeof(struct padded_vnet_hdr);
>> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
>> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>>
>>   	/* chain first in list head */
>>   	first->private = (unsigned long)list;
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>>   				first, gfp);
>>   	if (err < 0)
>> -		give_pages(vi, first);
>> +		give_pages(rq, first);
>>
>>   	return err;
>>   }
>>
>> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>>   {
>>   	struct page *page;
>>   	int err;
>>
>> -	page = get_a_page(vi, gfp);
>> +	page = get_a_page(rq, gfp);
>>   	if (!page)
>>   		return -ENOMEM;
>>
>> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
>> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>>
>> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
>> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>>   	if (err < 0)
>> -		give_pages(vi, page);
>> +		give_pages(rq, page);
>>
>>   	return err;
>>   }
>> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>>    * before we're receiving packets, or from refill_work which is
>>    * careful to disable receiving (using napi_disable).
>>    */
>> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
>> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>>   {
>> +	struct virtnet_info *vi = rq->vi;
>>   	int err;
>>   	bool oom;
>>
>>   	do {
>>   		if (vi->mergeable_rx_bufs)
>> -			err = add_recvbuf_mergeable(vi, gfp);
>> +			err = add_recvbuf_mergeable(rq, gfp);
>>   		else if (vi->big_packets)
>> -			err = add_recvbuf_big(vi, gfp);
>> +			err = add_recvbuf_big(rq, gfp);
>>   		else
>> -			err = add_recvbuf_small(vi, gfp);
>> +			err = add_recvbuf_small(rq, gfp);
>>
>>   		oom = err == -ENOMEM;
>>   		if (err < 0)
>>   			break;
>> -		++vi->num;
>> +		++rq->num;
>>   	} while (err > 0);
>> -	if (unlikely(vi->num > vi->max))
>> -		vi->max = vi->num;
>> -	virtqueue_kick(vi->rvq);
>> +	if (unlikely(rq->num > rq->max))
>> +		rq->max = rq->num;
>> +	virtqueue_kick(rq->vq);
>>   	return !oom;
>>   }
>>
>> -static void skb_recv_done(struct virtqueue *rvq)
>> +static void skb_recv_done(struct virtqueue *vq)
>>   {
>> -	struct virtnet_info *vi = rvq->vdev->priv;
>> +	struct virtnet_info *vi = vq->vdev->priv;
>> +	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
>> +
>>   	/* Schedule NAPI, Suppress further interrupts if successful. */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(rvq);
>> -		__napi_schedule(&vi->napi);
>> +	if (napi_schedule_prep(napi)) {
>> +		virtqueue_disable_cb(vq);
>> +		__napi_schedule(napi);
>>   	}
>>   }
>>
>> -static void virtnet_napi_enable(struct virtnet_info *vi)
>> +static void virtnet_napi_enable(struct receive_queue *rq)
>>   {
>> -	napi_enable(&vi->napi);
>> +	napi_enable(&rq->napi);
>>
>>   	/* If all buffers were filled by other side before we napi_enabled, we
>>   	 * won't get another interrupt, so process any outstanding packets
>>   	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>>   	 * We synchronize against interrupts via NAPI_STATE_SCHED */
>> -	if (napi_schedule_prep(&vi->napi)) {
>> -		virtqueue_disable_cb(vi->rvq);
>> +	if (napi_schedule_prep(&rq->napi)) {
>> +		virtqueue_disable_cb(rq->vq);
>>   		local_bh_disable();
>> -		__napi_schedule(&vi->napi);
>> +		__napi_schedule(&rq->napi);
>>   		local_bh_enable();
>>   	}
>>   }
>>
>>   static void refill_work(struct work_struct *work)
>>   {
>> -	struct virtnet_info *vi;
>> +	struct napi_struct *napi;
>> +	struct receive_queue *rq;
>>   	bool still_empty;
>>
>> -	vi = container_of(work, struct virtnet_info, refill.work);
>> -	napi_disable(&vi->napi);
>> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
>> -	virtnet_napi_enable(vi);
>> +	rq = container_of(work, struct receive_queue, refill.work);
>> +	napi = &rq->napi;
>> +
>> +	napi_disable(napi);
>> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
>> +	virtnet_napi_enable(rq);
>>
>>   	/* In theory, this can happen: if we don't get any buffers in
>>   	 * we will *never* try to fill again. */
>>   	if (still_empty)
>> -		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
>> +		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
>>   }
>>
>>   static int virtnet_poll(struct napi_struct *napi, int budget)
>>   {
>> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
>> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
>> +						napi);
>>   	void *buf;
>>   	unsigned int len, received = 0;
>>
>>   again:
>>   	while (received < budget &&
>> -	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
>> -		receive_buf(vi->dev, buf, len);
>> -		--vi->num;
>> +	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
>> +		receive_buf(rq, buf, len);
>> +		--rq->num;
>>   		received++;
>>   	}
>>
>> -	if (vi->num < vi->max / 2) {
>> -		if (!try_fill_recv(vi, GFP_ATOMIC))
>> -			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
>> +	if (rq->num < rq->max / 2) {
>> +		if (!try_fill_recv(rq, GFP_ATOMIC))
>> +			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
>>   	}
>>
>>   	/* Out of packets? */
>>   	if (received < budget) {
>>   		napi_complete(napi);
>> -		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
>> +		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
>>   		napi_schedule_prep(napi)) {
>> -			virtqueue_disable_cb(vi->rvq);
>> +			virtqueue_disable_cb(rq->vq);
>>   			__napi_schedule(napi);
>>   			goto again;
>>   		}
>> @@ -557,13 +612,14 @@ again:
>>   	return received;
>>   }
>>
>> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
>> +				       struct virtqueue *vq)
>>   {
>>   	struct sk_buff *skb;
>>   	unsigned int len, tot_sgs = 0;
>>   	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>
>> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>> +	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
>>   		pr_debug("Sent skb %p\n", skb);
>>
>>   		u64_stats_update_begin(&stats->tx_syncp);
>> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>>   	return tot_sgs;
>>   }
>>
>> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
>> +		    struct virtqueue *vq, struct scatterlist *sg)
>>   {
>>   	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>>   	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
>> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>>
>>   	/* Encode metadata header at front. */
>>   	if (vi->mergeable_rx_bufs)
>> -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
>> +		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
>>   	else
>> -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
>> +		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
>>
>> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
>> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
>> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
>> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>>   				 0, skb, GFP_ATOMIC);
>>   }
>>
>>   static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int qnum = skb_get_queue_mapping(skb);
>> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>>   	int capacity;
>>
>>   	/* Free up any pending old buffers before queueing new ones. */
>> -	free_old_xmit_skbs(vi);
>> +	free_old_xmit_skbs(vi, vq);
>>
>>   	/* Try to transmit */
>> -	capacity = xmit_skb(vi, skb);
>> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>>
>>   	/* This can happen with OOM and indirect buffers. */
>>   	if (unlikely(capacity < 0)) {
>>   		if (likely(capacity == -ENOMEM)) {
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "TX queue failure: out of memory\n");
>> +					"TXQ (%d) failure: out of memory\n",
>> +					qnum);
>>   		} else {
>>   			dev->stats.tx_fifo_errors++;
>>   			if (net_ratelimit())
>>   				dev_warn(&dev->dev,
>> -					 "Unexpected TX queue failure: %d\n",
>> -					 capacity);
>> +					"Unexpected TXQ (%d) failure: %d\n",
>> +					qnum, capacity);
>>   		}
>>   		dev->stats.tx_dropped++;
>>   		kfree_skb(skb);
>>   		return NETDEV_TX_OK;
>>   	}
>> -	virtqueue_kick(vi->svq);
>> +	virtqueue_kick(vq);
>>
>>   	/* Don't wait up for transmitted skbs to be freed. */
>>   	skb_orphan(skb);
>> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	/* Apparently nice girls don't return TX_BUSY; stop the queue
>>   	 * before it gets out of hand.  Naturally, this wastes entries. */
>>   	if (capacity < 2+MAX_SKB_FRAGS) {
>> -		netif_stop_queue(dev);
>> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
>> +		netif_stop_subqueue(dev, qnum);
>> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>>   			/* More just got used, free them then recheck. */
>> -			capacity += free_old_xmit_skbs(vi);
>> +			capacity += free_old_xmit_skbs(vi, vq);
>>   			if (capacity >= 2+MAX_SKB_FRAGS) {
>> -				netif_start_queue(dev);
>> -				virtqueue_disable_cb(vi->svq);
>> +				netif_start_subqueue(dev, qnum);
>> +				virtqueue_disable_cb(vq);
>>   			}
>>   		}
>>   	}
>> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   	unsigned int start;
>>
>>   	for_each_possible_cpu(cpu) {
>> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
>> +		struct virtnet_stats __percpu *stats
>> +			= per_cpu_ptr(vi->stats, cpu);
>>   		u64 tpackets, tbytes, rpackets, rbytes;
>>
>>   		do {
>> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>>   static void virtnet_netpoll(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	napi_schedule(&vi->napi);
>> +	for (i = 0; i < vi->num_queue_pairs; i++)
>> +		napi_schedule(&vi->rq[i]->napi);
>>   }
>>   #endif
>>
>>   static int virtnet_open(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>> -	/* Make sure we have some buffers: if oom use wq. */
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		/* Make sure we have some buffers: if oom use wq. */
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>> +		virtnet_napi_enable(vi->rq[i]);
>> +	}
>>
>> -	virtnet_napi_enable(vi);
>>   	return 0;
>>   }
>>
>> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>>   static int virtnet_close(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> +	int i;
>>
>>   	/* Make sure refill_work doesn't re-enable napi! */
>> -	cancel_delayed_work_sync(&vi->refill);
>> -	napi_disable(&vi->napi);
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +		napi_disable(&vi->rq[i]->napi);
>> +	}
>>
>>   	return 0;
>>   }
>> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>>
>> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
>> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
>> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
>> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>>   	ring->rx_pending = ring->rx_max_pending;
>>   	ring->tx_pending = ring->tx_max_pending;
>> -
>>   }
>>
>>
>> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>>   	return 0;
>>   }
>>
>> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
>> + * txq based on the processor id.
>> + */
>> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
>> +{
>> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
>> +		  smp_processor_id();
>> +
>> +	while (unlikely(txq >= dev->real_num_tx_queues))
>> +		txq -= dev->real_num_tx_queues;
>> +	return txq;
>> +}
>> +
>>   static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_open            = virtnet_open,
>>   	.ndo_stop   	     = virtnet_close,
>> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>>   	.ndo_get_stats64     = virtnet_stats,
>>   	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>>   	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
>> +	.ndo_select_queue     = virtnet_select_queue,
>>   #ifdef CONFIG_NET_POLL_CONTROLLER
>>   	.ndo_poll_controller = virtnet_netpoll,
>>   #endif
>> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>>
>>   	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>>   		netif_carrier_on(vi->dev);
>> -		netif_wake_queue(vi->dev);
>> +		netif_tx_wake_all_queues(vi->dev);
>>   	} else {
>>   		netif_carrier_off(vi->dev);
>> -		netif_stop_queue(vi->dev);
>> +		netif_tx_stop_all_queues(vi->dev);
>>   	}
>>   done:
>>   	mutex_unlock(&vi->config_lock);
>> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>>   	queue_work(system_nrt_wq, &vi->config_work);
>>   }
>>
>> -static int init_vqs(struct virtnet_info *vi)
>> +static void free_receive_bufs(struct virtnet_info *vi)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		while (vi->rq[i]->pages)
>> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
>> +	}
>> +}
>> +
>> +/* Free memory allocated for send and receive queues */
>> +static void virtnet_free_queues(struct virtnet_info *vi)
>>   {
>> -	struct virtqueue *vqs[3];
>> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
>> -	const char *names[] = { "input", "output", "control" };
>> -	int nvqs, err;
>> +	int i;
>>
>> -	/* We expect two virtqueues, receive then send,
>> -	 * and optionally control. */
>> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		kfree(vi->rq[i]);
>> +		vi->rq[i] = NULL;
>> +		kfree(vi->sq[i]);
>> +		vi->sq[i] = NULL;
>> +	}
>> +}
>>
>> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
>> -	if (err)
>> -		return err;
>> +static void free_unused_bufs(struct virtnet_info *vi)
>> +{
>> +	void *buf;
>> +	int i;
>> +
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->sq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
>> +			dev_kfree_skb(buf);
>> +	}
>> +
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		struct virtqueue *vq = vi->rq[i]->vq;
>> +
>> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
>> +			if (vi->mergeable_rx_bufs || vi->big_packets)
>> +				give_pages(vi->rq[i], buf);
>> +			else
>> +				dev_kfree_skb(buf);
>> +			--vi->rq[i]->num;
>> +		}
>> +		BUG_ON(vi->rq[i]->num != 0);
>> +	}
>> +}
>> +
>> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
>> +{
>> +	int i;
>> +
>> +	if (vi->num_queue_pairs == 1)
>> +		return;
>> +
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		int cpu = set ? i : -1;
>> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
>> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
>> +	}
>> +	return;
>> +}
>> +
>> +static void virtnet_del_vqs(struct virtnet_info *vi)
>> +{
>> +	struct virtio_device *vdev = vi->vdev;
>> +
>> +	virtnet_set_affinity(vi, false);
>> +
>> +	vdev->config->del_vqs(vdev);
>> +
>> +	virtnet_free_queues(vi);
>> +}
>> +
>> +static int virtnet_find_vqs(struct virtnet_info *vi)
>> +{
>> +	vq_callback_t **callbacks;
>> +	struct virtqueue **vqs;
>> +	int ret = -ENOMEM;
>> +	int i, total_vqs;
>> +	char **names;
>>
>> -	vi->rvq = vqs[0];
>> -	vi->svq = vqs[1];
>> +	/*
>> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
>> +	 * the possible control virtqueue, and then the same pattern
>> +	 * 'vi->num_queue_pairs-1' more times
>> +	 */
>> +	total_vqs = vi->num_queue_pairs * 2 +
>> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
>> +
>> +	/* Allocate space for find_vqs parameters */
>> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
>> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
>> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);
> so this needs to be kzalloc otherwise on an error cleanup will
> get uninitialized data and crash?

Yes, will change it to use kzalloc.
>
>> +	if (!vqs || !callbacks || !names)
>> +		goto err;
>> +
>> +	/* Parameters for control virtqueue, if any */
>> +	if (vi->has_cvq) {
>> +		callbacks[2] = NULL;
>> +		names[2] = "control";
>> +	}
>> +
>> +	/* Allocate/initialize parameters for send/receive virtqueues */
>> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
>> +		int j = (i == 0 ? i : i + vi->has_cvq);
>> +		callbacks[j] = skb_recv_done;
>> +		callbacks[j + 1] = skb_xmit_done;
>> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
>> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);
> This needs wrappers. E.g. virtnet_rx_vq(int queue_pair), virtnet_tx_vq(int queue_pair);
> Then you would just scan 0 to num_queue_pairs, and i is queue pair
> number.

Ok.
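A sketch of the wrappers being suggested, derived from the `j = (i == 0 ? i : i + vi->has_cvq)` mapping in the patch; the helper names are hypothetical.

```c
#include <assert.h>

/* Map a queue pair number to its slot in the flat vqs[] array handed to
 * find_vqs().  Layout with a control vq: rx0, tx0, cvq, rx1, tx1, ...
 * Pair 0 always occupies slots 0/1; every later pair is shifted by one
 * when the control virtqueue sits at slot 2. */
static int virtnet_rx_vq_idx(int pair, int has_cvq)
{
	return pair == 0 ? 0 : 2 * pair + has_cvq;
}

static int virtnet_tx_vq_idx(int pair, int has_cvq)
{
	return virtnet_rx_vq_idx(pair, has_cvq) + 1;
}
```

The setup loop then scans pair numbers 0..num_queue_pairs-1 directly instead of stepping `i` by 2 and recomputing `j` inline.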
>> +	}
>>
>> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>> +					 (const char **)names);
>> +	if (ret)
>> +		goto err;
>> +
>> +	if (vi->has_cvq)
>>   		vi->cvq = vqs[2];
>>
>> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
>> +		int j = i == 0 ? i : i + vi->has_cvq;
>> +		vi->rq[i / 2]->vq = vqs[j];
>> +		vi->sq[i / 2]->vq = vqs[j + 1];
> Same here.

Considering the code is really simple, there seems to be no need for helpers.
>
>>   	}
>> -	return 0;
>> +
>> +err:
>> +	if (ret && names)
> If we are here ret != 0. For names, just add another label, don't
> complicate cleanup.

Ok.
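The label-per-allocation style being asked for can be modeled as a self-contained userspace sketch: each allocation gets its own unwind label, so the error path never needs to re-test which pointers are valid (no `if (ret && names)`). Here malloc/free stand in for the kernel allocations, and `setup`/`fail_step` are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* fail_step forces a failure at a given allocation step (-1 = no failure),
 * so both the success path and every unwind path can be exercised. */
static int setup(int fail_step)
{
	int *vqs, *callbacks, *names;

	vqs = malloc(sizeof(*vqs));
	if (!vqs || fail_step == 0)
		goto err_vqs;
	callbacks = malloc(sizeof(*callbacks));
	if (!callbacks || fail_step == 1)
		goto err_callbacks;
	names = malloc(sizeof(*names));
	if (!names || fail_step == 2)
		goto err_names;

	/* success: caller would keep the arrays; freed here for the model */
	free(names);
	free(callbacks);
	free(vqs);
	return 0;

err_names:
	free(names);
err_callbacks:
	free(callbacks);
err_vqs:
	free(vqs);
	return -1;
}
```

Each label frees exactly what was live at the point its `goto` was taken, so cleanup stays flat and mechanical as allocations are added.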
>> +		for (i = 0; i < vi->num_queue_pairs * 2; i++)
>> +			kfree(names[i]);
>> +
>> +	kfree(names);
>> +	kfree(callbacks);
>> +	kfree(vqs);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_alloc_queues(struct virtnet_info *vi)
>> +{
>> +	int ret = -ENOMEM;
>> +	int i;
>> +
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
>> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
>> +		if (!vi->rq[i] || !vi->sq[i])
>> +			goto err;
>> +	}
>> +
>> +	ret = 0;
>> +
>> +	/* setup initial receive and send queue parameters */
>> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>> +		vi->rq[i]->vi = vi;
>> +		vi->rq[i]->pages = NULL;
>> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
>> +		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
>> +			       napi_weight);
>> +
>> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
>> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
>> +	}
>> +
> Add return 0 here, then ret = 0 will not be needed
> above and if (ret) below.
>

Ok.
>> +err:
>> +	if (ret)
>> +		virtnet_free_queues(vi);
>> +
>> +	return ret;
>> +}
>> +
>> +static int virtnet_setup_vqs(struct virtnet_info *vi)
>> +{
>> +	int ret;
>> +
>> +	/* Allocate send & receive queues */
>> +	ret = virtnet_alloc_queues(vi);
>> +	if (!ret) {
>> +		ret = virtnet_find_vqs(vi);
>> +		if (ret)
>> +			virtnet_free_queues(vi);
>> +		else
>> +			virtnet_set_affinity(vi, true);
>> +	}
>> +
>> +	return ret;
> Add some labels for error handling, this if nesting is messy.

Ok.
>>   }
>>
>>   static int virtnet_probe(struct virtio_device *vdev)
>>   {
>> -	int err;
>> +	int i, err;
>>   	struct net_device *dev;
>>   	struct virtnet_info *vi;
>> +	u16 num_queues, num_queue_pairs;
>> +
>> +	/* Find if host supports multiqueue virtio_net device */
>> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
>> +				offsetof(struct virtio_net_config,
>> +				num_queues), &num_queues);
>> +
>> +	/* We need atleast 2 queue's */
> typo

Will fix this, thanks.
>> +	if (err || num_queues < 2)
>> +		num_queues = 2;
>> +	if (num_queues > MAX_QUEUES * 2)
>> +		num_queues = MAX_QUEUES;
>> +
>> +	num_queue_pairs = num_queues / 2;
>>
>>   	/* Allocate ourselves a network device with room for our info */
>> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
>> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>>   	if (!dev)
>>   		return -ENOMEM;
>>
>> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>>
>>   	/* Set up our device-specific information */
>>   	vi = netdev_priv(dev);
>> -	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
>>   	vi->dev = dev;
>>   	vi->vdev = vdev;
>>   	vdev->priv = vi;
>> -	vi->pages = NULL;
>>   	vi->stats = alloc_percpu(struct virtnet_stats);
>>   	err = -ENOMEM;
>>   	if (vi->stats == NULL)
>> -		goto free;
>> +		goto free_netdev;
>>
>> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>>   	mutex_init(&vi->config_lock);
>>   	vi->config_enable = true;
>>   	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
>> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
>> +	vi->num_queue_pairs = num_queue_pairs;
>>
>>   	/* If we can receive ANY GSO packets, we must allocate large ones. */
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>>   		vi->mergeable_rx_bufs = true;
>>
>> -	err = init_vqs(vi);
>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>> +		vi->has_cvq = true;
>> +
> How about we disable multiqueue if there's no cvq?
> Will make logic a bit simpler, won't it?

We can, and as you said, it would make the logic a bit simpler.
>> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>> -		goto free_stats;
>> +		goto free_netdev;
>> +
>> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
>> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>>
>>   	err = register_netdev(dev);
>>   	if (err) {
>> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	}
>>
>>   	/* Last of all, set up some receive buffers. */
>> -	try_fill_recv(vi, GFP_KERNEL);
>> -
>> -	/* If we didn't even get one input buffer, we're useless. */
>> -	if (vi->num == 0) {
>> -		err = -ENOMEM;
>> -		goto unregister;
>> +	for (i = 0; i < num_queue_pairs; i++) {
>> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
>> +
>> +		/* If we didn't even get one input buffer, we're useless. */
>> +		if (vi->rq[i]->num == 0) {
>> +			free_unused_bufs(vi);
>> +			err = -ENOMEM;
>> +			goto free_recv_bufs;
>> +		}
>>   	}
>>
>>   	/* Assume link up if device can't report link status,
>> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   		netif_carrier_on(dev);
>>   	}
>>
>> -	pr_debug("virtnet: registered device %s\n", dev->name);
>> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>> +		 dev->name, num_queue_pairs);
>> +
>>   	return 0;
>>
>> -unregister:
>> +free_recv_bufs:
>> +	free_receive_bufs(vi);
>>   	unregister_netdev(dev);
>> +
>>   free_vqs:
>> -	vdev->config->del_vqs(vdev);
>> -free_stats:
>> -	free_percpu(vi->stats);
>> -free:
>> +	for (i = 0; i < num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>> +	virtnet_del_vqs(vi);
>> +
>> +free_netdev:
>>   	free_netdev(dev);
>>   	return err;
>>   }
>>
>> -static void free_unused_bufs(struct virtnet_info *vi)
>> -{
>> -	void *buf;
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->svq);
>> -		if (!buf)
>> -			break;
>> -		dev_kfree_skb(buf);
>> -	}
>> -	while (1) {
>> -		buf = virtqueue_detach_unused_buf(vi->rvq);
>> -		if (!buf)
>> -			break;
>> -		if (vi->mergeable_rx_bufs || vi->big_packets)
>> -			give_pages(vi, buf);
>> -		else
>> -			dev_kfree_skb(buf);
>> -		--vi->num;
>> -	}
>> -	BUG_ON(vi->num != 0);
>> -}
>> -
>>   static void remove_vq_common(struct virtnet_info *vi)
>>   {
>>   	vi->vdev->config->reset(vi->vdev);
>> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>>   	/* Free unused buffers in both send and recv, if any. */
>>   	free_unused_bufs(vi);
>>
>> -	vi->vdev->config->del_vqs(vi->vdev);
>> +	free_receive_bufs(vi);
>>
>> -	while (vi->pages)
>> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
>> +	virtnet_del_vqs(vi);
>>   }
>>
>>   static void __devexit virtnet_remove(struct virtio_device *vdev)
>> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>>   static int virtnet_freeze(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> +	int i;
>>
>>   	/* Prevent config work handler from accessing the device */
>>   	mutex_lock(&vi->config_lock);
>> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   	mutex_unlock(&vi->config_lock);
>>
>>   	netif_device_detach(vi->dev);
>> -	cancel_delayed_work_sync(&vi->refill);
>> +	for (i = 0; i < vi->num_queue_pairs; i++)
>> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>>
>>   	if (netif_running(vi->dev))
>> -		napi_disable(&vi->napi);
>> +		for (i = 0; i < vi->num_queue_pairs; i++)
>> +			napi_disable(&vi->rq[i]->napi);
>> +
>>
>>   	remove_vq_common(vi);
>>
>> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>>   static int virtnet_restore(struct virtio_device *vdev)
>>   {
>>   	struct virtnet_info *vi = vdev->priv;
>> -	int err;
>> +	int err, i;
>>
>> -	err = init_vqs(vi);
>> +	err = virtnet_setup_vqs(vi);
>>   	if (err)
>>   		return err;
>>
>>   	if (netif_running(vi->dev))
>> -		virtnet_napi_enable(vi);
>> +		for (i = 0; i < vi->num_queue_pairs; i++)
>> +			virtnet_napi_enable(vi->rq[i]);
>>
>>   	netif_device_attach(vi->dev);
>>
>> -	if (!try_fill_recv(vi, GFP_KERNEL))
>> -		queue_delayed_work(system_nrt_wq,&vi->refill, 0);
>> +	for (i = 0; i < vi->num_queue_pairs; i++)
>> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
>> +			queue_delayed_work(system_nrt_wq,
>> +					&vi->rq[i]->refill, 0);
>>
>>   	mutex_lock(&vi->config_lock);
>>   	vi->config_enable = true;
>> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>>   	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>>   	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>>   	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
>> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
>> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>>   };
>>
>>   static struct virtio_driver virtio_net_driver = {
>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
>> index 1bc7e30..60f09ff 100644
>> --- a/include/linux/virtio_net.h
>> +++ b/include/linux/virtio_net.h
>> @@ -61,6 +61,8 @@ struct virtio_net_config {
>>   	__u8 mac[6];
>>   	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>>   	__u16 status;
>> +	/* Total number of RX/TX queues */
>> +	__u16 num_queues;
>>   } __attribute__((packed));
>>
>>   /* This is the first element of the scatter-gather list.  If you don't
>> -- 
>> 1.7.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-21 12:02     ` Sasha Levin
@ 2012-07-23  5:54       ` Jason Wang
  2012-07-23  9:28         ` Sasha Levin
  2012-07-29  9:44       ` Michael S. Tsirkin
  1 sibling, 1 reply; 46+ messages in thread
From: Jason Wang @ 2012-07-23  5:54 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Michael S. Tsirkin, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/21/2012 08:02 PM, Sasha Levin wrote:
> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>> -	err = init_vqs(vi);
>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>> +		vi->has_cvq = true;
>>>> +
>> How about we disable multiqueue if there's no cvq?
>> Will make logic a bit simpler, won't it?
> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
>

Yes, it does not depend on the cvq. The cvq is just used to negotiate the 
number of queues the guest wishes to use, which is really useful (at 
least for now). Since multiqueue cannot out-perform single queue in every 
kind of workload or benchmark, we want the guest driver to use a single 
queue by default even when multiqueue is enabled by the management 
software, and let the user enable it through ethtool. That way the user 
does not see a regression when switching to a multiqueue-capable driver 
and backend.

So the only difference is the user experience.
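Assuming the `ethtool -{L|l}` support mentioned in the cover letter, the opt-in flow from inside the guest might look like this (interface name and channel counts are illustrative):

```shell
# Query how many queue pairs the device supports vs. currently uses
ethtool -l eth0

# The driver defaults to a single queue pair; opt in to 4 explicitly
ethtool -L eth0 combined 4
```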


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-23  5:54       ` Jason Wang
@ 2012-07-23  9:28         ` Sasha Levin
  2012-07-30  3:29           ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Sasha Levin @ 2012-07-23  9:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/23/2012 07:54 AM, Jason Wang wrote:
> On 07/21/2012 08:02 PM, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -    err = init_vqs(vi);
>>>>> +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +        vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
>>
> 
> Yes, it does not depend on the cvq. The cvq is just used to negotiate the number of queues the guest wishes to use, which is really useful (at least for now). Since multiqueue cannot out-perform single queue in every kind of workload or benchmark, we want the guest driver to use a single queue by default even when multiqueue is enabled by the management software, and let the user enable it through ethtool. That way the user does not see a regression when switching to a multiqueue-capable driver and backend.

Why would you limit it to a single vq if the user has specified a different number of vqs (>1) in the virtio-net device config?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
  2012-07-05 11:40   ` Sasha Levin
  2012-07-06  3:17     ` Jason Wang
@ 2012-07-26  8:20     ` Paolo Bonzini
  2012-07-30  3:30       ` Jason Wang
  1 sibling, 1 reply; 46+ messages in thread
From: Paolo Bonzini @ 2012-07-26  8:20 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jason Wang, krkumar2, habanero, mashirle, kvm, mst, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri

On 05/07/2012 13:40, Sasha Levin wrote:
> @@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
>         vring_del_virtqueue(vq);
>  
>         /* Select and deactivate the queue */
> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>         writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>  

This accesses vq after vring_del_virtqueue has freed it.

Paolo


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-07-05 10:29 ` [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue Jason Wang
@ 2012-07-27 14:38   ` Paolo Bonzini
  2012-07-29 20:40     ` Michael S. Tsirkin
  2012-08-09 15:13   ` Paolo Bonzini
  1 sibling, 1 reply; 46+ messages in thread
From: Paolo Bonzini @ 2012-07-27 14:38 UTC (permalink / raw)
  To: Jason Wang, mst, Nicholas A. Bellinger
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, kvm, sri

On 05/07/2012 12:29, Jason Wang wrote:
> Sometimes, a virtio device needs to configure an irq affinity hint to
> maximize performance. Instead of just exposing the irq of a virtqueue,
> this patch introduces an API to set the affinity for a virtqueue.
> 
> The API is best-effort: the affinity hint may not be set as expected due
> to platform support, irq sharing or irq type. Currently, only the pci
> method is implemented, and we set the affinity according to:
> 
> - if device uses INTX, we just ignore the request
> - if device has per vq vector, we force the affinity hint
> - if the virtqueues share MSI, make the affinity OR over all affinities
>  requested
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Hmm, I don't see any benefit from this patch, I need to use
irq_set_affinity (which however is not exported) to actually bind IRQs
to CPUs.  Example:

with irq_set_affinity_hint:
 43:   89  107  100   97   PCI-MSI-edge   virtio0-request
 44:  178  195  268  199   PCI-MSI-edge   virtio0-request
 45:   97  100   97  155   PCI-MSI-edge   virtio0-request
 46:  234  261  213  218   PCI-MSI-edge   virtio0-request

with irq_set_affinity:
 43:  721    0    0    1   PCI-MSI-edge   virtio0-request
 44:    0  746    0    1   PCI-MSI-edge   virtio0-request
 45:    0    0  658    0   PCI-MSI-edge   virtio0-request
 46:    0    0    1  547   PCI-MSI-edge   virtio0-request

I gathered these quickly after boot, but real benchmarks show the same
behavior, and performance gets actually worse with virtio-scsi
multiqueue+irq_set_affinity_hint than with irq_set_affinity.

I also tried adding IRQ_NO_BALANCING, but the only effect is that I
cannot set the affinity.

The queue steering algorithm I use in virtio-scsi is extremely simple
and based on your tx code.  See how my nice pinning is destroyed:

# taskset -c 0 dd if=/dev/sda bs=1M count=1000 of=/dev/null iflag=direct
# cat /proc/interrupts
 43:  2690 2709 2691 2696   PCI-MSI-edge      virtio0-request
 44:   109  122  199  124   PCI-MSI-edge      virtio0-request
 45:   170  183  170  237   PCI-MSI-edge      virtio0-request
 46:   143  166  125  125   PCI-MSI-edge      virtio0-request

All my requests come from CPU#0 and thus go to the first virtqueue, but
the interrupts are serviced all over the place.

Did you set the affinity manually in your experiments, or perhaps there
is a difference between scsi and networking... (interrupt mitigation?)

Paolo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-21 12:02     ` Sasha Levin
  2012-07-23  5:54       ` Jason Wang
@ 2012-07-29  9:44       ` Michael S. Tsirkin
  2012-07-30  3:26         ` Jason Wang
  2012-07-30 13:00         ` Sasha Levin
  1 sibling, 2 replies; 46+ messages in thread
From: Michael S. Tsirkin @ 2012-07-29  9:44 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jason Wang, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
> >> -	err = init_vqs(vi);
> >> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> >> > +		vi->has_cvq = true;
> >> > +
> > How about we disable multiqueue if there's no cvq?
> > Will make logic a bit simpler, won't it?
> 
> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?

Well !cvq support is a legacy feature: the reason we support it
in driver is to avoid breaking on old hosts. Adding more code to that
path just doesn't make much sense since old hosts won't have mq.

-- 
MST

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-23  5:48     ` Jason Wang
@ 2012-07-29  9:50       ` Michael S. Tsirkin
  2012-07-30  5:15         ` Jason Wang
  0 siblings, 1 reply; 46+ messages in thread
From: Michael S. Tsirkin @ 2012-07-29  9:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On Mon, Jul 23, 2012 at 01:48:35PM +0800, Jason Wang wrote:
> >>+	}
> >>
> >>-	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> >>+	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
> >>+					 (const char **)names);
> >>+	if (ret)
> >>+		goto err;
> >>+
> >>+	if (vi->has_cvq)
> >>  		vi->cvq = vqs[2];
> >>
> >>-		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> >>-			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> >>+	for (i = 0; i<  vi->num_queue_pairs * 2; i += 2) {
> >>+		int j = i == 0 ? i : i + vi->has_cvq;
> >>+		vi->rq[i / 2]->vq = vqs[j];
> >>+		vi->sq[i / 2]->vq = vqs[j + 1];
> >Same here.
> 
> Considering the code is really simple, there seems to be no need for helpers.

Well it was not simple to at least one reader :)
The problem is not that this logic is complex;
it is that it is spread all over the code.

If we had e.g. vnet_tx_vqn_to_queuenum vnet_tx_queuenum_to_vqn
and same for rx, then the logic would all be
in one place, and have a tidy comment on top explaining
the VQ numbering scheme.

-- 
MST

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-07-27 14:38   ` Paolo Bonzini
@ 2012-07-29 20:40     ` Michael S. Tsirkin
  2012-07-30  6:27       ` Paolo Bonzini
  0 siblings, 1 reply; 46+ messages in thread
From: Michael S. Tsirkin @ 2012-07-29 20:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, Nicholas A. Bellinger, mashirle, krkumar2, habanero,
	rusty, netdev, linux-kernel, virtualization, edumazet, tahm,
	jwhan, davem, kvm, sri

On Fri, Jul 27, 2012 at 04:38:11PM +0200, Paolo Bonzini wrote:
> On 05/07/2012 12:29, Jason Wang wrote:
> > Sometimes, a virtio device needs to configure an irq affinity hint to
> > maximize performance. Instead of just exposing the irq of a virtqueue,
> > this patch introduces an API to set the affinity for a virtqueue.
> > 
> > The API is best-effort: the affinity hint may not be set as expected due
> > to platform support, irq sharing or irq type. Currently, only the pci
> > method is implemented, and we set the affinity according to:
> > 
> > - if device uses INTX, we just ignore the request
> > - if device has per vq vector, we force the affinity hint
> > - if the virtqueues share MSI, make the affinity OR over all affinities
> >  requested
> > 
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> Hmm, I don't see any benefit from this patch, I need to use
> irq_set_affinity (which however is not exported) to actually bind IRQs
> to CPUs.  Example:
> 
> with irq_set_affinity_hint:
>  43:   89  107  100   97   PCI-MSI-edge   virtio0-request
>  44:  178  195  268  199   PCI-MSI-edge   virtio0-request
>  45:   97  100   97  155   PCI-MSI-edge   virtio0-request
>  46:  234  261  213  218   PCI-MSI-edge   virtio0-request
> 
> with irq_set_affinity:
>  43:  721    0    0    1   PCI-MSI-edge   virtio0-request
>  44:    0  746    0    1   PCI-MSI-edge   virtio0-request
>  45:    0    0  658    0   PCI-MSI-edge   virtio0-request
>  46:    0    0    1  547   PCI-MSI-edge   virtio0-request
> 
> I gathered these quickly after boot, but real benchmarks show the same
> behavior, and performance gets actually worse with virtio-scsi
> multiqueue+irq_set_affinity_hint than with irq_set_affinity.
> 
> I also tried adding IRQ_NO_BALANCING, but the only effect is that I
> cannot set the affinity.
> 
> The queue steering algorithm I use in virtio-scsi is extremely simple
> and based on your tx code.  See how my nice pinning is destroyed:
> 
> # taskset -c 0 dd if=/dev/sda bs=1M count=1000 of=/dev/null iflag=direct
> # cat /proc/interrupts
>  43:  2690 2709 2691 2696   PCI-MSI-edge      virtio0-request
>  44:   109  122  199  124   PCI-MSI-edge      virtio0-request
>  45:   170  183  170  237   PCI-MSI-edge      virtio0-request
>  46:   143  166  125  125   PCI-MSI-edge      virtio0-request
> 
> All my requests come from CPU#0 and thus go to the first virtqueue, but
> the interrupts are serviced all over the place.
> 
> Did you set the affinity manually in your experiments, or perhaps there
> is a difference between scsi and networking... (interrupt mitigation?)
> 
> Paolo


You need to run irqbalancer in guest to make it actually work. Do you?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-29  9:44       ` Michael S. Tsirkin
@ 2012-07-30  3:26         ` Jason Wang
  2012-07-30 13:00         ` Sasha Levin
  1 sibling, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-30  3:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sasha Levin, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/29/2012 05:44 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -	err = init_vqs(vi);
>>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +		vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
> Well !cvq support is a legacy feature: the reason we support it
> in driver is to avoid breaking on old hosts. Adding more code to that
> path just doesn't make much sense since old hosts won't have mq.
>

After some thought about this, maybe there's no need for the cvq in the 
negotiation if we want to support only two modes (1 tx/rx queue pair and N 
tx/rx queue pairs). We can do this just through the feature bit negotiation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-23  9:28         ` Sasha Levin
@ 2012-07-30  3:29           ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-30  3:29 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Michael S. Tsirkin, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/23/2012 05:28 PM, Sasha Levin wrote:
> On 07/23/2012 07:54 AM, Jason Wang wrote:
>> On 07/21/2012 08:02 PM, Sasha Levin wrote:
>>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>>> -    err = init_vqs(vi);
>>>>>> +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>>> +        vi->has_cvq = true;
>>>>>> +
>>>> How about we disable multiqueue if there's no cvq?
>>>> Will make logic a bit simpler, won't it?
>>> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
>>>
>> Yes, it does not depend on cvq. The cvq is just used to negotiate the
>> number of queues a guest wishes to use, which is really useful (at least
>> for now). Since multiqueue cannot out-perform single queue in every kind
>> of workload or benchmark, we want to let the guest driver use a single
>> queue by default even when multiqueue is enabled by the management
>> software, and let the user enable it through ethtool. That way the user
>> does not see a regression when switching to a multiqueue-capable driver
>> and backend.
> Why would you limit it to a single vq if the user has specified a different number of vqs (>1) in the virtio-net device config?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

The only reason is to prevent the user from seeing a regression. The 
performance of small packet sending is worse than with a single queue: 
multiqueue tends to send more, but smaller, packets when enabled. If we 
make multiqueue behave as well as single queue, we can remove this limit.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue
  2012-07-26  8:20     ` Paolo Bonzini
@ 2012-07-30  3:30       ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-30  3:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sasha Levin, krkumar2, habanero, mashirle, kvm, mst, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri

On 07/26/2012 04:20 PM, Paolo Bonzini wrote:
> On 05/07/2012 13:40, Sasha Levin wrote:
>> @@ -275,7 +274,7 @@ static void vm_del_vq(struct virtqueue *vq)
>>          vring_del_virtqueue(vq);
>>
>>          /* Select and deactivate the queue */
>> -       writel(info->queue_index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>> +       writel(virtqueue_get_queue_index(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_SEL);
>>          writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>>
> This accesses vq after vring_del_virtqueue has freed it.
>
> Paolo
>

Yes, so we need to save the index in a temporary variable before calling 
vring_del_virtqueue().

Thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-29  9:50       ` Michael S. Tsirkin
@ 2012-07-30  5:15         ` Jason Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Jason Wang @ 2012-07-30  5:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, akong, kvm, sri

On 07/29/2012 05:50 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 23, 2012 at 01:48:35PM +0800, Jason Wang wrote:
>>>> +	}
>>>>
>>>> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
>>>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>>>> +					 (const char **)names);
>>>> +	if (ret)
>>>> +		goto err;
>>>> +
>>>> +	if (vi->has_cvq)
>>>>   		vi->cvq = vqs[2];
>>>>
>>>> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>>>> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>>>> +	for (i = 0; i<   vi->num_queue_pairs * 2; i += 2) {
>>>> +		int j = i == 0 ? i : i + vi->has_cvq;
>>>> +		vi->rq[i / 2]->vq = vqs[j];
>>>> +		vi->sq[i / 2]->vq = vqs[j + 1];
>>> Same here.
>> Considering the code is really simple, there seems to be no need for helpers.
> Well it was not simple to at least one reader :)
> The problem is not that this logic is complex;
> it is that it is spread all over the code.
>
> If we had e.g. vnet_tx_vqn_to_queuenum vnet_tx_queuenum_to_vqn
> and same for rx, then the logic would all be
> in one place, and have a tidy comment on top explaining
> the VQ numbering scheme.
>

Looks reasonable, thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-07-29 20:40     ` Michael S. Tsirkin
@ 2012-07-30  6:27       ` Paolo Bonzini
  2012-08-09 15:14         ` Paolo Bonzini
  0 siblings, 1 reply; 46+ messages in thread
From: Paolo Bonzini @ 2012-07-30  6:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, kvm, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri

On 29/07/2012 22:40, Michael S. Tsirkin wrote:
>> > Did you set the affinity manually in your experiments, or perhaps there
>> > is a difference between scsi and networking... (interrupt mitigation?)
> 
> You need to run irqbalancer in guest to make it actually work. Do you?

Yes, of course, now on to debugging that one.  I just wanted to ask
before the weekend if I was missing something as obvious as that.

Paolo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
  2012-07-29  9:44       ` Michael S. Tsirkin
  2012-07-30  3:26         ` Jason Wang
@ 2012-07-30 13:00         ` Sasha Levin
  1 sibling, 0 replies; 46+ messages in thread
From: Sasha Levin @ 2012-07-30 13:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem,
	akong, kvm, sri

On 07/29/2012 11:44 AM, Michael S. Tsirkin wrote:
> On Sat, Jul 21, 2012 at 02:02:58PM +0200, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -	err = init_vqs(vi);
>>>>> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +		vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>>
>> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
> 
> Well !cvq support is a legacy feature: the reason we support it
> in driver is to avoid breaking on old hosts. Adding more code to that
> path just doesn't make much sense since old hosts won't have mq.

Is it really a legacy feature? The spec suggests that it's an optional queue which is not necessary for the operation of the device.

Which is why we never implemented it in lkvm - we weren't interested in any of the features it provided at that time and we could provide high performance with vhost support even without it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-07-05 10:29 ` [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue Jason Wang
  2012-07-27 14:38   ` Paolo Bonzini
@ 2012-08-09 15:13   ` Paolo Bonzini
  2012-08-09 15:35     ` Avi Kivity
  1 sibling, 1 reply; 46+ messages in thread
From: Paolo Bonzini @ 2012-08-09 15:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, mashirle, krkumar2, habanero, rusty, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, kvm, sri

On 05/07/2012 12:29, Jason Wang wrote:
> Sometimes, a virtio device needs to configure an irq affinity hint to
> maximize performance. Instead of just exposing the irq of a virtqueue,
> this patch introduces an API to set the affinity for a virtqueue.
> 
> The API is best-effort: the affinity hint may not be set as expected due
> to platform support, irq sharing or irq type. Currently, only the pci
> method is implemented, and we set the affinity according to:
> 
> - if device uses INTX, we just ignore the request
> - if device has per vq vector, we force the affinity hint
> - if the virtqueues share MSI, make the affinity OR over all affinities
>  requested
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

It looks like both Jason and I will need these patches during the 3.7
merge window, and from different trees (net-next vs. scsi).  How do we
synchronize?

Paolo

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-07-30  6:27       ` Paolo Bonzini
@ 2012-08-09 15:14         ` Paolo Bonzini
  0 siblings, 0 replies; 46+ messages in thread
From: Paolo Bonzini @ 2012-08-09 15:14 UTC (permalink / raw)
  Cc: Michael S. Tsirkin, krkumar2, habanero, kvm, netdev, mashirle,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri

On 30/07/2012 08:27, Paolo Bonzini wrote:
>>>> >> > Did you set the affinity manually in your experiments, or perhaps there
>>>> >> > is a difference between scsi and networking... (interrupt mitigation?)
>> > 
>> > You need to run irqbalancer in guest to make it actually work. Do you?
> Yes, of course, now on to debugging that one.  I just wanted to ask
> before the weekend if I was missing something as obvious as that.

It was indeed an irqbalance bug; it is fixed now upstream.

Paolo


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue
  2012-08-09 15:13   ` Paolo Bonzini
@ 2012-08-09 15:35     ` Avi Kivity
  0 siblings, 0 replies; 46+ messages in thread
From: Avi Kivity @ 2012-08-09 15:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jason Wang, mst, mashirle, krkumar2, habanero, rusty, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, kvm,
	sri

On 08/09/2012 06:13 PM, Paolo Bonzini wrote:
> On 05/07/2012 12:29, Jason Wang wrote:
>> Sometimes, a virtio device needs to configure an irq affinity hint to
>> maximize performance. Instead of just exposing the irq of a virtqueue,
>> this patch introduces an API to set the affinity for a virtqueue.
>> 
>> The API is best-effort: the affinity hint may not be set as expected due
>> to platform support, irq sharing or irq type. Currently, only the pci
>> method is implemented, and we set the affinity according to:
>> 
>> - if device uses INTX, we just ignore the request
>> - if device has per vq vector, we force the affinity hint
>> - if the virtqueues share MSI, make the affinity OR over all affinities
>>  requested
>> 
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> It looks like both I and Jason will need these patches during the 3.7
> merge window, and from different trees (net-next vs. scsi).  How do we
> synchronize?

Get one of them to promise not to rebase, merge it, and base your
patches on top of the merge.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2012-08-09 15:37 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-05 10:29 [net-next RFC V5 0/5] Multiqueue virtio-net Jason Wang
2012-07-05 10:29 ` [net-next RFC V5 1/5] virtio_net: Introduce VIRTIO_NET_F_MULTIQUEUE Jason Wang
2012-07-05 10:29 ` [net-next RFC V5 2/5] virtio_ring: move queue_index to vring_virtqueue Jason Wang
2012-07-05 11:40   ` Sasha Levin
2012-07-06  3:17     ` Jason Wang
2012-07-26  8:20     ` Paolo Bonzini
2012-07-30  3:30       ` Jason Wang
2012-07-05 10:29 ` [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue Jason Wang
2012-07-27 14:38   ` Paolo Bonzini
2012-07-29 20:40     ` Michael S. Tsirkin
2012-07-30  6:27       ` Paolo Bonzini
2012-08-09 15:14         ` Paolo Bonzini
2012-08-09 15:13   ` Paolo Bonzini
2012-08-09 15:35     ` Avi Kivity
2012-07-05 10:29 ` [net-next RFC V5 4/5] virtio_net: multiqueue support Jason Wang
2012-07-05 20:02   ` Amos Kong
2012-07-06  7:45     ` Jason Wang
2012-07-20 13:40   ` Michael S. Tsirkin
2012-07-21 12:02     ` Sasha Levin
2012-07-23  5:54       ` Jason Wang
2012-07-23  9:28         ` Sasha Levin
2012-07-30  3:29           ` Jason Wang
2012-07-29  9:44       ` Michael S. Tsirkin
2012-07-30  3:26         ` Jason Wang
2012-07-30 13:00         ` Sasha Levin
2012-07-23  5:48     ` Jason Wang
2012-07-29  9:50       ` Michael S. Tsirkin
2012-07-30  5:15         ` Jason Wang
2012-07-05 10:29 ` [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq Jason Wang
2012-07-05 12:51   ` Sasha Levin
2012-07-05 20:07     ` Amos Kong
2012-07-06  7:46       ` Jason Wang
2012-07-06  3:20     ` Jason Wang
2012-07-06  6:38       ` Stephen Hemminger
2012-07-06  9:26         ` Jason Wang
2012-07-06  8:10       ` Sasha Levin
2012-07-09 20:13   ` Ben Hutchings
2012-07-20 12:33   ` Michael S. Tsirkin
2012-07-23  5:32     ` Jason Wang
2012-07-05 17:45 ` [net-next RFC V5 0/5] Multiqueue virtio-net Rick Jones
2012-07-06  7:42   ` Jason Wang
2012-07-06 16:23     ` Rick Jones
2012-07-09  3:23       ` Jason Wang
2012-07-09 16:46         ` Rick Jones
2012-07-08  8:19 ` Ronen Hod
2012-07-09  5:35   ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).