From: John Fastabend
Subject: [RFC PATCH 00/13] Series short description
Date: Wed, 17 Aug 2016 12:33:04 -0700
Message-ID: <20160817193120.27032.20918.stgit@john-Precision-Tower-5810>
To: xiyou.wangcong@gmail.com, jhs@mojatatu.com, alexei.starovoitov@gmail.com,
    eric.dumazet@gmail.com, brouer@redhat.com
Cc: john.r.fastabend@intel.com, netdev@vger.kernel.org,
    john.fastabend@gmail.com, davem@davemloft.net

I've been working on this for a while now and figured it's time for a v2
RFC. As usual, any comments, suggestions, observations, musings, etc. are
appreciated.

This is the latest round of the lockless qdisc patch set, with performance
measured primarily by using pktgen to inject packets into the qdisc layer.
Some simple netperf tests are included below as well, but those still need
to be done properly.

This v2 RFC fixes a couple of flaws in the original series. The first
major one was that the per-cpu accounting of qlen is not correct with
respect to the qdisc bypass: with per-cpu qlen counters, a flow can
enqueue packets into the qdisc, then get scheduled on another core and
bypass the qdisc completely if that core is idle, creating out-of-order
packets. I've reworked the logic to use an atomic qlen, which is _correct_
now but unfortunately costs a lot in performance. With a single pfifo_fast
and 12 pktgen threads I still see a ~200k pps improvement even with atomic
accounting, so it is still a win, but nothing like the +1Mpps gain without
the atomic accounting. On the mq tests, atomic vs per-cpu seems to be in
the noise, I believe because the mq qdisc is already aligned with one
pfifo_fast qdisc per core and the XPS setup I'm running maps queues 1:1.
Any thoughts on this would be interesting to hear. My general thinking is
to submit the atomic version for inclusion and then start improving it
with a few of the items listed below.

Additionally, I've added a __netif_schedule() to the bad_skb_tx path;
otherwise I observed a packet getting stuck on the bad_txq_cpu pointer and
sitting in the qdisc structure until it was kicked again by another packet
or a netif_schedule(). And on the netif_schedule() topic, to support
per-cpu handling of gso and bad_txq_cpu we have to allow the
netif_schedule() logic to fire per cpu as well.

Otherwise, a bunch of small stylistic changes were made. I still need to
do another pass to catch checkpatch warnings/errors and to clean up the
statistics if/else branching a bit more. This series carries both the
atomic qlen code and the per-cpu qlen code while I continue to think up
some scheme to get around the atomic qlen cost. But the series seems to be
working.
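To make the bypass problem and the fix a little more concrete, here is a
minimal sketch of the atomic qlen accounting idea. The qlen_atomic field
and the helper names are illustrative only (not necessarily what the
patches add); it assumes an atomic_t counter hung off struct Qdisc for
NOLOCK qdiscs:

#include <linux/atomic.h>
#include <net/sch_generic.h>

/* Sketch only: assumes struct Qdisc grows an atomic_t qlen_atomic member
 * for NOLOCK qdiscs; names here are illustrative, not the series' API.
 */
static inline int qdisc_qlen_atomic(const struct Qdisc *q)
{
	return atomic_read(&q->qlen_atomic);
}

static inline void qdisc_qlen_atomic_inc(struct Qdisc *q)
{
	atomic_inc(&q->qlen_atomic);	/* on every successful enqueue */
}

static inline void qdisc_qlen_atomic_dec(struct Qdisc *q)
{
	atomic_dec(&q->qlen_atomic);	/* on every dequeue */
}

/* With per-cpu qlen a sender can enqueue on CPU A, migrate to CPU B where
 * the local count is still zero, and then take the bypass ahead of its
 * own queued packets. A single atomic counter keeps the bypass test exact
 * on every CPU:
 */
static inline bool nolock_qdisc_may_bypass(struct Qdisc *q)
{
	return (q->flags & TCQ_F_CAN_BYPASS) &&
	       qdisc_qlen_atomic(q) == 0 &&
	       qdisc_run_begin(q);
}

The obvious cost is that every enqueue/dequeue now touches a cacheline
shared across all producing CPUs, which is presumably where most of the
gap between the per-cpu and atomic pfifo_fast numbers below comes from.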
Future work is the following,

 - convert all qdiscs over to per-cpu handling and clean up the rather
   ugly if/else statistics branching. Although it's a bit of work, it's
   mechanical and should help some.
 - I'm looking at fq_codel to see how to make it "lockless".
 - it seems we can drop the TX_HARD_LOCK in cases where the nic exposes a
   queue per core, now that enqueue/dequeue are decoupled. The idea being
   a bunch of threads enqueue while the per-core dequeue logic runs.
   Requires XPS to be set up.
 - qlen improvements somehow.
 - look at improvements to the skb_array structure. We can look at
   drop-in replacements and/or improving it.

Below is the data I took from pktgen,

  ./samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -t $NUM -i eth3

I did a run of 4 each time and took the total summation of each thread
(the left-hand column below is the pktgen thread count; values are
pkts/sec per run). There are four different configurations for the mq and
pfifo_fast cases. "without qlen atomic" uses per-cpu qlen values and
allows bypassing the qdisc via the bypass flag; this is incorrect but
shows the impact of having an atomic in the mix. "with qlen atomic" shows
the correct implementation, with atomics and bypass enabled. And finally
"without qlen atomic and no bypass" uses per-cpu qlen values and disables
bypass to ensure out-of-order packets are not created. The existing locked
qdiscs are included as a baseline. To be clear, the patches submitted here
correspond to the "with qlen atomic" metrics.

nolock pfifo_fast (without qlen atomic)
 1: 1440293 1421602 1409553 1393469 1424543
 2: 1754890 1819292 1727948 1797711 1743427
 4: 3282665 3344095 3315220 3332777 3348972
 8: 2940079 1644450 2950777 2922085 2946310
12: 2042084 2610060 2857581 3493162 3104611

nolock pfifo_fast (with qlen atomic)
 1: 1425231 1417176 1402862 1432880
 2: 1631437 1633398 1630867 1628816
 4: 1704383 1709900 1706274 1710198
 8: 1348672 1344343 1339072 1334288
12: 1262988 1280724 1262237 1262615

nolock pfifo_fast (without qlen atomic and no bypass)
 1: 1435796 1458522 1471855 1455658
 2: 1880642 1876359 1872879 1884578
 4: 1922935 1914589 1912832 1912116
 8: 1585055 1576887 1577086 1570236
12: 1479273 1450706 1447056 1466330

lock (pfifo_fast)
 1: 1471479 1469142 1458825 1456788 1453952
 2: 1746231 1749490 1753176 1753780 1755959
 4: 1119626 1120515 1121478 1119220 1121115
 8: 1001471  999308 1000318 1000776 1000384
12:  989269  992122  991590  986581  990430

nolock (mq with per cpu qlen)
 1:  1435952  1459523  1448860  1385451  1435031
 2:  2850662  2855702  2859105  2855443  2843382
 4:  5288135  5271192  5252242  5270192  5311642
 8: 10042731 10018063  9891813  9968382  9956727
12: 13265277 13384199 13438955 13363771 13436198

nolock (mq with qlen atomic)
 1:  1558253  1562285  1555037  1558422
 2:  2917449  2952852  2921697  2892313
 4:  5518243  5375300  5625724  5219599
 8: 10183153 10169389 10163161 10202530
12: 13877976 13459987 13081520 13996757

nolock (mq with !bypass and per cpu qlen)
 1:  1369110  1379992  1359407  1397014
 2:  2575546  2557471  2580782  2593226
 4:  4632570  4871850  4830725  4968439
 8:  8974135  8951107  9134641  9084347
12: 12982673 12737426 12808364

lock (mq)
 1:  1448374  1444208  1437459  1437088  1452453
 2:  2687963  2679221  2651059  2691630  2667479
 4:  5153884  4684153  5091728  4635261  4902381
 8:  9292395  9625869  9681835  9711651  9660498
12: 13553918 13682410 14084055 13946138 13724726
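On the skb_array item in the future work list above: the pattern a
lockless pfifo_fast needs from the ring is roughly the following. This is
only a sketch against the existing skb_array API; the wrapper names are
mine, and band selection, per-cpu stats, and drop accounting are omitted:

#include <linux/netdevice.h>
#include <linux/skb_array.h>
#include <linux/skbuff.h>

/* Sketch only: one skb_array per pfifo_fast band is assumed. */
static int nolock_band_enqueue(struct sk_buff *skb, struct skb_array *band)
{
	/* skb_array_produce() is multi-producer safe and returns -ENOSPC
	 * when the ring is full.
	 */
	if (unlikely(skb_array_produce(band, skb))) {
		kfree_skb(skb);		/* real code would bump drop stats */
		return NET_XMIT_DROP;
	}
	return NET_XMIT_SUCCESS;
}

static struct sk_buff *nolock_band_dequeue(struct skb_array *band)
{
	/* consumer side; returns NULL when the band is empty */
	return skb_array_consume(band);
}

Any drop-in replacement or improvement to the ring would mostly need to
preserve this produce/consume contract.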
######################################################

A few arbitrary netperf sessions... (TBD lots of sessions, etc).

nolock (mq with !bypass and per cpu qlen)

root@john-Precision-Tower-5810:~# netperf -H 22.1 -t TCP_RR -- -s 128K -S 128K -b 0
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    19910.37
262144 262144

nolock (pfifo_fast with !bypass and per cpu qlen)

root@john-Precision-Tower-5810:~# netperf -H 22.1 -t TCP_RR -- -s 128K -S 128K -b 0
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    20358.90
262144 262144

nolock (mq with qlen atomic)

root@john-Precision-Tower-5810:/home/john/git/kernel.org/master# netperf -H 22.1 -t TCP_RR -- -s 128K -S 128K -b 0
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    20202.38
262144 262144

nolock (pfifo_fast with qlen atomic)

root@john-Precision-Tower-5810:/home/john/git/kernel.org/master# netperf -H 22.1 -t TCP_RR -- -s 128K -S 128K -b 0
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 22.1 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262144 262144 1        1       10.00    20059.41
262144 262144

lock (mq)

TBD

lock (pfifo_fast)

TBD

---

John Fastabend (13):
      net: sched: allow qdiscs to handle locking
      net: sched: qdisc_qlen for per cpu logic
      net: sched: provide per cpu qstat helpers
      net: sched: provide atomic qlen helpers for bypass case
      net: sched: a dflt qdisc may be used with per cpu stats
      net: sched: per cpu gso handlers
      net: sched: support qdisc_reset on NOLOCK qdisc
      net: sched: support skb_bad_tx with lockless qdisc
      net: sched: helper to sum qlen
      net: sched: lockless support for netif_schedule
      net: sched: pfifo_fast use alf_queue
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio


 include/net/gen_stats.h   |    3
 include/net/pkt_sched.h   |    4
 include/net/sch_generic.h |  127 ++++++++++++++
 net/core/dev.c            |   60 +++++--
 net/core/gen_stats.c      |    9 +
 net/sched/sch_api.c       |   21 ++
 net/sched/sch_generic.c   |  404 +++++++++++++++++++++++++++++++++++----------
 net/sched/sch_mq.c        |   25 ++-
 net/sched/sch_mqprio.c    |   61 ++++---
 9 files changed, 577 insertions(+), 137 deletions(-)