Subject: Re: [RFC PATCH net-next 0/6] implement kthread based napi poll
To: Wei Wang, Magnus Karlsson
Miller" , Network Development , Jakub Kicinski , Eric Dumazet , Paolo Abeni , Hannes Frederic Sowa , Felix Fietkau , =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= References: <20200914172453.1833883-1-weiwan@google.com> From: Eric Dumazet Message-ID: Date: Fri, 25 Sep 2020 19:30:19 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On 9/25/20 7:15 PM, Wei Wang wrote: > On Fri, Sep 25, 2020 at 6:48 AM Magnus Karlsson > wrote: >> >> On Mon, Sep 14, 2020 at 7:26 PM Wei Wang wrote: >>> >>> The idea of moving the napi poll process out of softirq context to a >>> kernel thread based context is not new. >>> Paolo Abeni and Hannes Frederic Sowa has proposed patches to move napi >>> poll to kthread back in 2016. And Felix Fietkau has also proposed >>> patches of similar ideas to use workqueue to process napi poll just a >>> few weeks ago. >>> >>> The main reason we'd like to push forward with this idea is that the >>> scheduler has poor visibility into cpu cycles spent in softirq context, >>> and is not able to make optimal scheduling decisions of the user threads. >>> For example, we see in one of the application benchmark where network >>> load is high, the CPUs handling network softirqs has ~80% cpu util. And >>> user threads are still scheduled on those CPUs, despite other more idle >>> cpus available in the system. And we see very high tail latencies. In this >>> case, we have to explicitly pin away user threads from the CPUs handling >>> network softirqs to ensure good performance. >>> With napi poll moved to kthread, scheduler is in charge of scheduling both >>> the kthreads handling network load, and the user threads, and is able to >>> make better decisions. In the previous benchmark, if we do this and we >>> pin the kthreads processing napi poll to specific CPUs, scheduler is >>> able to schedule user threads away from these CPUs automatically. >>> >>> And the reason we prefer 1 kthread per napi, instead of 1 workqueue >>> entity per host, is that kthread is more configurable than workqueue, >>> and we could leverage existing tuning tools for threads, like taskset, >>> chrt, etc to tune scheduling class and cpu set, etc. Another reason is >>> if we eventually want to provide busy poll feature using kernel threads >>> for napi poll, kthread seems to be more suitable than workqueue. >>> >>> In this patch series, I revived Paolo and Hannes's patch in 2016 and >>> left them as the first 2 patches. Then there are changes proposed by >>> Felix, Jakub, Paolo and myself on top of those, with suggestions from >>> Eric Dumazet. >>> >>> In terms of performance, I ran tcp_rr tests with 1000 flows with >>> various request/response sizes, with RFS/RPS disabled, and compared >>> performance between softirq vs kthread. Host has 56 hyper threads and >>> 100Gbps nic. >>> >>> req/resp QPS 50%tile 90%tile 99%tile 99.9%tile >>> softirq 1B/1B 2.19M 284us 987us 1.1ms 1.56ms >>> kthread 1B/1B 2.14M 295us 987us 1.0ms 1.17ms >>> >>> softirq 5KB/5KB 1.31M 869us 1.06ms 1.28ms 2.38ms >>> kthread 5KB/5KB 1.32M 878us 1.06ms 1.26ms 1.66ms >>> >>> softirq 1MB/1MB 10.78K 84ms 166ms 234ms 294ms >>> kthread 1MB/1MB 10.83K 82ms 173ms 262ms 320ms >>> >>> I also ran one application benchmark where the user threads have more >>> work to do. 
>>>
>>> In this patch series, I revived Paolo's and Hannes's patches from
>>> 2016 and left them as the first 2 patches. On top of those are
>>> changes proposed by Felix, Jakub, Paolo and myself, with suggestions
>>> from Eric Dumazet.
>>>
>>> In terms of performance, I ran tcp_rr tests with 1000 flows with
>>> various request/response sizes, with RFS/RPS disabled, and compared
>>> performance between softirq and kthread. The host has 56 hyperthreads
>>> and a 100Gbps nic.
>>>
>>>          req/resp   QPS     50%tile  90%tile  99%tile  99.9%tile
>>> softirq  1B/1B      2.19M   284us    987us    1.1ms    1.56ms
>>> kthread  1B/1B      2.14M   295us    987us    1.0ms    1.17ms
>>>
>>> softirq  5KB/5KB    1.31M   869us    1.06ms   1.28ms   2.38ms
>>> kthread  5KB/5KB    1.32M   878us    1.06ms   1.26ms   1.66ms
>>>
>>> softirq  1MB/1MB    10.78K  84ms     166ms    234ms    294ms
>>> kthread  1MB/1MB    10.83K  82ms     173ms    262ms    320ms
>>>
>>> I also ran one application benchmark where the user threads have more
>>> work to do. We do see a good amount of tail latency reduction with
>>> the kthread model.
>>
>> I really like this RFC and would encourage you to submit it as a
>> patch. Would love to see it make it into the kernel.
>>
>
> Thanks for the feedback! I am preparing an official patchset for this
> and will send it out soon.
>
>> I see the same positive effects as you when trying it out with AF_XDP
>> sockets. I made some simple experiments where I sent 64-byte packets
>> to a single AF_XDP socket. I have not managed to figure out how to get
>> percentiles out of my load generator, so this is going to be min, avg
>> and max only. The application using the AF_XDP socket just performs a
>> mac swap on the packet and sends it back to the load generator, which
>> then measures the round-trip latency. The kthread is taskset to the
>> same core as ksoftirqd would run on, so in each experiment they always
>> run on the same core id (which is not the same as the application's).
>>
>> Rate 12 Mpps with 0% loss.
>>          Latencies (us)          Delay variation between packets
>>          min    avg    max       avg      max
>> softirq  11.0   17.1   78.4      0.116    63.0
>> kthread  11.2   17.1   35.0      0.116    20.9
>>
>> Rate ~58 Mpps (line rate at 40 Gbit/s) with substantial loss.
>>          Latencies (us)          Delay variation between packets
>>          min    avg    max       avg      max
>> softirq  87.6   194.9  282.6     0.062    25.9
>> kthread  86.5   185.2  271.8     0.061    22.5
>>
>> For the last experiment, I also get 1.5% to 2% higher throughput with
>> your kthread approach. Moreover, just from the per-second throughput
>> printouts from my application, I can see that the kthread numbers are
>> more stable. The softirq numbers can vary quite a lot from second to
>> second, around +-3%, but the kthread numbers are nice and stable. I
>> have not examined why.
>>
>
> Thanks for sharing the results!
>
>> One thing I noticed, though, and I do not know if this is an issue, is
>> that the switch between the two modes does not occur at high packet
>> rates. I have to lower the packet rate to something that makes the
>> core work at less than 100% for it to switch from ksoftirqd to the
>> kthread and vice versa. They just seem too busy to switch at 100% load
>> when the "threaded" sysfs variable is changed.
>>
>
> I think the reason for this is that when load is high, napi_poll()
> probably always exhausts the predefined napi->weight, so it keeps
> re-polling in the current context. The switch can only happen the next
> time ___napi_schedule() is called.

A similar problem happens when /proc/irq/{..}/smp_affinity is changed.
Few drivers actually detect that the affinity has changed (and no longer
includes the current cpu) and force a napi poll complete/exit, so that a
new hardware interrupt is allowed and routed to another cpu.

Presumably the softirq -> kthread transition could be enforced if really
needed.
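To make that last point concrete, below is roughly the kind of check a
driver's poll routine can do when its budget keeps being exhausted but the
irq affinity no longer includes the cpu it is polling on: complete napi and
let the next hardware interrupt re-schedule it on the new cpu. This is a
simplified sketch, not code from any particular driver; struct my_queue and
the my_*() helpers are hypothetical stand-ins for driver internals. A
similar forced completion is presumably what would be needed to make the
softirq -> kthread switch happen under full load.

#include <linux/cpumask.h>
#include <linux/netdevice.h>
#include <linux/smp.h>

struct my_queue {			/* hypothetical driver queue */
	struct napi_struct napi;
	struct cpumask affinity_mask;	/* kept current via an irq affinity notifier */
};

/* hypothetical driver helpers, implemented elsewhere in the driver */
int my_clean_rx(struct my_queue *q, int budget);
void my_enable_irq(struct my_queue *q);

static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_queue *q = container_of(napi, struct my_queue, napi);
	int work_done = my_clean_rx(q, budget);

	if (work_done < budget) {
		/* Ring drained: normal completion, re-arm the interrupt. */
		if (napi_complete_done(napi, work_done))
			my_enable_irq(q);
		return work_done;
	}

	/* Budget exhausted. If the irq affinity has been moved off this
	 * cpu, do not keep re-polling here forever: complete napi and let
	 * the next interrupt fire on the new cpu instead.
	 */
	if (!cpumask_test_cpu(smp_processor_id(), &q->affinity_mask)) {
		napi_complete_done(napi, work_done);
		my_enable_irq(q);
		return budget - 1;	/* tell net_rx_action() we are done */
	}

	return budget;			/* keep polling in the current context */
}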