From: Wei Wang
Date: Thu, 1 Oct 2020 15:12:20 -0700
Subject: Re: [PATCH net-next 0/5] implement kthread based napi poll
To: Jakub Kicinski
In-Reply-To: <20201001132607.21bcaa17@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
References: <20200930192140.4192859-1-weiwan@google.com>
 <20200930130839.427eafa9@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
 <20201001132607.21bcaa17@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Cc: Eric Dumazet, "David S. Miller", netdev, Hannes Frederic Sowa,
 Paolo Abeni, Felix Fietkau
X-Mailing-List: netdev@vger.kernel.org

On Thu, Oct 1, 2020 at 1:26 PM Jakub Kicinski wrote:
>
> On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> > On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski wrote:
> > > On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> > > > With napi poll moved to kthreads, the scheduler is in charge of
> > > > scheduling both the kthreads handling network load and the user
> > > > threads, and is able to make better decisions. In the previous
> > > > benchmark, if we do this and pin the kthreads processing napi poll
> > > > to specific CPUs, the scheduler is able to schedule user threads
> > > > away from these CPUs automatically.
> > > >
> > > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > > > entity per host, is that a kthread is more configurable than a
> > > > workqueue, and we can leverage existing tuning tools for threads,
> > > > like taskset, chrt, etc., to tune the scheduling class, cpu set,
> > > > and so on. Another reason is that if we eventually want to provide
> > > > a busy poll feature using kernel threads for napi poll, a kthread
> > > > seems more suitable than a workqueue.
> > >
> > > As I said in my reply to the RFC, I see better performance with the
> > > workqueue implementation, so I would hold off until we have more
> > > conclusive results there, as this set adds fairly strong uAPI that
> > > we'll have to support forever.
> >
> > We can make incremental changes; the kthread implementation looks much
> > nicer to us.
>
> Having done two implementations of something more wq-like now,
> I can say with some confidence that it's quite likely not a
> simple extension of this model. And since we'll likely need
> to support switching at runtime, there will be a fast-path
> synchronization overhead.
>
> > The unique work queue is a problem on server class platforms, with
> > NUMA placement.
> > We now have servers with NICs on different NUMA nodes.
>
> Are you saying that the wq code is less NUMA friendly than unpinned
> threads?
>
> > We can not introduce a new model that will make all workloads better
> > without any tuning.
> > If you really think you can do that, think again.
>
> Has Wei tested the wq implementation with real workloads?
>
> All the cover letter has is some basic netperf runs and a vague
> sentence saying "real workload also improved".
>

Yes, I did a round of testing with the workqueue as well. The "real
workload" I mentioned is a Google-internal application benchmark which
involves networking as well as disk ops. There are 2 types of tests there.

One is sustained tests, where the ops/s is pushed very high and the overall
cpu usage is kept above 80%, with various sizes of payload. In this type of
test case, I see a better result with the kthread model compared to the
workqueue in the latency metrics, and similar CPU savings, with some tuning
of the kthreads. (E.g. we limit the kthreads to a pool of CPUs to run on,
to avoid mixing with application threads; I did the same for the workqueue
as well to be fair.)
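For concreteness, the per-thread tuning mentioned above (and the
taskset/chrt point from the cover letter) can be done either from the
shell or programmatically. The snippet below is only an illustrative
sketch, not part of the patch set; the kthread pid, the CPU pool and the
SCHED_FIFO priority are made-up example inputs.

/*
 * Illustrative only -- not part of the patch set.  Pin a napi kthread
 * to a pool of CPUs and move it to SCHED_FIFO, i.e. roughly what
 * "taskset -pc <cpus> <pid>" followed by "chrt -f -p <prio> <pid>"
 * would do from the shell.  The pid and CPU pool are assumed to be
 * known (e.g. looked up from ps).
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static int tune_napi_kthread(pid_t pid, const int *cpus, int ncpus, int prio)
{
	cpu_set_t set;
	struct sched_param sp = { .sched_priority = prio };

	CPU_ZERO(&set);
	for (int i = 0; i < ncpus; i++)
		CPU_SET(cpus[i], &set);

	/* taskset equivalent: restrict the kthread to the CPU pool */
	if (sched_setaffinity(pid, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return -1;
	}

	/* chrt -f equivalent: switch the kthread to SCHED_FIFO */
	if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}

int main(int argc, char **argv)
{
	int pool[] = { 2, 3, 4, 5 };	/* example CPU pool */

	if (argc < 2) {
		fprintf(stderr, "usage: %s <napi-kthread-pid>\n", argv[0]);
		return 1;
	}
	return tune_napi_kthread(atoi(argv[1]), pool, 4, 50) ? 1 : 0;
}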
The other is trace-based tests, where the load is based on actual traces
taken from the real servers. This kind of test has less load and lower
ops/s overall (~25% total cpu usage on the host). In this test case, I
observe a similar amount of latency savings with both the kthread and the
workqueue, but the workqueue seems to have better cpu savings here,
possibly due to a smaller number of threads being woken up to process the
load.

And one reason we would like to push forward with 1 kthread per NAPI is
that we are also trying to do busy polling with the kthread, and it seems
a good model to have 1 kthread dedicated to 1 NAPI to begin with.

> I think it's possible to get something that will be a better default
> for 90% of workloads. Our current model predates SMP by two decades.
> It's pretty bad.
>
> I'm talking about upstream defaults, obviously; maybe you're starting
> from a different baseline configuration than the rest of the world.
>
> > Even the old 'fix' (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
> > "softirq: Let ksoftirqd do its job")
> > had severe issues for latency sensitive jobs.
> >
> > We need to be able to opt in to threads, and let the process scheduler
> > take decisions.
> > If we believe the process scheduler takes bad decisions, it should be
> > reported to the scheduler experts.
>
> I wouldn't expect that the scheduler will learn all by itself how to
> group processes that run identical code for cache efficiency, and how
> to schedule at a 10us scale. I hope I'm wrong.
>
> > I fully support this implementation; I do not want to wait for yet
> > another 'work queue' model or scheduler classes.
>
> I can't sympathize. I don't understand why you're trying to rush this.
> And you're not giving me enough info about your target config to be able
> to understand your thinking.
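As an aside, for anyone following the thread without the patches at hand:
the "1 kthread per NAPI" model being discussed boils down to a poll loop
roughly like the sketch below. This is a simplified illustration, not the
code from the series; the real implementation handles wakeup, re-polling
and napi state transitions more carefully.

/*
 * Simplified illustration of the idea, NOT the code from this series:
 * one kernel thread per napi instance sleeps until the device IRQ
 * handler schedules the napi, then runs the driver's poll callback
 * with a fixed budget.  Re-poll/completion handling and the exact
 * wakeup protocol of the real patches are omitted.
 */
#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/sched.h>

static int napi_kthread_fn(void *data)
{
	struct napi_struct *napi = data;

	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!test_bit(NAPI_STATE_SCHED, &napi->state)) {
			schedule();	/* wait for the IRQ handler to wake us */
			continue;
		}
		__set_current_state(TASK_RUNNING);

		local_bh_disable();
		napi->poll(napi, NAPI_POLL_WEIGHT);	/* fixed budget */
		local_bh_enable();

		cond_resched();		/* let the scheduler preempt/migrate us */
	}
	return 0;
}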