From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B28FC622 for ; Thu, 8 Jun 2023 00:39:54 +0000 (UTC) Received: from mail-qv1-xf35.google.com (mail-qv1-xf35.google.com [IPv6:2607:f8b0:4864:20::f35]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8AC0D2682 for ; Wed, 7 Jun 2023 17:39:52 -0700 (PDT) Received: by mail-qv1-xf35.google.com with SMTP id 6a1803df08f44-62606e67c0dso891796d6.2 for ; Wed, 07 Jun 2023 17:39:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1686184791; x=1688776791; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=uTx3BpjebejifzfngoNrlbTBlfkivcsKVKIXVNKp5rM=; b=JE1jcXDfSEOvjt6aRXS7gYGwYy1fUzP8JFgGpUZVgH1+l0NffNw/vrnNMU9+zvJXXE X5E66xEABxynaLzdS+MNSSrYoUo6XecxW8d2urY9LENsXGYy1+vnNMKqL4HTKB63t5wJ rG2orjrPrkwVG4dMg/20uRPK08i6V37sshUm2qVDjwXmAGSCCOCledJfyFrElAzqLAIP BoitHqWd7JXANmyEzPUdQpmx/CsNSUgHQY5IQ4mnr3wD/mXiAGdqW80lxlWdkS+uBT8c 8I9uhTdiFAMeueuoMQu5KFWeTS//fK+v928pht6GIaAX1gem266j1Fcl2Yzke054cIoT CrUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686184791; x=1688776791; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=uTx3BpjebejifzfngoNrlbTBlfkivcsKVKIXVNKp5rM=; b=UvNWMDFRsVVPHITZOA4f1MQW5UKXZeCJR3yXMn5Fz1ixbv8IYlHH2hj5f/2fVZyrT5 vLiddqpoPYa2DMW59N6Fu4X/p9WwjVSbollI7zpINT30eXagj9woAVCD1qA2iFk9Z7Kv R5cOyg3vpElwuc0mXs+dBNGM/AeldpduIYyQJSghfItCpALbuJh2R/4LzgfuEQt0vAnQ 3rPKtUozdjoz9aFw8uMD5lFK+FTquw0ddpK3ANWZxwvcs206QjjWJMfF2YmKn1/Ntb7V /ZLCpbEazsdae4QvNVKsvyu77rDYK0ARgS0KWz4srRgVjfZTZjOENQIvGTFBvNJgNYU3 nSNg== X-Gm-Message-State: AC+VfDxPSCJaRdLGXUhxtBF9TVpALAZXhyzbhP6X/DdiT5QUwRwMn0ia 8GyEa28DES/QoD2xPtRrow== X-Google-Smtp-Source: ACHHUZ4fGij+iE20gn0pvBPSymiE7Suwo7giivj0nf2XHGJ1Aq8rzf0Bz/i97RG07I+08GBSbPICDw== X-Received: by 2002:a05:6214:dcc:b0:625:75dd:236d with SMTP id 12-20020a0562140dcc00b0062575dd236dmr18991qvt.23.1686184791481; Wed, 07 Jun 2023 17:39:51 -0700 (PDT) Received: from C02FL77VMD6R.googleapis.com ([2600:1700:d860:12b0:7cbc:7b16:951c:eeaa]) by smtp.gmail.com with ESMTPSA id y9-20020a0cf149000000b006262304b2fesm21210qvl.79.2023.06.07.17.39.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Jun 2023 17:39:51 -0700 (PDT) Date: Wed, 7 Jun 2023 17:39:46 -0700 From: Peilin Ye To: Vlad Buslov Cc: Jamal Hadi Salim , Jakub Kicinski , Pedro Tammela , "David S. Miller" , Eric Dumazet , Paolo Abeni , Cong Wang , Jiri Pirko , Peilin Ye , Daniel Borkmann , John Fastabend , Hillf Danton , netdev@vger.kernel.org, Cong Wang Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Message-ID: References: <20230526193324.41dfafc8@kernel.org> <87jzwrxrz8.fsf@nvidia.com> <87fs7fxov6.fsf@nvidia.com> <87bki2xb3d.fsf@nvidia.com> <877csny9rd.fsf@nvidia.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <877csny9rd.fsf@nvidia.com> X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote: > >> >> If livelock with concurrent filters insertion is an issue, then it can > >> >> be remedied by setting a new Qdisc->flags bit > >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with > >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter > >> >> insertion coming after the flag is set to synchronize on rtnl lock. > >> > > >> > Thanks for the suggestion! I'll try this approach. > >> > > >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of > >> > the "being-deleted" Qdisc. I'll try forcing "late" requests (that arrive > >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without > >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for > >> > even longer?). > >> > >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find() > >> already returns -EINVAL when q->refcnt is zero, so maybe returning > >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is > >> set is also fine? Would be much easier to implement as opposed to moving > >> rtnl_lock there. > > > > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m > > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it > > "request A") while it has a lot of ongoing filter requests, and here's the > > result: > > > > #1 #2 #3 #4 > > ---------------------------------------------------------- > > a. refcnt 89 93 230 571 > > b. replayed 167,568 196,450 336,291 878,027 > > c. time real 0m2.478s 0m2.746s 0m3.693s 0m9.461s > > user 0m0.000s 0m0.000s 0m0.000s 0m0.000s > > sys 0m0.623s 0m0.681s 0m1.119s 0m2.770s > > > > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time; > > b. is the number of times A has been replayed; > > c. is the time(1) output for A. > > > > a. and b. are collected from printk() output. This is better than before, > > but A could still be replayed for hundreds of thousands of times and hang > > for a few seconds. > > I don't get where does few seconds waiting time come from. I'm probably > missing something obvious here, but the waiting time should be the > maximum filter op latency of new/get/del filter request that is already > in-flight (i.e. already passed qdisc_is_destroying() check) and it > should take several orders of magnitude less time. Yeah I agree, here's what I did: In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic loop: $ echo "1 1 32" > /sys/bus/netdevsim/new_device $ tc qdisc add dev eth1 ingress $ for (( i=1; i<=3000; i++ )) > do > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > done When the loop is running, I delete the Qdisc in Terminal 2: $ time tc qdisc delete dev eth1 ingress Which took seconds on average. However, if I specify a unique "prio" when adding filters in that loop, e.g.: $ for (( i=1; i<=3000; i++ )) > do > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 & > done ^^^^^^^ Then deleting the Qdisc in Terminal 2 becomes a lot faster: real 0m0.712s user 0m0.000s sys 0m0.152s In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did yet another test [1], which looks a lot better. When I didn't specify "prio", sometimes that rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns -EEXIST. Is it because that concurrent add-filter requests auto-allocated the same "prio" number, so they collided with each other? Do you think this is related to why it's slow? Thanks, Peilin Ye [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances in -batch mode that keeps adding a unique filter (with "prio" and "handle" specified) then deletes it. Again, when they are running I delete the ingress Qdisc, and here's the result: #1 #2 #3 #4 ---------------------------------------------------------- a. refcnt 64 63 64 64 b. replayed 169 5,630 887 3,442 c. time real 0m0.171s 0m0.147s 0m0.186s 0m0.111s user 0m0.000s 0m0.009s 0m0.001s 0m0.000s sys 0m0.112s 0m0.108s 0m0.115s 0m0.104s