From: Václav Zindulka
Date: Mon, 18 May 2020 16:16:38 +0200
Subject: Re: iproute2: tc deletion freezes whole server
To: Cong Wang
Cc: Linux Kernel Network Developers <netdev@vger.kernel.org>

On Sun, May 17, 2020 at 9:35 PM Cong Wang wrote:
>
> On Fri, May 8, 2020 at 6:59 AM Václav Zindulka wrote:
> > > > >
> > > > > I tried to emulate your test case in my VM, here is the script I use:
> > > > >
> > > > > ====
> > > > > ip li set dev dummy0 up
> > > > > tc qd add dev dummy0 root handle 1: htb default 1
> > > > > for i in `seq 1 1000`
> > > > > do
> > > > > tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
> > > > > tc qd add dev dummy0 parent 1:$i fq_codel
> > > > > done
> > > > >
> > > > > time tc qd del dev dummy0 root
> > > > > ====
> > > > >
> > > > > And this is the result:
> > > > >
> > > > > Before my patch:
> > > > > real 0m0.488s
> > > > > user 0m0.000s
> > > > > sys 0m0.325s
> > > > >
> > > > > After my patch:
> > > > > real 0m0.180s
> > > > > user 0m0.000s
> > > > > sys 0m0.132s
> > > >
> > > > My results with your test script.
> > > >
> > > > before patch:
> > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > real 1.63
> > > > user 0.00
> > > > sys 1.63
> > > >
> > > > after patch:
> > > > /usr/bin/time -p tc qdisc del dev enp1s0f0 root
> > > > real 1.55
> > > > user 0.00
> > > > sys 1.54
> > > >
> > > > > This is an obvious improvement, so I have no idea why you didn't
> > > > > catch any difference.
> > > >
> > > > We use hfsc instead of htb. I don't know whether that may cause any
> > > > difference. I can provide you with my test scripts if necessary.
> > >
> > > Yeah, you can try to replace the htb with hfsc in my script,
> > > I didn't spend time to figure out hfsc parameters.
> >
> > class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
> > 13107200 ul m1 0 d 0 m2 13107200
> >
> > but it behaves the same as htb...
> >
> > > My point here is, if I can see the difference with merely 1000
> > > tc classes, you should see a bigger difference with hundreds
> > > of thousands of classes in your setup. So I don't know why you
> > > saw a relatively smaller difference.
> >
> > I saw a relatively big difference. It was about 1.5 s faster on my huge
> > setup, which is a lot. Yet maybe the problem is caused by something
>
> What percentage? IIUC, without the patch it took you about 11 s, so
> 1.5 s faster means a 13% improvement for you?

My whole setup needs 22.17 seconds to delete with an unpatched kernel.
With your patches applied it takes 21.08 s, so the gain varies between
1 and 1.5 s across runs. The improvement is about 5 - 6%.

> > else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
> > queues. SFP+ interfaces have much higher limits - 8 or even 64 possible
> > queues. I've tried to increase the number of queues using ethtool from
> > 4 to 8 and to decrease it to 2, but there was no difference. It was
> > about 1.62 - 1.63 s with an unpatched kernel and about 1.55 - 1.58 s
> > with your patches applied. I've tried it for ifb and RJ45 interfaces,
> > where it took about 0.02 - 0.03 s with an unpatched kernel and 0.05 s
> > with your patches applied, which is strange, but it may be because it
> > was very fast even before.
>
> That is odd. In fact, this is highly related to the number of TX queues,
> because the existing code resets the qdiscs once for each TX
> queue, so the more TX queues you have, the more resets the kernel
> will do, and the more time it will take.

Couldn't the problem be that the reset is done on both active and
inactive queues every time? That would explain why decreasing or
increasing the number of active queues had no effect.
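For context, the queue walker driving these resets looks like this (my
reading of include/linux/netdevice.h - this should be the helper the
patch attached below modifies, but treat the name and exact signature
as my assumption, they may differ between kernel versions):

====
/* Applies f() to every allocated TX queue of the device. Note the
 * loop bound: num_tx_queues is the number of queues allocated at
 * device registration time, not the number currently in use
 * (real_num_tx_queues). */
static inline void netdev_for_each_tx_queue(struct net_device *dev,
					    void (*f)(struct net_device *,
						      struct netdev_queue *,
						      void *),
					    void *arg)
{
	unsigned int i;

	for (i = 0; i < dev->num_tx_queues; i++)
		f(dev, &dev->_tx[i], arg);
}
====

Since qdisc teardown walks the queues through this helper, every
delete touches all allocated queues, not only the active ones.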
Yet it doesn't explain why an Intel card (82599ES) with 64 possible
queues has exactly the same problem as a Mellanox (ConnectX-4 Lx) with
8 possible queues.

I've been playing with this problem today. Every deleted fq_codel
qdisc, root and non-root, makes the kernel empty all possible queues
of the network adapter - not just the active ones, but every possible
queue, even those the adapter can't use. On every SFP+ interface I
have tested it calls fq_codel_reset() (it would call any other qdisc's
reset function the same way) 64 times for the egress qdisc and 64
times for the ingress qdisc, even when no ingress is defined. The
solution to this whole problem would be to reset only the active
queues. On the RJ45 interface I've tested there are 8 possible queues,
so the reset is done 8 times for every deleted qdisc - 16 in total,
since ingress and egress are processed every time.

So a little bit of calculation. My initial setup contained 13706 qdisc
rules. For ifb that means 13706 resets for egress and 13706 for
ingress - 27412 reset calls, because an ifb interface has only one
transmit queue. The average time spent in fq_codel_reset() according
to my initial perf reports is somewhere between 7 and 16 microseconds.
27412 * 0.000008 = 0.219296 s. For an RJ45 interface it does 8 calls
per qdisc: 13706 * 8 * 2 = 219296 resets, and 219296 * 0.000008 =
1.754368 s - still OK. For an SFP+ interface it is 64 resets per qdisc
rule: 13706 * 64 * 2 = 1754368, which is very close to the huge number
I initially noticed using printk. And 1754368 * 0.000008 = 14.034944 s;
in the case of the slowest calls, 1754368 * 0.000016 = 28.069888 s.
Voilà. Gotcha.

So my final judgement is: don't empty something we don't use anyway.
For Intel it may be reasonable - it can have all 64 queues defined.
Mellanox has its limit at 8, yet it is still reset 64 times. We mostly
decrease the number of queues to 4, sometimes 2, depending on the CPU
used. Yet every CPU had to handle 64 resets.

With the attached patch I'm down to 1.7 seconds - a more than 90%
improvement :-) Can you please check it and pass it to the proper
places? According to debugging printk messages it empties only the
active queues. Thank you for all your help and effort.

>
> I plan to address this later on top of the existing patches.
>
> Thanks.

netdevice_num_tx_queues.patch:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2135,7 +2135,7 @@
 {
 	unsigned int i;
 
-	for (i = 0; i < dev->num_tx_queues; i++)
+	for (i = 0; i < dev->real_num_tx_queues; i++)
 		f(dev, &dev->_tx[i], arg);
 }
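For anyone who wants to double-check the arithmetic above, here is a
throwaway sanity check (plain userspace C, nothing kernel-specific;
the 8 and 16 microsecond figures are just the fast and slow ends of my
perf numbers):

====
#include <stdio.h>

int main(void)
{
	const long qdiscs = 13706;  /* qdisc rules in my initial setup */
	const long dirs = 2;        /* egress and ingress are both walked */
	const long queues[] = { 1, 8, 64 };          /* ifb, RJ45, SFP+ */
	const char *name[]  = { "ifb", "RJ45", "SFP+" };

	for (int i = 0; i < 3; i++) {
		long resets = qdiscs * queues[i] * dirs;
		/* 8 us fast case and 16 us slow case per fq_codel_reset() */
		printf("%-4s %8ld resets: %6.2f s .. %6.2f s\n",
		       name[i], resets, resets * 8e-6, resets * 16e-6);
	}
	return 0;
}
====

It prints 0.22 s for ifb and 1.75 s for RJ45 in the fast case, and
14.03 - 28.07 s for SFP+, matching the figures above.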