From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4E71C00528 for ; Sat, 5 Aug 2023 00:19:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229596AbjHEATm (ORCPT ); Fri, 4 Aug 2023 20:19:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39468 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229564AbjHEATl (ORCPT ); Fri, 4 Aug 2023 20:19:41 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EBCC04694; Fri, 4 Aug 2023 17:19:40 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 8541862185; Sat, 5 Aug 2023 00:19:40 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E1B2DC433C9; Sat, 5 Aug 2023 00:19:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1691194779; bh=ZT1E4tYRXqbQsqihOH2CWjK8qBcGifYeeRSfKM5zIZ0=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=eNB9CSV0rl8s9kc4RnWMGHEx+Gk3/Bm1SXIVOLNAUcT03f2n418gs9Y3TFk3126Z6 Ksvij8javjbpZ8IFRzwkxSD7jopKWO9wVE14HT5Cx/xGB2kUn2CaJ/GcRSuK19PI9z IMH8Yqzsl2LeyfYL5rMvJKrO93tvzubKjsbTElYaUQ7p3EjsOEG9G7he/4zzL0cr4b wfTcabeFlhI2NE7N/DAFWd6e10YZjh2UEU74B6PL2kLd4F1rcZkkSPHn+jPV384Akc yN3CZi2mRKgVB5XCMngODjCEza1RyV55w5r2WK2oNsP4qw3Er6VAKEWLwKblkLKU6O lYO2j2bet51iw== Received: by mail-ed1-f41.google.com with SMTP id 4fb4d7f45d1cf-51bece5d935so3352247a12.1; Fri, 04 Aug 2023 17:19:39 -0700 (PDT) X-Gm-Message-State: AOJu0Yx3s8L3bBOUMzvDrxd5XlAWJ9W8ghJidK5D1Dcm6SGSzJOkHYwd dF3jPvgR892IiVTkbObWcbT8VpU3Mhz2dKVnGBY= X-Google-Smtp-Source: AGHT+IEAmBRV778+9HRRu7cirwjztkUnM6/dZPsTbQSkHH9lz59F/kxxLBqF21ZKkF44osjd3hP1F1Sg5EeoVzMEA0k= X-Received: by 2002:a05:6402:1b05:b0:523:f4c:afe7 with SMTP id by5-20020a0564021b0500b005230f4cafe7mr2660325edb.38.1691194777750; Fri, 04 Aug 2023 17:19:37 -0700 (PDT) MIME-Version: 1.0 References: <20210514200743.3026725-1-alex.kogan@oracle.com> <20210514200743.3026725-4-alex.kogan@oracle.com> <20230803085004.GF212435@hirez.programming.kicks-ass.net> <20230803115610.GC214207@hirez.programming.kicks-ass.net> <20230804082531.GL212435@hirez.programming.kicks-ass.net> <20230804182312.GO212435@hirez.programming.kicks-ass.net> In-Reply-To: <20230804182312.GO212435@hirez.programming.kicks-ass.net> From: Guo Ren Date: Fri, 4 Aug 2023 20:19:25 -0400 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock To: Peter Zijlstra Cc: Alex Kogan , linux@armlinux.org.uk, mingo@redhat.com, will.deacon@arm.com, arnd@arndb.de, longman@redhat.com, linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, tglx@linutronix.de, bp@alien8.de, hpa@zytor.com, x86@kernel.org, guohanjun@huawei.com, jglauber@marvell.com, steven.sistare@oracle.com, daniel.m.jordan@oracle.com, dave.dice@oracle.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-arch@vger.kernel.org On Fri, Aug 4, 2023 at 2:24=E2=80=AFPM Peter Zijlstra wrote: > > On Fri, Aug 04, 2023 at 10:17:35AM -0400, Guo Ren wrote: > > > > See, this is where the ARM64 WFE would come in handy; I don't suppose > > > RISC-V has anything like that? > > Em... arm64 smp_cond_load only could save power consumption or release > > the pipeline resources of an SMT processor. When (Node1 cpu64) is in > > the WFE state, it still needs (Node0 cpu1) to write the value to give > > a cross-NUMA signal. So I didn't see what WFE related to reducing > > cross-Numa transactions, or I missed something. Sorry > > The benefit is that WFE significantly reduces the memory traffic. Since > it 'suspends' the core and waits for a write-notification instead of > busy polling the memory location you get a ton less loads. Em... I had a different observation: When a long lock queue appeared by a store buffer delay problem in the lock torture test, we observed all interconnects get into a quiet state, and there was no more memory traffic. All the cores are loop-loading "different" cacheline from their L1 cache, caused by queued_spinlock. So I don't see any memory traffics on the bus. For the LL + WFE, AFAIK, LL is a load instruction that would grab the cacheline from the bus into the L1-cache and set the reservation set (arm may call it exclusive-monitor). If any cacheline invalidation requests (readunique/cleanunique/...) come in, WFE would retire, and the reservation set would be cleared. So from a cacheline perspective, there is no difference between "LL+WFE" and "looping loads." Let's see two scenarios of LL+WFE, multi-cores, and muti-threadings of one = core: - In the multi-cores case, WFE didn't give any more benefits than the loop loading from my perspective. Because the only thing WFE could do is to "suspend core" (I borrowed your word here), but it can't be deep sleep because the response from WFE is the most prior thing. As you said, we should prevent "terribly contended" situations, so WFE must keep fast reactions in the pipeline, not deep sleep. That's WFI stuff. And loop loading also could reduce power consumption through the proper micro-arch design: When the pipeline gets into a loop loading state, the loop buffer mechanism start, no instructions fetch happens, the frontend component can suspend for a while, and the only working components are "loop buffer" and "LSU load path." Other components could suspend for a while. So loop loading is not as terrible as you thought. - In the multi-threading of one core case, introducing an ISA instruction (WFE) to solve the loop loading problem is worthwhile because the thread could release the resource of the processor's pipe line. > -- Best Regards Guo Ren