From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marco Elver <elver@google.com>
Date: Wed, 20 Jul 2022 17:39:06 +0200
Subject: Re: [PATCH v3 04/14] perf/hw_breakpoint: Optimize list of per-task breakpoints
To: Ian Rogers
Cc: Peter Zijlstra, Frederic Weisbecker, Ingo Molnar, Thomas Gleixner,
    Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin, Jiri Olsa,
    Namhyung Kim, Dmitry Vyukov, Michael Ellerman,
    linuxppc-dev@lists.ozlabs.org, linux-perf-users@vger.kernel.org,
    x86@kernel.org, linux-sh@vger.kernel.org, kasan-dev@googlegroups.com,
    linux-kernel@vger.kernel.org
References: <20220704150514.48816-1-elver@google.com> <20220704150514.48816-5-elver@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Wed, 20 Jul 2022 at 17:29, Ian Rogers wrote:
>
> On Mon, Jul 4, 2022 at 8:06 AM Marco Elver wrote:
> >
> > On a machine with 256 CPUs, running the recently added perf breakpoint
> > benchmark results in:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > | Total time: 236.418 [sec]
> > |
> > |  123134.794271 usecs/op
> > | 7880626.833333 usecs/op/cpu
> >
> > The benchmark tests inherited breakpoint perf events across many
> > threads.
> >
> > Looking at a perf profile, we can see that the majority of the time is
> > spent in various hw_breakpoint.c functions, which execute within the
> > 'nr_bp_mutex' critical sections, which in turn results in contention on
> > that mutex as well:
> >
> >   37.27%  [kernel]  [k] osq_lock
> >   34.92%  [kernel]  [k] mutex_spin_on_owner
> >   12.15%  [kernel]  [k] toggle_bp_slot
> >   11.90%  [kernel]  [k] __reserve_bp_slot
> >
> > The culprit here is task_bp_pinned(), which has a runtime complexity of
> > O(#tasks) due to storing all task breakpoints in the same list and
> > iterating through that list looking for a matching task. Clearly, this
> > does not scale to thousands of tasks.
> >
> > Instead, make use of the "rhashtable" variant "rhltable", which stores
> > multiple items with the same key in a list. This results in an average
> > runtime complexity of O(1) for task_bp_pinned().
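[ To make the shape of the change concrete, here is a simplified sketch
of the rhltable setup and lookup -- illustrative only, with error
handling and some details elided; the real code lives in
kernel/events/hw_breakpoint.c: ]

  /*
   * Sketch (simplified from the patch): task breakpoints are keyed by
   * the target task pointer, so all breakpoints of one task hang off a
   * single bucket list. Needs <linux/rhashtable.h>; find_slot_idx()
   * and hw_breakpoint_weight() are existing hw_breakpoint.c helpers.
   */
  static struct rhltable task_bps_ht;
  static const struct rhashtable_params task_bps_ht_params = {
          .head_offset = offsetof(struct hw_perf_event, bp_list),
          .key_offset = offsetof(struct hw_perf_event, target),
          .key_len = sizeof_field(struct hw_perf_event, target),
          .automatic_shrinking = true,
  };

  /* Average O(1): only this task's breakpoints are visited. */
  static int task_bp_pinned(int cpu, struct perf_event *bp,
                            enum bp_type_idx type)
  {
          struct rhlist_head *head, *pos;
          struct perf_event *iter;
          int count = 0;

          rcu_read_lock();
          head = rhltable_lookup(&task_bps_ht, &bp->hw.target,
                                 task_bps_ht_params);
          if (head) {
                  rhl_for_each_entry_rcu(iter, pos, head, hw.bp_list) {
                          if (find_slot_idx(iter->attr.bp_type) == type &&
                              (iter->cpu < 0 || cpu == iter->cpu))
                                  count += hw_breakpoint_weight(iter);
                  }
          }
          rcu_read_unlock();
          return count;
  }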
> > With the optimization, the benchmark shows:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | # Running 'breakpoint/thread' benchmark:
> > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > | Total time: 0.208 [sec]
> > |
> > |  108.422396 usecs/op
> > | 6939.033333 usecs/op/cpu
> >
> > On this particular setup, that's a speedup of ~1135x.
> >
> > While one option would be to make task_struct a breakpoint list node,
> > this would only further bloat task_struct for infrequently used data.
> > Furthermore, after all optimizations in this series, there's no evidence
> > it would result in better performance: later optimizations make the time
> > spent looking up entries in the hash table negligible (we'll reach the
> > theoretical ideal performance, i.e. no constraints).
> >
> > Signed-off-by: Marco Elver
> > Reviewed-by: Dmitry Vyukov
> > ---
> > v2:
> > * Commit message tweaks.
> > ---
> >  include/linux/perf_event.h    |  3 +-
> >  kernel/events/hw_breakpoint.c | 56 ++++++++++++++++++++++-------------
> >  2 files changed, 37 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 01231f1d976c..e27360436dc6 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -36,6 +36,7 @@ struct perf_guest_info_callbacks {
> >  };
> >
> >  #ifdef CONFIG_HAVE_HW_BREAKPOINT
> > +#include <linux/rhashtable-types.h>
> >  #include <asm/hw_breakpoint.h>
> >  #endif
> >
> > @@ -178,7 +179,7 @@ struct hw_perf_event {
> >                          * creation and event initalization.
> >                          */
> >                         struct arch_hw_breakpoint       info;
> > -                       struct list_head                bp_list;
> > +                       struct rhlist_head              bp_list;
>
> nit: perhaps it would be more intention-revealing here to rename this
> to bp_hashtable?

The naming convention for uses of rhlist_head appears to be either
'list' or 'node' (also inside lib/rhashtable.c). I think this makes
sense, because internally this struct is used just to append to the
bucket's list.
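To illustrate, the update path treats the field purely as a node to
link into or unlink from the task's bucket list -- a rough sketch (in
the patch this happens inside toggle_bp_slot(); the standalone helper
name below is made up for illustration):

  static int toggle_task_bp(struct perf_event *bp, bool enable)
  {
          /* The rhlist_head is only ever (un)linked as a list node. */
          if (enable)
                  return rhltable_insert(&task_bps_ht, &bp->hw.bp_list,
                                         task_bps_ht_params);
          return rhltable_remove(&task_bps_ht, &bp->hw.bp_list,
                                 task_bps_ht_params);
  }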
> Acked-by: Ian Rogers

Thanks!

-- 
Marco