From mboxrd@z Thu Jan 1 00:00:00 1970
From: Marco Elver <elver@google.com>
Date: Tue, 28 Jun 2022 18:00:34 +0200
Message-ID: (elided)
Subject: Re: [PATCH v2 13/13] perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets
To: Dmitry Vyukov
Cc: Mark Rutland, linux-sh@vger.kernel.org, Peter Zijlstra, Frederic Weisbecker,
 x86@kernel.org, linuxppc-dev@lists.ozlabs.org, Arnaldo Carvalho de Melo,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Alexander Shishkin, kasan-dev@googlegroups.com, Namhyung Kim,
 Thomas Gleixner, Jiri Olsa, Ingo Molnar
References: <20220628095833.2579903-1-elver@google.com> <20220628095833.2579903-14-elver@google.com>
List-Id: Linux on PowerPC Developers Mail List
Content-Type: text/plain; charset="UTF-8"

On Tue, 28 Jun 2022 at 17:45, Dmitry Vyukov wrote:
>
> On Tue, 28 Jun 2022 at 11:59, Marco Elver wrote:
> >
> > We can still see that a majority of the time is spent hashing task pointers:
> >
> > ...
> > 16.98%  [kernel]  [k] rhashtable_jhash2
> > ...
> >
> > Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus),
> > calling task_bp_pinned() for each CPU, even if task_bp_pinned() is
> > CPU-independent. The reason for this is to update the per-CPU
> > 'tsk_pinned' histogram.
> >
> > To optimize the CPU-independent case to O(1), keep a separate
> > CPU-independent 'tsk_pinned_all' histogram.
> >
> > The major source of complexity is transitions between "all
> > CPU-independent task breakpoints" and "mixed CPU-independent and
> > CPU-dependent task breakpoints". The code comments list all cases that
> > require handling.
> >
> > After this optimization:
> >
> > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > | Total time: 1.758 [sec]
> > |
> > |   34.336621 usecs/op
> > | 4395.087500 usecs/op/cpu
> >
> > 38.08%  [kernel]  [k] queued_spin_lock_slowpath
> > 10.81%  [kernel]  [k] smp_cfm_core_cond
> >  3.01%  [kernel]  [k] update_sg_lb_stats
> >  2.58%  [kernel]  [k] osq_lock
> >  2.57%  [kernel]  [k] llist_reverse_order
> >  1.45%  [kernel]  [k] find_next_bit
> >  1.21%  [kernel]  [k] flush_tlb_func_common
> >  1.01%  [kernel]  [k] arch_install_hw_breakpoint
> >
> > Showing that the time spent hashing keys has become insignificant.
> >
> > With the given benchmark parameters, that's an improvement of 12%
> > compared with the old O(#cpus) version.
> >
> > And finally, using the less aggressive parameters from the preceding
> > changes, we now observe:
> >
> > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > | Total time: 0.067 [sec]
> > |
> > |   35.292187 usecs/op
> > | 2258.700000 usecs/op/cpu
> >
> > Which is an improvement of 12% compared to without the histogram
> > optimizations (baseline is 40 usecs/op). This is now on par with the
> > theoretical ideal (constraints disabled), and only 12% slower than no
> > breakpoints at all.
> >
> > Signed-off-by: Marco Elver
>
> Reviewed-by: Dmitry Vyukov
>
> I don't see any bugs. But the code is quite complex. Does it make
> sense to add some asserts to the histogram type? E.g. that counters
> don't underflow, and that a weight is not negative (e.g. an
> accidentally added -1 returned from task_bp_pinned()). Not sure if
> that will be enough to catch all types of bugs, though.
> Could KUnit tests check that the histograms are all 0's at the end?
>
> I am not just asking about the current code (which may be correct),
> but also about future modifications to this code. I'll think of some
> more options.

bp_slots_histogram_max*() already has asserts (WARN about underflow;
some with KCSAN's help).

The main thing I did to raise my own confidence in the code was to
inject bugs and see if the KUnit test catches them. If it didn't, I
extended the tests. I'll do that some more, maybe.
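[Editor's note] As a rough illustration of the bookkeeping being discussed, here is a minimal userspace sketch. All names, sizes, and semantics here are simplified assumptions, not the patch itself (the real code lives in kernel/events/hw_breakpoint.c and handles many more cases): a CPU-independent 'tsk_pinned_all' histogram alongside the per-CPU 'tsk_pinned' histograms makes toggling a CPU-independent task breakpoint O(1) instead of O(#cpus), and it includes underflow/negative-weight asserts plus an all-zeros end-state check of the kind suggested above.

```c
/*
 * Hedged sketch, not the kernel code: a simplified userspace model of
 * the tsk_pinned / tsk_pinned_all bookkeeping. Names and sizes are
 * made up for illustration.
 */
#include <assert.h>

#define NR_SLOTS 4	/* hypothetical number of breakpoint slots */
#define NR_CPUS  4	/* hypothetical CPU count */

/* Per-CPU histograms of pinned task breakpoints. */
static int tsk_pinned[NR_CPUS][NR_SLOTS];
/* Single CPU-independent histogram shared across all CPUs. */
static int tsk_pinned_all[NR_SLOTS];

/* Old path: a CPU-independent toggle still walks every CPU, O(#cpus). */
static void toggle_percpu(int weight, int enable)
{
	assert(weight >= 0);	/* e.g. catch a stray -1 from a helper */
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		tsk_pinned[cpu][weight] += enable ? 1 : -1;
		assert(tsk_pinned[cpu][weight] >= 0);	/* no underflow */
	}
}

/* New path: a CPU-independent target touches one histogram, O(1). */
static void toggle_all(int weight, int enable)
{
	assert(weight >= 0);
	tsk_pinned_all[weight] += enable ? 1 : -1;
	assert(tsk_pinned_all[weight] >= 0);	/* no underflow */
}

/* A slot's pinned count on a CPU must now consider both histograms. */
static int slot_pinned(int cpu, int weight)
{
	return tsk_pinned[cpu][weight] + tsk_pinned_all[weight];
}

/* End-of-test invariant: after undoing everything, all counts are 0. */
static int all_zero(void)
{
	for (int w = 0; w < NR_SLOTS; w++) {
		if (tsk_pinned_all[w])
			return 0;
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			if (tsk_pinned[cpu][w])
				return 0;
	}
	return 1;
}
```

A KUnit-style test under this model would exercise toggles and then assert `all_zero()` at the end, which is the zero-sum check proposed in the reply above.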