Re: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

From: Andrey Konovalov <andreyknvl@gmail.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: andrey.konovalov@linux.dev, Marco Elver <elver@google.com>,
	Alexander Potapenko <glider@google.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	Andrey Ryabinin <ryabinin.a.a@gmail.com>,
	kasan-dev <kasan-dev@googlegroups.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Vincenzo Frascino <vincenzo.frascino@arm.com>,
	Sami Tolvanen <samitolvanen@google.com>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Peter Collingbourne <pcc@google.com>,
	Evgenii Stepanov <eugenis@google.com>,
	Florian Mayer <fmayer@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrey Konovalov <andreyknvl@google.com>
Subject: Re: [PATCH v3 0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack
Date: Sun, 22 May 2022 00:30:52 +0200	[thread overview]
Message-ID: <CA+fCnZcM-1oxVeZSPHnnwy-9CiksZhWfqEbms-yg22hRjr7EFw@mail.gmail.com> (raw)
In-Reply-To: <YlgVa+AP0g4IYvzN@lakrids>

On Thu, Apr 14, 2022 at 2:37 PM Mark Rutland <mark.rutland@arm.com> wrote:
>

Hi Mark,

Sorry for the delayed response, it took some time getting my hands on
hardware for testing these changes.

> Just to be clear: QEMU TCG mode is *in no way* representative of HW
> performance, and has drastically different performance characteristics
> compared to real HW. Please be very clear when you are quoting
> performance figures from QEMU TCG mode.
>
> Previously you said you were trying to optimize this so that some
> version of KASAN could be enabled in production builds, and the above is
> not a suitable benchmark system for that.

Understood.

My expectation was that performance numbers from QEMU would be close
to hardware. I knew that there are instructions that take longer to be
emulated, but I expected that they would be uniformly spread across
the code.

However, your explanation proved this wrong. This indeed doesn't apply
when measuring the performance of a piece of code with a different
density of function calls.

Thank you for the detailed explanation! Those QEMU arguments will
definitely be handy when I need a faster QEMU setup.

> Is that *actually* what you're trying to enable, or are you just trying
> to speed up running instances under QEMU (e.g. for arm64 Syzkaller runs
> on GCE)?

No, I'm not trying to speed up QEMU. QEMU was just the only setup that
I had access to at that moment.

The goal is to allow enabling stack trace collection in production on
HW_TAGS-enabled devices once those are created.

[...]

> While the SCS unwinder is still faster, the difference is nowhere near
> as pronounced. As I mentioned before, there are changes that we can make
> to the regular unwinder to close that gap somewhat, some of which I
> intend to make as part of ongoing cleanup/rework in that area.

I tried running the same experiments on Pixel 6.

Unfortunately, I was only able to test the OUTLINE SW_TAGS mode
(without STACK instrumentation, as HW_TAGS doesn't support STACK at
the moment.) All of the other modes either fail to flash or fail to
boot with AOSP on Pixel 6 :(

The results are (timestamps were measured when "ALSA device list" was
printed to the kernel log):

sw_tags outline nostacks: 2.218
sw_tags outline: 2.516 (+13.4%)
sw_tags outline nosanitize: 2.364 (+6.5%)
sw_tags outline nosanitize __set_bit: 2.364 (+6.5%)
sw_tags outline nosanitize scs: 2.236 (+0.8%)

Used markings:

nostacks: patch from master-no-stack-traces applied
nosanitize: KASAN_SANITIZE_stacktrace.o := n
__set_bit: set_bit -> __set_bit change applied
scs: patches from up-scs-stacks-v3 applied

First, disabling instrumentation of stacktrace.c is indeed a great
idea for software KASAN modes! I will send a patch for this later.

Changing set_bit to __set_bit seems to make no difference on Pixel 6.

The awesome part is that the overhead of collecting stack traces with
SCS and even saving them into the stack depot is less than 1%.

However once again note, that this is for OUTLINE SW_TAGS without STACK.

> I haven't bothered testing HW_TAGS, because the performance
> characteristics of emulated MTE are also nothing like that of a real HW
> implementation.
>
> So, given that and the problems I mentioned before, I don't think
> there's a justification for adding a separate SCS unwinder. As before,
> I'm still happy to try to make the regular unwinder faster (and I'm
> happy to make changes which benefit QEMU TCG mode if those don't harm
> the maintainability of the unwinder).
>
> NAK to adding an SCS-specific unwinder, regardless of where in the
> source tree that is placed.

I see.

Perhaps, it makes sense to wait until there's HW_TAGS-enabled hardware
available before continuing to look into this. At the end, the
performance overhead for that setup is what matters.

I'll look into improving the performance of the existing unwinder a
bit more. However, I don't think I'll be able to speed it up to < 1%.
Which means that we'll likely need a sample-based approach for HW_TAGS
stack collection to reduce the overhead.

Thank you!