Re: [PATCH v6] arm64: implement ftrace with regs

From: Julien Thierry <julien.thierry@arm.com>
To: Mark Rutland <mark.rutland@arm.com>,
	Balbir Singh <bsingharora@gmail.com>
Cc: Torsten Duwe <duwe@lst.de>, Will Deacon <will.deacon@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Arnd Bergmann <arnd@arndb.de>,
	AKASHI Takahiro <takahiro.akashi@linaro.org>,
	Amit Daniel Kachhap <amit.kachhap@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, live-patching@vger.kernel.org
Subject: Re: [PATCH v6] arm64: implement ftrace with regs
Date: Wed, 16 Jan 2019 18:01:01 +0000
Message-ID: <82f231a8-c757-da97-bbce-33ac6199a4d9@arm.com>
In-Reply-To: <e82df4bd-804e-de46-d1d9-93b56f9db9c8@arm.com>

On 16/01/2019 15:56, Julien Thierry wrote:
> On 14/01/2019 12:26, Mark Rutland wrote:
>> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>>> On Fri, Jan 04, 2019 at 05:50:18PM +0000, Mark Rutland wrote:
>>>> Hi Torsten,
>>>>
>>>> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
>>>>> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
>>>>> of each function. Replace the first NOP thus generated with a quick LR
>>>>> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
>>>>> to ftrace, does not clobber the value. Ftrace will then generate the
>>>>> standard stack frames.
>>>
>>> Do we know what the overhead would be, if this was a link time change
>>> for the first instruction?
>>
>> No, but it should be possible to benchmark that for a given workload,
>> which is what I'd like to see.
>>
> 
> So, I hacked up something to have the -fpatchable-function-entry=2 in the
> build and then have ftrace_init() patch in the "mov x9, lr" in the first
> nop of the function preludes.
> 
> I tested it on an 8 x Cortex-A57 machine and compared with a version
> that just has the two nops in the function prelude.
> 
> On workloads like hackbench, the average difference is within the noise
> (<1%). Time results below are in seconds.
> 
> 	+------------+--------------------+
> 	| "nop; nop" | "mov x9, lr; nop"  |
> 	+------------+--------------------+
> 	|     43.497 |             42.694 |
> 	|     43.464 |             43.148 |
> 	|     43.599 |             43.131 |
> 	|     43.785 |              43.63 |
> 	|     43.458 |             43.281 |
> 	|       44.3 |             43.328 |
> 	|     43.541 |             43.059 |
> 	|     43.529 |             43.298 |
> 	|      43.58 |             43.937 |
> 	|     43.385 |             43.122 |
> 	|     43.514 |             43.825 |
> 	|     45.508 |             43.268 |
> 	|     43.757 |             43.316 |
> 	|     43.392 |             43.146 |
> 	|     44.029 |             43.236 |
> 	|     43.515 |             43.139 |
> 	|      43.22 |             43.108 |
> 	|     43.496 |             43.836 |
> 	|     43.669 |             43.083 |
> 	|     43.388 |              43.38 |
> 	+------------+--------------------+
> average	|    43.6813 |           43.29825 |
> 	+------------+--------------------+
> 
Here are also some results from running hackbench on a 4 x Cortex-A53
machine (pay no attention to the fact that the timescales are similar to
the A57 run; I changed the number of iterations done by hackbench so it
wouldn't take too long).

	+------------+-------------------+
	| "nop; nop" | "mov x9, lr; nop" |
	+------------+-------------------+
	|     43.815 |            44.455 |
	|     43.758 |            45.173 |
	|     44.075 |             43.95 |
	|     44.021 |            44.185 |
	|     43.959 |            44.826 |
	|     44.039 |            44.478 |
	|     43.836 |            44.626 |
	|     44.071 |            45.177 |
	|     43.619 |            45.033 |
	|     44.052 |            45.095 |
	|     43.903 |            44.802 |
	|     43.773 |            44.955 |
	|     43.908 |             45.02 |
	|     43.441 |            44.986 |
	|     44.167 |            45.182 |
	|     44.106 |            45.229 |
	|     43.974 |             45.07 |
	|     43.859 |            45.283 |
	|     43.706 |            44.892 |
	|     43.897 |            44.194 |
	+------------+-------------------+
average |     43.899 |            44.835 |
        +------------+-------------------+

So, in this case the performance takes a ~2% hit from keeping the mov
always present in the function prelude instead of a nop.
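
For anyone wanting to redo the arithmetic, here is a small sketch that
derives the relative change from the averages reported in the two
tables. The helper names (rel_change_pct, print_overheads) are made up
for illustration; only the four averages come from the measurements.

```c
#include <stdio.h>

/*
 * Relative change of `patched` versus `baseline`, in percent.
 * Positive means the "mov x9, lr; nop" kernel was slower.
 * (Hypothetical helper, not part of any kernel or hackbench API.)
 */
static double rel_change_pct(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}

/* Print the relative change for both machines. */
static void print_overheads(void)
{
	/* Averages copied from the tables in this thread, in seconds. */
	printf("Cortex-A57: %+.2f%%\n", rel_change_pct(43.6813, 43.29825));
	printf("Cortex-A53: %+.2f%%\n", rel_change_pct(43.899, 44.835));
}
```

Calling print_overheads() from a main() should give a change of a bit
under 1% in favour of the mov on the A57 and a bit over 2% against it on
the A53, matching the "within the noise" and "~2%" readings.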

This makes it a bit less obvious whether always having that mov there
(whether patched at build time or run time) is good enough.
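
To make the run-time variant concrete, the following is a rough
user-space sketch of what "patch the first nop" amounts to. It works on
a plain array of instruction words, not on kernel text; patch_first_nop
is a hypothetical name and this is not the hack I actually ran. A real
implementation would have to go through the kernel's instruction
patching helpers and do the required cache maintenance.

```c
#include <stddef.h>
#include <stdint.h>

#define INSN_NOP       0xd503201fu /* A64 "nop" */
#define INSN_MOV_X9_LR 0xaa1e03e9u /* A64 "mov x9, x30" (orr x9, xzr, x30) */

/*
 * Hypothetical helper: replace the first of the two entry NOPs emitted
 * by -fpatchable-function-entry=2 with the LR saver, leaving the second
 * NOP for the later call to ftrace.
 *
 * Returns 0 on success, -1 if the two words at `entry` are not the
 * expected "nop; nop" pair (in which case nothing is modified).
 */
static int patch_first_nop(uint32_t *entry)
{
	if (entry[0] != INSN_NOP || entry[1] != INSN_NOP)
		return -1;

	entry[0] = INSN_MOV_X9_LR;
	return 0;
}
```

The check on both words is only there so the sketch refuses to touch
anything that does not look like an unpatched function entry.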

Cheers,

-- 
Julien Thierry