From: Julien Thierry <julien.thierry@arm.com>
To: Mark Rutland <mark.rutland@arm.com>,
	Balbir Singh <bsingharora@gmail.com>
Cc: Torsten Duwe <duwe@lst.de>, Will Deacon <will.deacon@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Ingo Molnar <mingo@redhat.com>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Arnd Bergmann <arnd@arndb.de>,
	AKASHI Takahiro <takahiro.akashi@linaro.org>,
	Amit Daniel Kachhap <amit.kachhap@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, live-patching@vger.kernel.org
Subject: Re: [PATCH v6] arm64: implement ftrace with regs
Date: Wed, 16 Jan 2019 18:01:01 +0000
Message-ID: <82f231a8-c757-da97-bbce-33ac6199a4d9@arm.com>
In-Reply-To: <e82df4bd-804e-de46-d1d9-93b56f9db9c8@arm.com>



On 16/01/2019 15:56, Julien Thierry wrote:
> On 14/01/2019 12:26, Mark Rutland wrote:
>> On Mon, Jan 14, 2019 at 11:13:59PM +1100, Balbir Singh wrote:
>>> On Fri, Jan 04, 2019 at 05:50:18PM +0000, Mark Rutland wrote:
>>>> Hi Torsten,
>>>>
>>>> On Fri, Jan 04, 2019 at 03:10:53PM +0100, Torsten Duwe wrote:
>>>>> Use -fpatchable-function-entry (gcc8) to add 2 NOPs at the beginning
>>>>> of each function. Replace the first NOP thus generated with a quick LR
>>>>> saver (move it to scratch reg x9), so the 2nd replacement insn, the call
>>>>> to ftrace, does not clobber the value. Ftrace will then generate the
>>>>> standard stack frames.
>>>
>>> Do we know what the overhead would be, if this was a link time change
>>> for the first instruction?
>>
>> No, but it should be possible to benchmark that for a given workload,
>> which is what I'd like to see.
>>
> 
> So, I hacked up something to have -fpatchable-function-entry=2 in the
> build and then have ftrace_init() patch the "mov x9, lr" into the first
> nop of the function preludes.
> 
> I tested it on an 8 x Cortex-A57 machine and compared with a version
> that just has the two nops in the function prelude.
> 
> On workloads like hackbench, the average difference is within the noise
> (<1%). Time results below are in seconds.
> 
> 	+------------+--------------------+
> 	| "nop; nop" | "mov x9, lr; nop"  |
> 	+------------+--------------------+
> 	|     43.497 |             42.694 |
> 	|     43.464 |             43.148 |
> 	|     43.599 |             43.131 |
> 	|     43.785 |              43.63 |
> 	|     43.458 |             43.281 |
> 	|       44.3 |             43.328 |
> 	|     43.541 |             43.059 |
> 	|     43.529 |             43.298 |
> 	|      43.58 |             43.937 |
> 	|     43.385 |             43.122 |
> 	|     43.514 |             43.825 |
> 	|     45.508 |             43.268 |
> 	|     43.757 |             43.316 |
> 	|     43.392 |             43.146 |
> 	|     44.029 |             43.236 |
> 	|     43.515 |             43.139 |
> 	|      43.22 |             43.108 |
> 	|     43.496 |             43.836 |
> 	|     43.669 |             43.083 |
> 	|     43.388 |              43.38 |
> 	+------------+--------------------+
> average	|    43.6813 |           43.29825 |
> 	+------------+--------------------+
> 
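As a rough check on the "within the noise" reading, here is a minimal
sketch that recomputes the per-column mean and sample standard deviation
from the 20 Cortex-A57 runs quoted above. The function and variable names
are illustrative only, and using the standard deviation as the noise
measure is an assumption, not something the benchmark itself prescribes.

```python
import statistics

# hackbench times in seconds, copied from the Cortex-A57 table above
nop_nop = [43.497, 43.464, 43.599, 43.785, 43.458, 44.3, 43.541, 43.529,
           43.58, 43.385, 43.514, 45.508, 43.757, 43.392, 44.029, 43.515,
           43.22, 43.496, 43.669, 43.388]
mov_nop = [42.694, 43.148, 43.131, 43.63, 43.281, 43.328, 43.059, 43.298,
           43.937, 43.122, 43.825, 43.268, 43.316, 43.146, 43.236, 43.139,
           43.108, 43.836, 43.083, 43.38]


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    return statistics.mean(samples), statistics.stdev(samples)


def mean_gap(a, b):
    """Absolute difference between the means of two sets of timings."""
    return abs(statistics.mean(a) - statistics.mean(b))


if __name__ == "__main__":
    for name, samples in (("nop; nop", nop_nop), ("mov x9, lr; nop", mov_nop)):
        mean, stdev = summarize(samples)
        print(f"{name:>16}: mean={mean:.5f}s stdev={stdev:.3f}s")
    print(f"gap between means: {mean_gap(nop_nop, mov_nop):.3f}s")
```

The means reproduce the averages in the table. The gap between them comes
out smaller than the run-to-run standard deviation of the "nop; nop"
column, which is consistent with treating the difference as noise.
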
Here are also some results running hackbench on 4 x Cortex-A53 (pay no
attention to the fact that the timescales are similar: I changed the
number of iterations done by hackbench so it wouldn't take too long).

	+------------+-------------------+
	| "nop; nop" | "mov x9, lr; nop" |
	+------------+-------------------+
	|     43.815 |            44.455 |
	|     43.758 |            45.173 |
	|     44.075 |             43.95 |
	|     44.021 |            44.185 |
	|     43.959 |            44.826 |
	|     44.039 |            44.478 |
	|     43.836 |            44.626 |
	|     44.071 |            45.177 |
	|     43.619 |            45.033 |
	|     44.052 |            45.095 |
	|     43.903 |            44.802 |
	|     43.773 |            44.955 |
	|     43.908 |             45.02 |
	|     43.441 |            44.986 |
	|     44.167 |            45.182 |
	|     44.106 |            45.229 |
	|     43.974 |             45.07 |
	|     43.859 |            45.283 |
	|     43.706 |            44.892 |
	|     43.897 |            44.194 |
	+------------+-------------------+
average |     43.899 |            44.835 |
        +------------+-------------------+


So, in this case performance takes a ~2% hit from keeping the mov
always present in the function prelude instead of a nop.

That makes it a bit less obvious whether always having that mov there
(whether patched at build time or run time) is good enough.
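
For reference, the percentages can be derived from the reported averages
as sketched below. The helper name is made up for illustration, and the
inputs are the "average" rows of the two tables, not a re-run of the
benchmark.

```python
def overhead_pct(baseline_s, candidate_s):
    """Relative change of candidate vs. baseline run time, in percent."""
    return 100.0 * (candidate_s - baseline_s) / baseline_s


# Averages in seconds, as reported in the "average" rows of the two tables:
# baseline is "nop; nop", candidate is "mov x9, lr; nop".
a57 = overhead_pct(43.6813, 43.29825)  # 8 x Cortex-A57
a53 = overhead_pct(43.899, 44.835)     # 4 x Cortex-A53

if __name__ == "__main__":
    print(f"Cortex-A57: {a57:+.2f}%")
    print(f"Cortex-A53: {a53:+.2f}%")
```

With those averages this gives roughly -0.9% on the Cortex-A57 machine
(the mov variant was marginally faster there) and roughly +2.1% on the
Cortex-A53 machine.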

Cheers,

-- 
Julien Thierry
