[PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
@ 2012-12-06 18:11 Jon Medhurst (Tixy)
  2012-12-06 19:19 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-06 18:11 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Russell King, Steven Rostedt, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

When the generic ftrace implementation modifies code for trace-points it
uses stop_machine() to call ftrace_modify_all_code() on one CPU. This
ultimately calls the ARM specific function ftrace_modify_code() which
updates the instruction and then does flush_icache_range(). As this
cache flushing only operates on the local CPU then other cores may end
up executing the old instruction if it's still in their icaches.

This may or may not cause problems for the use of ftrace on kernels
compiled for ARM instructions. However, Thumb2 instructions can straddle
two cache lines so its possible for half the old instruction to be in
the cache and half the new one, leading to the CPU executing garbage.

This patch fixes this situation by providing an arch-specific
implementation of arch_ftrace_update_code() which ensures that after one
core has modified all the code, the other cores invalidate their icaches
before continuing.

Signed-off-by: Jon Medhurst <tixy@linaro.org>
---
 arch/arm/kernel/ftrace.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
index 34e5664..38b670c 100644
--- a/arch/arm/kernel/ftrace.c
+++ b/arch/arm/kernel/ftrace.c
@@ -14,6 +14,7 @@

 #include <linux/ftrace.h>
 #include <linux/uaccess.h>
+#include <linux/stop_machine.h>

 #include <asm/cacheflush.h>
 #include <asm/opcodes.h>
@@ -156,6 +157,39 @@ int ftrace_make_nop(struct module *mod,
 	return ret;
 }

+struct afmc_data {
+	int command;
+	atomic_t cpu;
+	atomic_t done;
+};
+
+static int __arch_ftrace_modify_code(void *data)
+{
+	struct afmc_data *afmcd = data;
+
+	if (atomic_inc_return(&afmcd->cpu) == num_online_cpus()) {
+		/* Last cpu to get into this function does the actual work */
+		ftrace_modify_all_code(afmcd->command);
+		wmb();
+		atomic_set(&afmcd->done, true);
+	} else {
+		/* Other cpus wait for the code modifications to be done */
+		rmb();
+		while (!atomic_read(&afmcd->done))
+			cpu_relax();
+		/* Ensure icache is consistent with the code changes */
+		__flush_icache_all();
+	}
+
+	return 0;
+}
+
+void arch_ftrace_update_code(int command)
+{
+	struct afmc_data afmcd = { command };
+	stop_machine(__arch_ftrace_modify_code, &afmcd, cpu_online_mask);
+}
+
 int __init ftrace_dyn_arch_init(void *data)
 {
 	*(unsigned long *)data = 0;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-06 18:11 [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus Jon Medhurst (Tixy)
@ 2012-12-06 19:19 ` Steven Rostedt
  2012-12-07  9:22   ` Jon Medhurst (Tixy)
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-06 19:19 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Thu, 2012-12-06 at 18:11 +0000, Jon Medhurst (Tixy) wrote:
> When the generic ftrace implementation modifies code for trace-points it
> uses stop_machine() to call ftrace_modify_all_code() on one CPU. This
> ultimately calls the ARM specific function ftrace_modify_code() which
> updates the instruction and then does flush_icache_range(). As this
> cache flushing only operates on the local CPU then other cores may end
> up executing the old instruction if it's still in their icaches.
> 
> This may or may not cause problems for the use of ftrace on kernels
> compiled for ARM instructions. However, Thumb2 instructions can straddle
> two cache lines so its possible for half the old instruction to be in
> the cache and half the new one, leading to the CPU executing garbage.

Hmm, your use of "may or may not" seems as you may not know this answer.
I wonder if you can use the break point method as x86 does now, and
remove the stop machine completely. Basically this is how it works:

add sw breakpoints to all locations to modify (the bp handler just does
a nop over the instruction).

send an IPI to all CPUs to flush their icache.

Modify the non breakpoint part of the instruction with the new
instruction.

send an IPI to all CPUs to flush their icache

Replace the breakpoint with the finished instruction.

Then you don't suffer the stomp_machine() latency hit. The system will
slow a bit due to the breakpoints but there wont be a huge "halt" in the
middle of processing.

-- Steve

> 
> This patch fixes this situation by providing an arch-specific
> implementation of arch_ftrace_update_code() which ensures that after one
> core has modified all the code, the other cores invalidate their icaches
> before continuing.
> 
> Signed-off-by: Jon Medhurst <tixy@linaro.org>
> ---



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-06 19:19 ` Steven Rostedt
@ 2012-12-07  9:22   ` Jon Medhurst (Tixy)
  2012-12-07 14:03     ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-07  9:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote:
> Hmm, your use of "may or may not" seems as you may not know this answer.
> I wonder if you can use the break point method as x86 does now, and
> remove the stop machine completely. Basically this is how it works:
> 
> add sw breakpoints to all locations to modify (the bp handler just does
> a nop over the instruction).
> 
> send an IPI to all CPUs to flush their icache.
> 
> Modify the non breakpoint part of the instruction with the new
> instruction.
> 
> send an IPI to all CPUs to flush their icache
> 
> Replace the breakpoint with the finished instruction.

If I understand correctly then this method can't work on ARM because a
'software breakpoint' is 'replace an instruction with a known undefined
instruction _of the same size_'. It haa to be the same size because code
like this:

	it eq       /* If condition code 'eq' true */
        insA        /* then execute this instruction */
        insB        /* Always execute this */

if we replace insA with a breakpoint which is shorter, then we have

	it eq       /* If condition code 'eq' true */
        bkpt        /* then execute the breakpoint */
        insA-part2  /* Always execute this garbage */
        insB        /* Always execute this */

and to complicate matters more, the 'it' instruction can make up to the
next four instructions conditional, so you can't reverse decode the
instruction stream reliably to even detect such code.

And further, it's implementation defined (up to who every creates the
silicon) whether an undefined instructions actually causes an abort when
it occurs in such an 'it' block, it may just execute as a nop.

Welcome to the work of ARM :-)

-- 
Tixy



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07  9:22   ` Jon Medhurst (Tixy)
@ 2012-12-07 14:03     ` Steven Rostedt
  2012-12-07 14:55       ` Jon Medhurst (Tixy)
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 14:03 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 09:22 +0000, Jon Medhurst (Tixy) wrote:
> On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote:
> > Hmm, your use of "may or may not" seems as you may not know this answer.
> > I wonder if you can use the break point method as x86 does now, and
> > remove the stop machine completely. Basically this is how it works:
> > 
> > add sw breakpoints to all locations to modify (the bp handler just does
> > a nop over the instruction).
> > 
> > send an IPI to all CPUs to flush their icache.
> > 
> > Modify the non breakpoint part of the instruction with the new
> > instruction.
> > 
> > send an IPI to all CPUs to flush their icache
> > 
> > Replace the breakpoint with the finished instruction.
> 
> If I understand correctly then this method can't work on ARM because a
> 'software breakpoint' is 'replace an instruction with a known undefined
> instruction _of the same size_'. It haa to be the same size because code
> like this:
> 
> 	it eq       /* If condition code 'eq' true */
>         insA        /* then execute this instruction */
>         insB        /* Always execute this */
> 
> if we replace insA with a breakpoint which is shorter, then we have
> 
> 	it eq       /* If condition code 'eq' true */
>         bkpt        /* then execute the breakpoint */
>         insA-part2  /* Always execute this garbage */

Why always execute the garbage? Do what we do in x86, where the
breakpoint is only 1 byte and the instruction being replaced is 5 bytes.
The breakpoint handler returns to the instruction after the
"garbage" (insB).

>         insB        /* Always execute this */
> 
> and to complicate matters more, the 'it' instruction can make up to the
> next four instructions conditional, so you can't reverse decode the
> instruction stream reliably to even detect such code.
> 
> And further, it's implementation defined (up to who every creates the
> silicon) whether an undefined instructions actually causes an abort when
> it occurs in such an 'it' block, it may just execute as a nop.
> 
> Welcome to the work of ARM :-)
> 

But also realize that function tracing is special :-) We have no cases
like this. The instruction being replaced is a call to mcount. In fact,
we replace it at boot with a nop. And this method only replaces that nop
into a call to function tracer, or replaces the call to function tracer
back to a nop. Always at the start of the function, and never involved
with conditionals. This limitation that function tracing imposes on what
we replace makes things a bit more sane in how we replace it.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 14:03     ` Steven Rostedt
@ 2012-12-07 14:55       ` Jon Medhurst (Tixy)
  2012-12-07 15:28         ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-07 14:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 09:03 -0500, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 09:22 +0000, Jon Medhurst (Tixy) wrote:
> > On Thu, 2012-12-06 at 14:19 -0500, Steven Rostedt wrote:
> > > Hmm, your use of "may or may not" seems as you may not know this answer.
> > > I wonder if you can use the break point method as x86 does now, and
> > > remove the stop machine completely. Basically this is how it works:
> > > 
> > > add sw breakpoints to all locations to modify (the bp handler just does
> > > a nop over the instruction).
> > > 
> > > send an IPI to all CPUs to flush their icache.
> > > 
> > > Modify the non breakpoint part of the instruction with the new
> > > instruction.
> > > 
> > > send an IPI to all CPUs to flush their icache
> > > 
> > > Replace the breakpoint with the finished instruction.
> > 
> > If I understand correctly then this method can't work on ARM because a
> > 'software breakpoint' is 'replace an instruction with a known undefined
> > instruction _of the same size_'. It haa to be the same size because code
> > like this:
> > 
> > 	it eq       /* If condition code 'eq' true */
> >         insA        /* then execute this instruction */
> >         insB        /* Always execute this */
> > 
> > if we replace insA with a breakpoint which is shorter, then we have
> > 
> > 	it eq       /* If condition code 'eq' true */
> >         bkpt        /* then execute the breakpoint */
> >         insA-part2  /* Always execute this garbage */
> 
> Why always execute the garbage? Do what we do in x86, where the
> breakpoint is only 1 byte and the instruction being replaced is 5 bytes.

We don't get any say in the matter, if the condition is false, then the
CPU will skip over bkpt and go on to execute insA-part2, that's how the
instruction set works. If the condition is true, then it's
implementation defined whether the CPU will skip bkt or not.

The 'it' instruction is a separate instruction to insA, it's not any
kind of prefix used to make a more complex single instruction. You can
do something like:

	iteet eq   /* if-then-else-else-then */
        insA       /* executed if 'eq' */
        insB       /* executed if not 'eq' */
        insC       /* executed if not 'eq' */
        insD       /* executed if 'eq' */

which is five separate CPU instructions, and you can get interrupted
after any of them (the current state of conditional instruction
execution is stored in the status register). Replacing insB with a
shorter 'breakpoint' would give...

	iteet eq   /* if-then-else-else-then */
        insA       /* executed if 'eq' */
        bkpt       /* executed if not 'eq' (implementation defined) */
        insB-part2 /* executed if not 'eq' (garbage) */
        insC       /* executed if 'eq' */
        insD       /* always executed */

which is not good ;-)

> The breakpoint handler returns to the instruction after the
> "garbage" (insB).
> 
> >         insB        /* Always execute this */
> > 
> > and to complicate matters more, the 'it' instruction can make up to the
> > next four instructions conditional, so you can't reverse decode the
> > instruction stream reliably to even detect such code.
> > 
> > And further, it's implementation defined (up to who every creates the
> > silicon) whether an undefined instructions actually causes an abort when
> > it occurs in such an 'it' block, it may just execute as a nop.
> > 
> > Welcome to the work of ARM :-)
> > 
> 
> But also realize that function tracing is special :-) We have no cases
> like this. The instruction being replaced is a call to mcount. In fact,
> we replace it at boot with a nop. And this method only replaces that nop
> into a call to function tracer, or replaces the call to function tracer
> back to a nop. Always at the start of the function, and never involved
> with conditionals. This limitation that function tracing imposes on what
> we replace makes things a bit more sane in how we replace it.

Then perhaps the method you suggest will work on ARM :-). However, that
is not something I personally propose to implement at this time. (I was
doing my good Samaritan act by trying to fix the crashes which another
team was getting when trying to use ftrace.)

-- 
Tixy


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 14:55       ` Jon Medhurst (Tixy)
@ 2012-12-07 15:28         ` Steven Rostedt
  2012-12-07 15:40           ` Jon Medhurst (Tixy)
  2012-12-07 16:23           ` Russell King - ARM Linux
  0 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 15:28 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 14:55 +0000, Jon Medhurst (Tixy) wrote:
>  
> > But also realize that function tracing is special :-) We have no cases
> > like this. The instruction being replaced is a call to mcount. In fact,
> > we replace it at boot with a nop. And this method only replaces that nop
> > into a call to function tracer, or replaces the call to function tracer
> > back to a nop. Always at the start of the function, and never involved
> > with conditionals. This limitation that function tracing imposes on what
> > we replace makes things a bit more sane in how we replace it.
> 
> Then perhaps the method you suggest will work on ARM :-). However, that
> is not something I personally propose to implement at this time. (I was
> doing my good Samaritan act by trying to fix the crashes which another
> team was getting when trying to use ftrace.)
> 

I'm not NACKing your previous patch, I was just suggesting to bring ARM
up to the future :-)

I have no problems with the patch, but I just want to put it out there
that there's better ways. It's part of the remove stomp_machine()
crusade ;-)

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 15:28         ` Steven Rostedt
@ 2012-12-07 15:40           ` Jon Medhurst (Tixy)
  2012-12-07 16:09             ` Steven Rostedt
  2012-12-07 16:23           ` Russell King - ARM Linux
  1 sibling, 1 reply; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-07 15:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 10:28 -0500, Steven Rostedt wrote:
> I'm not NACKing your previous patch, I was just suggesting to bring ARM
> up to the future :-)

ARM is the future ;-)

> 
> I have no problems with the patch, but I just want to put it out there
> that there's better ways. It's part of the remove stomp_machine()
> crusade ;-)

Indeed, when I first cam across stop_machine(), I though 'yuck!',
especially when I realised Linux scheduling seems to mean the system
goes idle until the the next tick triggers scheduling of these 'highest
priority' threads.

-- 
Tixy


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 15:40           ` Jon Medhurst (Tixy)
@ 2012-12-07 16:09             ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 16:09 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: linux-arm-kernel, Russell King, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 15:40 +0000, Jon Medhurst (Tixy) wrote:
> On Fri, 2012-12-07 at 10:28 -0500, Steven Rostedt wrote:
> > I'm not NACKing your previous patch, I was just suggesting to bring ARM
> > up to the future :-)
> 
> ARM is the future ;-)
> 

I had a strong feeling you would reply with that.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 15:28         ` Steven Rostedt
  2012-12-07 15:40           ` Jon Medhurst (Tixy)
@ 2012-12-07 16:23           ` Russell King - ARM Linux
  2012-12-07 16:36             ` Steven Rostedt
  1 sibling, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-07 16:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, Dec 07, 2012 at 10:28:54AM -0500, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 14:55 +0000, Jon Medhurst (Tixy) wrote:
> >  
> > > But also realize that function tracing is special :-) We have no cases
> > > like this. The instruction being replaced is a call to mcount. In fact,
> > > we replace it at boot with a nop. And this method only replaces that nop
> > > into a call to function tracer, or replaces the call to function tracer
> > > back to a nop. Always at the start of the function, and never involved
> > > with conditionals. This limitation that function tracing imposes on what
> > > we replace makes things a bit more sane in how we replace it.
> > 
> > Then perhaps the method you suggest will work on ARM :-). However, that
> > is not something I personally propose to implement at this time. (I was
> > doing my good Samaritan act by trying to fix the crashes which another
> > team was getting when trying to use ftrace.)
> > 
> 
> I'm not NACKing your previous patch, I was just suggesting to bring ARM
> up to the future :-)
> 
> I have no problems with the patch, but I just want to put it out there
> that there's better ways. It's part of the remove stomp_machine()
> crusade ;-)

That's fine if there are better ways.  If your view is that this would
bring things "up to the future" consider this: what you suggest is possible
with the standard ARM 32-bit instruction set.  With the more modern Thumb
instruction set, because we now effectively have prefixes, where those
prefixes control the execution of the following instructions, what you
suggest becomes no longer possible.

So, it's not a question of bringing stuff up to the future at all... you
can call it a design regression of you will, but you're really making
demands about how CPUs work which are outside of your remit.

Think of this a bit like you changing the opcodes immediately following a
'LOCK' prefix on x86.  I suspect divorsing the following opcodes from its
prefix would be very bad for the instructions atomicity.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 16:23           ` Russell King - ARM Linux
@ 2012-12-07 16:36             ` Steven Rostedt
  2012-12-07 16:45               ` Russell King - ARM Linux
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 16:36 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 16:23 +0000, Russell King - ARM Linux wrote:

> That's fine if there are better ways.  If your view is that this would
> bring things "up to the future" consider this: what you suggest is possible
> with the standard ARM 32-bit instruction set.  With the more modern Thumb
> instruction set, because we now effectively have prefixes, where those
> prefixes control the execution of the following instructions, what you
> suggest becomes no longer possible.
> 
> So, it's not a question of bringing stuff up to the future at all... you
> can call it a design regression of you will, but you're really making
> demands about how CPUs work which are outside of your remit.
> 
> Think of this a bit like you changing the opcodes immediately following a
> 'LOCK' prefix on x86.  I suspect divorsing the following opcodes from its
> prefix would be very bad for the instructions atomicity.

But what about the limitations that the function tracer imposes on the
code that gets modified by stop_machine()?

1) the original code is simply a call to mcount

2) on boot up, that call gets converted into a nop

3) the code that gets changed will only be converting a nop to a call
into the function tracer, and back again.

IOW, it's a very limited subset of the ARM assembly that gets touched.
I'm not sure what the op codes are for the above, but I can imagine they
don't impose the prefixes as you described.

If that's the case, is it still possible to change to the breakpoint
method?

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 16:36             ` Steven Rostedt
@ 2012-12-07 16:45               ` Russell King - ARM Linux
  2012-12-07 17:13                 ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-07 16:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 16:23 +0000, Russell King - ARM Linux wrote:
> 
> > That's fine if there are better ways.  If your view is that this would
> > bring things "up to the future" consider this: what you suggest is possible
> > with the standard ARM 32-bit instruction set.  With the more modern Thumb
> > instruction set, because we now effectively have prefixes, where those
> > prefixes control the execution of the following instructions, what you
> > suggest becomes no longer possible.
> > 
> > So, it's not a question of bringing stuff up to the future at all... you
> > can call it a design regression of you will, but you're really making
> > demands about how CPUs work which are outside of your remit.
> > 
> > Think of this a bit like you changing the opcodes immediately following a
> > 'LOCK' prefix on x86.  I suspect divorsing the following opcodes from its
> > prefix would be very bad for the instructions atomicity.
> 
> But what about the limitations that the function tracer imposes on the
> code that gets modified by stop_machine()?
> 
> 1) the original code is simply a call to mcount
> 
> 2) on boot up, that call gets converted into a nop
> 
> 3) the code that gets changed will only be converting a nop to a call
> into the function tracer, and back again.
> 
> IOW, it's a very limited subset of the ARM assembly that gets touched.
> I'm not sure what the op codes are for the above, but I can imagine they
> don't impose the prefixes as you described.
> 
> If that's the case, is it still possible to change to the breakpoint
> method?

I have no idea; I've no idea how ftrace works on ARM.  That's something
other people use and deal with.  Last (and only) time I used the built-in
kernel tracing facilities I ended up giving up with it and going back to
using my sched-clock+record+printk based approaches instead on account
of the kernels built-in tracing being far too heavy.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 16:45               ` Russell King - ARM Linux
@ 2012-12-07 17:13                 ` Steven Rostedt
  2012-12-07 17:45                   ` Jon Medhurst (Tixy)
  2012-12-07 18:13                   ` Russell King - ARM Linux
  0 siblings, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 17:13 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 16:45 +0000, Russell King - ARM Linux wrote:
> On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote:

> > But what about the limitations that the function tracer imposes on the
> > code that gets modified by stop_machine()?
> > 
> > 1) the original code is simply a call to mcount
> > 
> > 2) on boot up, that call gets converted into a nop
> > 
> > 3) the code that gets changed will only be converting a nop to a call
> > into the function tracer, and back again.
> > 
> > IOW, it's a very limited subset of the ARM assembly that gets touched.
> > I'm not sure what the op codes are for the above, but I can imagine they
> > don't impose the prefixes as you described.
> > 
> > If that's the case, is it still possible to change to the breakpoint
> > method?
> 
> I have no idea; I've no idea how ftrace works on ARM.

I know how ftrace works on ARM, I'm just asking about the way the
architecture works in general. So to answer my question, you don't need
to know anything about ftrace. I'll make my question more general:

If I have a nop, that is a size of a call (branch and link), which is
near the beginning of a function and not part of any conditional, and I
want to convert it into a call (branch and link), would adding a
breakpoint to it, modifying it to the call, and then removing the
breakpoint be possible? Of course it would require syncing in between
steps, but my question is, if the above is possible on a thumb2 ARM
processor?

>   That's something
> other people use and deal with.  Last (and only) time I used the built-in
> kernel tracing facilities I ended up giving up with it and going back to
> using my sched-clock+record+printk based approaches instead on account
> of the kernels built-in tracing being far too heavy.

Too bad. Which tracing facilities did you use? Function tracing? And
IIRC, ARM originally had only the static tracing, which was extremely
heavy weight. Have you tried tracepoints? Also, have you tried my
favorite way of debugging: trace_printk(). It acts just like printk but
instead of recording to the console, it records into the ftrace buffer
which can be read via the /debug/tracing/trace or dumped to the console
with a sysrq-z.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 17:13                 ` Steven Rostedt
@ 2012-12-07 17:45                   ` Jon Medhurst (Tixy)
  2012-12-07 18:06                     ` Steven Rostedt
  2012-12-10 10:04                     ` Will Deacon
  2012-12-07 18:13                   ` Russell King - ARM Linux
  1 sibling, 2 replies; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-07 17:45 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Russell King - ARM Linux, linux-arm-kernel, Ingo Molnar,
	Frederic Weisbecker, Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> I'll make my question more general:
> 
> If I have a nop, that is a size of a call (branch and link), which is
> near the beginning of a function and not part of any conditional, and I
> want to convert it into a call (branch and link), would adding a
> breakpoint to it, modifying it to the call, and then removing the
> breakpoint be possible? Of course it would require syncing in between
> steps, but my question is, if the above is possible on a thumb2 ARM
> processor?

I believe so. The details are (repeating your earlier explanation) ...

1. Replace first half of nop with 16bit 'breakpoint' instruction.

2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the
icache for changed part of the nop instruction).

3. Replace second half of nop with second half of the call instruction.

4. Sync.

5. Replace the breakpoint with the first half of the call instruction.

6. Sync

And if any core execute the breakpoint instruction, then the handler
ensures execution continues at the instruction after the nop were trying
to replace.

However, wouldn't we need any of this breakpoint malarkey, why not just
just use a 16-bit branch instruction which branches over the second half
of the nop? :-)

-- 
Tixy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 17:45                   ` Jon Medhurst (Tixy)
@ 2012-12-07 18:06                     ` Steven Rostedt
  2012-12-07 18:17                       ` Steven Rostedt
  2012-12-07 18:18                       ` Jon Medhurst (Tixy)
  2012-12-10 10:04                     ` Will Deacon
  1 sibling, 2 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 18:06 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: Russell King - ARM Linux, linux-arm-kernel, Ingo Molnar,
	Frederic Weisbecker, Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 17:45 +0000, Jon Medhurst (Tixy) wrote:
> On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> > I'll make my question more general:
> > 
> > If I have a nop, that is a size of a call (branch and link), which is
> > near the beginning of a function and not part of any conditional, and I
> > want to convert it into a call (branch and link), would adding a
> > breakpoint to it, modifying it to the call, and then removing the
> > breakpoint be possible? Of course it would require syncing in between
> > steps, but my question is, if the above is possible on a thumb2 ARM
> > processor?
> 
> I believe so. The details are (repeating your earlier explanation) ...
> 
> 1. Replace first half of nop with 16bit 'breakpoint' instruction.
> 
> 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the
> icache for changed part of the nop instruction).
> 
> 3. Replace second half of nop with second half of the call instruction.
> 
> 4. Sync.
> 
> 5. Replace the breakpoint with the first half of the call instruction.
> 
> 6. Sync
> 
> And if any core execute the breakpoint instruction, then the handler
> ensures execution continues at the instruction after the nop were trying
> to replace.

Exactly!

> 
> However, wouldn't we need any of this breakpoint malarkey, why not just
> just use a 16-bit branch instruction which branches over the second half
> of the nop? :-)

If you can get away with that, sure. Or better yet. If the arch supports
it, you can do what I did with powerpc. That was just replace the nop
with the 32bit branch, and the 32bit branch with a 32bit nop. No nops.
No multiple steps in between. I just did the swap of all function
tracepoints in one fell swoop, and then did the icache sync.

Now that's if the arch doesn't have issues with swapping code like this.
Can a 32bit branch-and-link be spread across cache lines? On x86 the
call is 5 bytes and can be. Thus, we were forced to do the breakpoint
because we don't know how the instructions are laid out on the cache
lines.

If 32bit can't be swapped but 16bit never crosses cache lines, then your
approach may also work.

-- Steve




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 17:13                 ` Steven Rostedt
  2012-12-07 17:45                   ` Jon Medhurst (Tixy)
@ 2012-12-07 18:13                   ` Russell King - ARM Linux
  2012-12-07 18:43                     ` Steven Rostedt
  1 sibling, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-07 18:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel

On Fri, Dec 07, 2012 at 12:13:56PM -0500, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 16:45 +0000, Russell King - ARM Linux wrote:
> > On Fri, Dec 07, 2012 at 11:36:40AM -0500, Steven Rostedt wrote:
> 
> > > But what about the limitations that the function tracer imposes on the
> > > code that gets modified by stop_machine()?
> > > 
> > > 1) the original code is simply a call to mcount
> > > 
> > > 2) on boot up, that call gets converted into a nop
> > > 
> > > 3) the code that gets changed will only be converting a nop to a call
> > > into the function tracer, and back again.
> > > 
> > > IOW, it's a very limited subset of the ARM assembly that gets touched.
> > > I'm not sure what the op codes are for the above, but I can imagine they
> > > don't impose the prefixes as you described.
> > > 
> > > If that's the case, is it still possible to change to the breakpoint
> > > method?
> > 
> > I have no idea; I've no idea how ftrace works on ARM.
> 
> I know how ftrace works on ARM, I'm just asking about the way the
> architecture works in general. So to answer my question, you don't need
> to know anything about ftrace. I'll make my question more general:
> 
> If I have a nop, that is a size of a call (branch and link), which is
> near the beginning of a function and not part of any conditional, and I
> want to convert it into a call (branch and link), would adding a
> breakpoint to it, modifying it to the call, and then removing the
> breakpoint be possible? Of course it would require syncing in between
> steps, but my question is, if the above is possible on a thumb2 ARM
> processor?

So, you're asking me to wave hands in the air, make guesses and hope that
I hit the situation you're knowledgable of without actually telling me
anything.  Great - you really know how to frustrate people...

If you're saying that the nop was created at _compile_ time, to be a 32-bit
instruction then maybe - but you have a problem.  That 32-bit instruction
may stradle a 32-bit boundary (worse if it stradles a page), and _any_
changes to that instruction will not be atomic - other CPUs will see the
store as two separate operations which, given the right timing may create
an illegal instruction.

Even changing it to a breakpoint is potentially problematical.  So we'd
need to ensure that no other CPU was executing the code while we modify
it.

Now, if you're going to say that ftrace inserts a 32-bit nop with
appropriate alignment constraints at _compile_ time, then maybe that would
work, but then your update to the instruction might as well just be NOP->BL
because that's a word-write to an aligned address which will be atomic (in
so far as either the entire instruction has been updated _or_ none of the
instruction has been updated.)

In a previous email you intimated that these NOPs are inserted by ftrace at
boot time.  Given that these NOPs would have to be 32-bit instructions, I'd
hope that they're also replacing 32-bit instructions and not two 16-bit
instructions which might be prefixed by a "if-then" instruction.

Maybe now you'll provide some information on how ftrace works as you should
now realise that your "simple question" doesn't have a simple answer.

> >   That's something
> > other people use and deal with.  Last (and only) time I used the built-in
> > kernel tracing facilities I ended up giving up with it and going back to
> > using my sched-clock+record+printk based approaches instead on account
> > of the kernels built-in tracing being far too heavy.
> 
> Too bad. Which tracing facilities did you use? Function tracing? And
> IIRC, ARM originally had only the static tracing, which was extremely
> heavy weight. Have you tried tracepoints? Also, have you tried my
> favorite way of debugging: trace_printk(). It acts just like printk but
> instead of recording to the console, it records into the ftrace buffer
> which can be read via the /debug/tracing/trace or dumped to the console
> with a sysrq-z.

TBH I don't remember, it was a few years ago that I last had to measure
stuff.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 18:06                     ` Steven Rostedt
@ 2012-12-07 18:17                       ` Steven Rostedt
  2012-12-07 18:18                       ` Jon Medhurst (Tixy)
  1 sibling, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 18:17 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: Russell King - ARM Linux, linux-arm-kernel, Ingo Molnar,
	Frederic Weisbecker, Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 13:06 -0500, Steven Rostedt wrote:

> If you can get away with that, sure. Or better yet. If the arch supports
> it, you can do what I did with powerpc. That was just replace the nop
> with the 32bit branch, and the 32bit branch with a 32bit nop. No nops.

s/No nops/No breakpoints/

> No multiple steps in between. I just did the swap of all function
> tracepoints in one fell swoop, and then did the icache sync.
> 

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 18:06                     ` Steven Rostedt
  2012-12-07 18:17                       ` Steven Rostedt
@ 2012-12-07 18:18                       ` Jon Medhurst (Tixy)
  1 sibling, 0 replies; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-07 18:18 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Russell King - ARM Linux, linux-arm-kernel, Ingo Molnar,
	Frederic Weisbecker, Rabin Vincent, linux-kernel

On Fri, 2012-12-07 at 13:06 -0500, Steven Rostedt wrote:
> > 
> > However, wouldn't we need any of this breakpoint malarkey, why not just
> > just use a 16-bit branch instruction which branches over the second half
> > of the nop? :-)
> 
> If you can get away with that, sure. Or better yet. If the arch supports
> it, you can do what I did with powerpc. That was just replace the nop
> with the 32bit branch, and the 32bit branch with a 32bit nop. No nops.
> No multiple steps in between. I just did the swap of all function
> tracepoints in one fell swoop, and then did the icache sync.

But that gets back to the original problem that the 32bit instructions
can straddle two cache lines.

> 
> Now that's if the arch doesn't have issues with swapping code like this.
> Can a 32bit branch-and-link be spread across cache lines?

Yes, Thumb2 instructions are either 16 or 32bit and are aligned to 16bit
boundaries. So 16bit instructions never cross cache lines and all 32bit
instructions can.

>  On x86 the
> call is 5 bytes and can be. Thus, we were forced to do the breakpoint
> because we don't know how the instructions are laid out on the cache
> lines.
> 
> If 32bit can't be swapped but 16bit never crosses cache lines, then your
> approach may also work.

It should.

-- 
Tixy


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 18:13                   ` Russell King - ARM Linux
@ 2012-12-07 18:43                     ` Steven Rostedt
  2012-12-07 19:02                       ` Will Deacon
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 18:43 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Jon Medhurst (Tixy),
	linux-arm-kernel, Ingo Molnar, Frederic Weisbecker,
	Rabin Vincent, linux-kernel, H. Peter Anvin

[ Added hpa, as he knows a bit about x86, and breakpoints ]

On Fri, 2012-12-07 at 18:13 +0000, Russell King - ARM Linux wrote:

> So, you're asking me to wave hands in the air, make guesses and hope that
> I hit the situation you're knowledgable of without actually telling me
> anything.  Great - you really know how to frustrate people...

Sorry, I thought I was telling you something. I guess we have a bit of a
disconnect. Jon did hit what I was trying to ask.

> 
> If you're saying that the nop was created at _compile_ time, to be a 32-bit

Actually the call is created at compile time. When compiled with -pg,
gcc will insert branch and link calls to a special function (actually
implemented in assembly) to "mcount". Looking at the arm implementation,
it seems that the branch and link is 32bits (just confirming). On boot
up, all calls to mcount are converted to 32bit nops.

> instruction then maybe - but you have a problem.  That 32-bit instruction
> may stradle a 32-bit boundary (worse if it stradles a page), and _any_
> changes to that instruction will not be atomic - other CPUs will see the
> store as two separate operations which, given the right timing may create
> an illegal instruction.

This is the same as x86.

> 
> Even changing it to a breakpoint is potentially problematical.  So we'd
> need to ensure that no other CPU was executing the code while we modify
> it.

This is not the same as x86, I guess because x86 has a one byte
breakpoint. Thus, it is stated in the x86 architecture (I believe,
Peter, you can correct me if I'm wrong), that the only "safe" thing that
can modify code, is a software breakpoint.

Are you saying that thumb does not guarantee even software breakpoints
from being added atomically? Doesn't that kill the purpose of a
breakpoint?

> 
> Now, if you're going to say that ftrace inserts a 32-bit nop with
> appropriate alignment constraints at _compile_ time, then maybe that would
> work, but then your update to the instruction might as well just be NOP->BL
> because that's a word-write to an aligned address which will be atomic (in
> so far as either the entire instruction has been updated _or_ none of the
> instruction has been updated.)

That's how it's done on powerpc.

> 
> In a previous email you intimated that these NOPs are inserted by ftrace at
> boot time.  Given that these NOPs would have to be 32-bit instructions, I'd
> hope that they're also replacing 32-bit instructions and not two 16-bit
> instructions which might be prefixed by a "if-then" instruction.

At compile time, it's a call to mcount, inserted by gcc, and at the
beginning of functions, thus they should never be prefixed by a
"if-then".

> 
> Maybe now you'll provide some information on how ftrace works as you should
> now realise that your "simple question" doesn't have a simple answer.

Maybe, I've described it above, but I'll repeat it in more detail here.

Ftrace uses the gcc -pg option that adds a call to 'mcount' to the
beginning of almost every function. It does not add mcount to inlined
functions, so these functions are truly functions and not inlined into
other functions. Also some functions are manually ignored in the kernel
when we annotate them with 'notrace'. Just to make things consistent, as
gcc doesn't always honor the 'inline' code, I've defined inline to also
contain 'notrace' as well.

The 'mcount' function has to be written in assembly. On arm, it's
implemented in arch/arm/kernel/entry-common.S. It looks like this:

ENTRY(mcount)
        stmdb   sp!, {lr}
        ldr     lr, [fp, #-4]
        ldmia   sp!, {pc}
ENDPROC(mcount)

Note, I removed the #ifdef/#else of CONFIG_DYNAMIC_FTRACE, because this
modification only happens with that config enabled. So I only showed
that version.

During compile time, the recordmcount.c code is run against all .o
files, and records the location of the mcount callers. It then creates a
table of those locations and links them back into the .o file.

On boot up (before SMP starts), this table is referenced, and all the
calls to mcount are converted into 32 bit nops. Before this conversion,
any code that hits the call will simply return back (as you can see by
the above mcount definition).

When we enable tracing, currently we use stop_machine(), all the nops in
functions to be traced are then converted to another call, but not to
mcount (as that just returns), but instead to 'ftrace_caller'. This
function is also written in assembly and it handles the tracing when
hit.

When tracing is disabled, the same thing happens but we convert the call
sites into nops.

Does this make more sense?

> 
> > >   That's something
> > > other people use and deal with.  Last (and only) time I used the built-in
> > > kernel tracing facilities I ended up giving up with it and going back to
> > > using my sched-clock+record+printk based approaches instead on account
> > > of the kernels built-in tracing being far too heavy.
> > 
> > Too bad. Which tracing facilities did you use? Function tracing? And
> > IIRC, ARM originally had only the static tracing, which was extremely
> > heavy weight. Have you tried tracepoints? Also, have you tried my
> > favorite way of debugging: trace_printk(). It acts just like printk but
> > instead of recording to the console, it records into the ftrace buffer
> > which can be read via the /debug/tracing/trace or dumped to the console
> > with a sysrq-z.
> 
> TBH I don't remember, it was a few years ago that I last had to measure
> stuff.

Yeah, things have improved since then :-)

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 18:43                     ` Steven Rostedt
@ 2012-12-07 19:02                       ` Will Deacon
  2012-12-07 20:01                         ` Steven Rostedt
  2012-12-10 11:04                         ` Jon Medhurst (Tixy)
  0 siblings, 2 replies; 35+ messages in thread
From: Will Deacon @ 2012-12-07 19:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Russell King - ARM Linux, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	H. Peter Anvin, linux-arm-kernel

Hi Steve,

On Fri, Dec 07, 2012 at 06:43:25PM +0000, Steven Rostedt wrote:
> On Fri, 2012-12-07 at 18:13 +0000, Russell King - ARM Linux wrote:
> > Even changing it to a breakpoint is potentially problematical.  So we'd
> > need to ensure that no other CPU was executing the code while we modify
> > it.
> 
> This is not the same as x86, I guess because x86 has a one byte
> breakpoint. Thus, it is stated in the x86 architecture (I believe,
> Peter, you can correct me if I'm wrong), that the only "safe" thing that
> can modify code, is a software breakpoint.
> 
> Are you saying that thumb does not guarantee even software breakpoints
> from being added atomically? Doesn't that kill the purpose of a
> breakpoint?

For ARMv7, there are small subsets of instructions for ARM and Thumb which
are guaranteed to be atomic wrt concurrent modification and execution of
the instruction stream between different processors:

Thumb:	The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
ARM:	The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.

but before your eyes light up at the presence of the BKPT instruction in
that list; we don't actually use that in Linux and instead leave it for
external (i.e. JTAG) debuggers so that they can operate without getting
tangled up with spurious traps from the OS. Linux actually picks its own
undefined instructions, which are obviously not included in the lists above.

Also note that the 16-bit limitation for Thumb instructions above can
actually be used to modify *half* of a BL instruction but, to keep things
exciting, the PC-relative immediate is split across the two halves. However,
you could in theory mess around with bottom 10 bits or so, depending on the
exact encoding...

Obviously this doesn't preclude the need for cache maintenance on both D and
I side, but the atomicity guarantees are as I've described above.

Will

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 19:02                       ` Will Deacon
@ 2012-12-07 20:01                         ` Steven Rostedt
  2012-12-10 11:04                         ` Jon Medhurst (Tixy)
  1 sibling, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-07 20:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: Russell King - ARM Linux, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	H. Peter Anvin, linux-arm-kernel

On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:

> For ARMv7, there are small subsets of instructions for ARM and Thumb which
> are guaranteed to be atomic wrt concurrent modification and execution of
> the instruction stream between different processors:
> 
> Thumb:	The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> ARM:	The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.
> 
> but before your eyes light up at the presence of the BKPT instruction in
> that list; we don't actually use that in Linux and instead leave it for
> external (i.e. JTAG) debuggers so that they can operate without getting
> tangled up with spurious traps from the OS. Linux actually picks its own
> undefined instructions, which are obviously not included in the lists above.

My eyes actually lit up with the B instruction :-)  As Jon showed, we
could use a 16bit jump instead. Add the B to jump over the other half of
the call (to all places that you want to modify). Send a sync to all
CPUs to flush their caches. Modify the other half of the call, send
another sync, and then modify the first half.

> 
> Also note that the 16-bit limitation for Thumb instructions above can
> actually be used to modify *half* of a BL instruction but, to keep things
> exciting, the PC-relative immediate is split across the two halves. However,
> you could in theory mess around with bottom 10 bits or so, depending on the
> exact encoding...
> 
> Obviously this doesn't preclude the need for cache maintenance on both D and
> I side, but the atomicity guarantees are as I've described above.

Right. Thanks for the update.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 17:45                   ` Jon Medhurst (Tixy)
  2012-12-07 18:06                     ` Steven Rostedt
@ 2012-12-10 10:04                     ` Will Deacon
  2012-12-10 13:02                       ` Steven Rostedt
  2012-12-10 16:45                       ` Jon Medhurst (Tixy)
  1 sibling, 2 replies; 35+ messages in thread
From: Will Deacon @ 2012-12-10 10:04 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: Steven Rostedt, Russell King - ARM Linux, Frederic Weisbecker,
	linux-kernel, Rabin Vincent, Ingo Molnar, linux-arm-kernel

Hi Jon,

Back-pedalling a bit here, but I'm confused by one of your points below:

On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote:
> On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> > I'll make my question more general:
> > 
> > If I have a nop, that is a size of a call (branch and link), which is
> > near the beginning of a function and not part of any conditional, and I
> > want to convert it into a call (branch and link), would adding a
> > breakpoint to it, modifying it to the call, and then removing the
> > breakpoint be possible? Of course it would require syncing in between
> > steps, but my question is, if the above is possible on a thumb2 ARM
> > processor?
> 
> I believe so. The details are (repeating your earlier explanation) ...
> 
> 1. Replace first half of nop with 16bit 'breakpoint' instruction.

Sort of -- you'd actually need 2x16-bit nops to make this work.

> 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the
> icache for changed part of the nop instruction).

Why do you need to use IPIs for I-cache invalidation on other cores? For
ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation
by MVA to PoU will be broadcast to the applicable domain for the
shareability attributes of the address. So if you do icimvau with an
inner-shareable virtual address, it will be broadcast by the hardware.

> However, wouldn't we need any of this breakpoint malarkey, why not just
> just use a 16-bit branch instruction which branches over the second half
> of the nop? :-)

Yes, and I think if you do use two 16-bit nops, you can even get rid of all
the intermediate `sync' operations (I guess you might want one at the end if
you want the call to become visible at a particular point).

Will

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-07 19:02                       ` Will Deacon
  2012-12-07 20:01                         ` Steven Rostedt
@ 2012-12-10 11:04                         ` Jon Medhurst (Tixy)
  2012-12-10 11:24                           ` Will Deacon
  1 sibling, 1 reply; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-10 11:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: Steven Rostedt, Russell King - ARM Linux, Frederic Weisbecker,
	linux-kernel, Rabin Vincent, Ingo Molnar, H. Peter Anvin,
	linux-arm-kernel

On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> For ARMv7, there are small subsets of instructions for ARM and Thumb which
> are guaranteed to be atomic wrt concurrent modification and execution of
> the instruction stream between different processors:
> 
> Thumb:	The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> ARM:	The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.
> 

So this means for things like kprobes which can modify arbitrary kernel
code we are going to need to continue to always use some form of
stop_the_whole_system() function?

Also, kprobes currently uses patch_text() which only uses stop_machine
for Thumb2 instructions which straddle a word boundary, so this needs
changing?

-- 
Tixy


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 11:04                         ` Jon Medhurst (Tixy)
@ 2012-12-10 11:24                           ` Will Deacon
  2012-12-10 14:02                             ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-12-10 11:24 UTC (permalink / raw)
  To: Jon Medhurst (Tixy)
  Cc: Steven Rostedt, Russell King - ARM Linux, Frederic Weisbecker,
	linux-kernel, Rabin Vincent, Ingo Molnar, H. Peter Anvin,
	linux-arm-kernel

On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote:
> On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> > For ARMv7, there are small subsets of instructions for ARM and Thumb which
> > are guaranteed to be atomic wrt concurrent modification and execution of
> > the instruction stream between different processors:
> > 
> > Thumb:	The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> > ARM:	The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.
> > 
> 
> So this means for things like kprobes which can modify arbitrary kernel
> code we are going to need to continue to always use some form of
> stop_the_whole_system() function?
> 
> Also, kprobes currently uses patch_text() which only uses stop_machine
> for Thumb2 instructions which straddle a word boundary, so this needs
> changing?

Yes; if you're modifying instructions other than those mentioned above, then
you'll need to synchronise the CPUs, update the instructions, perform
cache-maintenance on the writing CPU and then execute an isb on the
executing core (this last bit isn't needed if you're going to go through an
exception return to get back to the new code -- depends on how your
stop/resume code works).

For ftrace we can (hopefully) avoid a lot of this when we have known points
of modification.

Will

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 10:04                     ` Will Deacon
@ 2012-12-10 13:02                       ` Steven Rostedt
  2012-12-10 13:33                         ` Will Deacon
                                           ` (2 more replies)
  2012-12-10 16:45                       ` Jon Medhurst (Tixy)
  1 sibling, 3 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-10 13:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jon Medhurst (Tixy),
	Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Rabin Vincent, Ingo Molnar, linux-arm-kernel

On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> Hi Jon,
> 
> Back-pedalling a bit here, but I'm confused by one of your points below:
> 
> On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote:
> > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> > > I'll make my question more general:
> > > 
> > > If I have a nop, that is a size of a call (branch and link), which is
> > > near the beginning of a function and not part of any conditional, and I
> > > want to convert it into a call (branch and link), would adding a
> > > breakpoint to it, modifying it to the call, and then removing the
> > > breakpoint be possible? Of course it would require syncing in between
> > > steps, but my question is, if the above is possible on a thumb2 ARM
> > > processor?
> > 
> > I believe so. The details are (repeating your earlier explanation) ...
> > 
> > 1. Replace first half of nop with 16bit 'breakpoint' instruction.
> 
> Sort of -- you'd actually need 2x16-bit nops to make this work.

Why?

> 
> > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the
> > icache for changed part of the nop instruction).
> 
> Why do you need to use IPIs for I-cache invalidation on other cores? For
> ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation
> by MVA to PoU will be broadcast to the applicable domain for the
> shareability attributes of the address. So if you do icimvau with an
> inner-shareable virtual address, it will be broadcast by the hardware.
> 
> > However, wouldn't we need any of this breakpoint malarkey, why not just
> > just use a 16-bit branch instruction which branches over the second half
> > of the nop? :-)
> 
> Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> the intermediate `sync' operations (I guess you might want one at the end if
> you want the call to become visible at a particular point).

Wont work. We are replacing a 32bit call with a nop. That nop must also
be 32bits, because we could eventually replace the nop(s) with a 32bit
call. Basically, we can never allow the second 16bit part ever be the
next instruction. If the first 16bit nop is executed, and then the task
gets preempted. The nops get converted to a 32bit call. The task gets
scheduled again and now is executing the second 16bits of the 32bit call
and we get unexpected (probably crashing) results.

By having either a 16bit breakpoint whose handler returns after the
second 16bit part, or a 16bit jump that simply jumps over the second
half, then all this should work. When the CPU processes a 32bit
instruction, it either processes all or non of it, correct?

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 13:02                       ` Steven Rostedt
@ 2012-12-10 13:33                         ` Will Deacon
  2012-12-10 13:40                         ` Jamie Lokier
  2012-12-10 13:57                         ` Russell King - ARM Linux
  2 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2012-12-10 13:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jon Medhurst (Tixy),
	Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Rabin Vincent, Ingo Molnar, linux-arm-kernel

On Mon, Dec 10, 2012 at 01:02:17PM +0000, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> > Hi Jon,
> > 
> > Back-pedalling a bit here, but I'm confused by one of your points below:
> > 
> > On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote:
> > > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> > > > I'll make my question more general:
> > > > 
> > > > If I have a nop, that is a size of a call (branch and link), which is
> > > > near the beginning of a function and not part of any conditional, and I
> > > > want to convert it into a call (branch and link), would adding a
> > > > breakpoint to it, modifying it to the call, and then removing the
> > > > breakpoint be possible? Of course it would require syncing in between
> > > > steps, but my question is, if the above is possible on a thumb2 ARM
> > > > processor?
> > > 
> > > I believe so. The details are (repeating your earlier explanation) ...
> > > 
> > > 1. Replace first half of nop with 16bit 'breakpoint' instruction.
> > 
> > Sort of -- you'd actually need 2x16-bit nops to make this work.
> 
> Why?

Because the architecture doesn't provide any guarantees about concurrent
modification of 32-bit nop instructions. If you stop the world every time,
fine, but that's what we're trying to avoid, right?

> > > However, wouldn't we need any of this breakpoint malarkey, why not just
> > > just use a 16-bit branch instruction which branches over the second half
> > > of the nop? :-)
> > 
> > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > the intermediate `sync' operations (I guess you might want one at the end if
> > you want the call to become visible at a particular point).
> 
> Wont work. We are replacing a 32bit call with a nop. That nop must also
> be 32bits, because we could eventually replace the nop(s) with a 32bit
> call. Basically, we can never allow the second 16bit part ever be the
> next instruction. If the first 16bit nop is executed, and then the task
> gets preempted. The nops get converted to a 32bit call. The task gets
> scheduled again and now is executing the second 16bits of the 32bit call
> and we get unexpected (probably crashing) results.

Damn, I didn't realise you wanted to put the 32-bit call back on
pre-emption. Still, the `sync' is not needed when patching in a b for a nop.

> By having either a 16bit breakpoint whose handler returns after the
> second 16bit part, or a 16bit jump that simply jumps over the second
> half, then all this should work. When the CPU processes a 32bit
> instruction, it either processes all or non of it, correct?

If you have two 16-bit nops, patching the first to branch over the second
will work.

Will

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 13:02                       ` Steven Rostedt
  2012-12-10 13:33                         ` Will Deacon
@ 2012-12-10 13:40                         ` Jamie Lokier
  2012-12-10 14:56                           ` Will Deacon
  2012-12-10 13:57                         ` Russell King - ARM Linux
  2 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2012-12-10 13:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Rabin Vincent, Ingo Molnar, linux-arm-kernel

Steven Rostedt wrote:
> > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > the intermediate `sync' operations (I guess you might want one at the end if
> > you want the call to become visible at a particular point).
> 
> Wont work. We are replacing a 32bit call with a nop. That nop must also
> be 32bits, because we could eventually replace the nop(s) with a 32bit
> call. Basically, we can never allow the second 16bit part ever be the
> next instruction. If the first 16bit nop is executed, and then the task
> gets preempted. The nops get converted to a 32bit call. The task gets
> scheduled again and now is executing the second 16bits of the 32bit call
> and we get unexpected (probably crashing) results.
> 
> By having either a 16bit breakpoint whose handler returns after the
> second 16bit part, or a 16bit jump that simply jumps over the second
> half, then all this should work. When the CPU processes a 32bit
> instruction, it either processes all or non of it, correct?

Sounds good, except what Will wrote a few days ago:

On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> For ARMv7, there are small subsets of instructions for ARM and Thumb which
> are guaranteed to be atomic wrt concurrent modification and execution of
> the instruction stream between different processors:
>
> Thumb:      The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> ARM:        The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.

Thumb 32-bit ftrace call isn't in the above list.

Questions: does the above concurrent modification guarantee require
both the old instruction _and_ the new one to be among those listed,
or is it enough to be just the new one (for example when setting a
normal software breakpoint, that would be useful)?  Can it be the old
one and not the new (for example when removing a software breakpoint,
that would be useful)?  Does that subset mean replacing any of the
listed instructions by any of the others is ok, or any of the listed
with another of the same type?

(I guess as a matter of architecture design, it makes sense to
guarantee only a short list, because of occasions when the hardware,
or a software emulation through traps, or a simulation, might read the
instruction memory more than once.)

This is what makes me wonder, if it's safe to replace the 32-bit
mcount call with a 16-bit short jump:

> On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote:
> > So this means for things like kprobes which can modify arbitrary kernel
> > code we are going to need to continue to always use some form of
> > stop_the_whole_system() function?
> >
> > Also, kprobes currently uses patch_text() which only uses stop_machine
> > for Thumb2 instructions which straddle a word boundary, so this needs
> > changing?

Will Deacon replied:
> Yes; if you're modifying instructions other than those mentioned above, then
> you'll need to synchronise the CPUs, update the instructions, perform
> cache-maintenance on the writing CPU and then execute an isb on the
> executing core (this last bit isn't needed if you're going to go through an
> exception return to get back to the new code -- depends on how your
> stop/resume code works).

If I've understood that exchange, it implies that using patch_text()
to replace an instruction not in the list of special ones, with a trap
or jump, isn't ok?  And so it's ok to replace the NOP with a short
branch (since 16-bit "B" is in the list), but it's not ok to replace
16-bit "B" with the 32-bit ftrace call; and the same going the other way?

Best,
-- Jamie

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 13:02                       ` Steven Rostedt
  2012-12-10 13:33                         ` Will Deacon
  2012-12-10 13:40                         ` Jamie Lokier
@ 2012-12-10 13:57                         ` Russell King - ARM Linux
  2012-12-10 14:06                           ` Steven Rostedt
  2 siblings, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-10 13:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > the intermediate `sync' operations (I guess you might want one at the end if
> > you want the call to become visible at a particular point).
> 
> Wont work. We are replacing a 32bit call with a nop. That nop must also
> be 32bits, because we could eventually replace the nop(s) with a 32bit
> call.

... which, if it's misaligned to a 32-bit boundary, which can happen with
Thumb-2 code, will require the replacement to be done atomically; you will
need to use stop_machine() to ensure that other CPUs don't try to execute
the instruction mid-way through modification... as I have already
explained in my previous mails.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 11:24                           ` Will Deacon
@ 2012-12-10 14:02                             ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-10 14:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jon Medhurst (Tixy),
	Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Rabin Vincent, Ingo Molnar, H. Peter Anvin, linux-arm-kernel

On Mon, 2012-12-10 at 11:24 +0000, Will Deacon wrote:
> On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote:
> > On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> > > For ARMv7, there are small subsets of instructions for ARM and Thumb which
> > > are guaranteed to be atomic wrt concurrent modification and execution of
> > > the instruction stream between different processors:
> > > 
> > > Thumb:	The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> > > ARM:	The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.
> > > 
> > 
> > So this means for things like kprobes which can modify arbitrary kernel
> > code we are going to need to continue to always use some form of
> > stop_the_whole_system() function?
> > 
> > Also, kprobes currently uses patch_text() which only uses stop_machine
> > for Thumb2 instructions which straddle a word boundary, so this needs
> > changing?
> 
> Yes; if you're modifying instructions other than those mentioned above, then
> you'll need to synchronise the CPUs, update the instructions, perform
> cache-maintenance on the writing CPU and then execute an isb on the
> executing core (this last bit isn't needed if you're going to go through an
> exception return to get back to the new code -- depends on how your
> stop/resume code works).

Yeah, kprobe optimizing will probably require stop_machine() always, as
it's modifying random code, or adding breakpoints into random places.
That's another adventure to deal with at another time.

> 
> For ftrace we can (hopefully) avoid a lot of this when we have known points
> of modification.

I'm also thinking about tracepoints which behave almost the same as
ftrace. They have nop place holders too. They happen to be 32bits too,
but may only need to be 16 bit. The way tracepoints work is with the use
of asm goto. For example we have:

arch/arm/include/asm/jump_label.h

#ifdef CONFIG_THUMB2_KERNEL
#define JUMP_LABEL_NOP	"nop.w"
#else
#define JUMP_LABEL_NOP	"nop"
#endif

static __always_inline bool arch_static_branch(struct static_key *key)
{
	asm goto("1:\n\t"
		 JUMP_LABEL_NOP "\n\t"
		 ".pushsection __jump_table,  \"aw\"\n\t"
		 ".word 1b, %l[l_yes], %c0\n\t"
		 ".popsection\n\t"
		 : :  "i" (key) :  : l_yes);

	return false;
l_yes:
	return true;
}

Tracepoints use the jump-label "static branch" logic, which uses a gcc
4.6 feature called asm goto. The asm goto allows the internal asm to
reference a label outside the asm stamement and the compiler is aware
that the asm statement may jump to that label. Thus the compiler treats
that asm statement as a possible branch to the given label and it wont
optimize required statements after the asm, if they are needed for the
jump to the label.

Now in include/linux/tracepoint.h we have:

	static inline void trace_##name(proto)				\
	{								\
		if (static_key_false(&__tracepoint_##name.key))		\
			__DO_TRACE(&__tracepoint_##name,		\
				TP_PROTO(data_proto),			\
				TP_ARGS(data_args),			\
				TP_CONDITION(cond),,);			\
	}								\

Where the static_key_false() is an "unlikely" version of the
static_branch() that tells gcc the result of the if statement goes into
the unlikely location (end of function perhaps).

But this doesn't guarantee that it becomes part of some if statement, so
this doesn't have all the limitations that ftrace mcount call has.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 13:57                         ` Russell King - ARM Linux
@ 2012-12-10 14:06                           ` Steven Rostedt
  2012-12-10 14:07                             ` Russell King - ARM Linux
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-10 14:06 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, 2012-12-10 at 13:57 +0000, Russell King - ARM Linux wrote:
> On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote:
> > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > > the intermediate `sync' operations (I guess you might want one at the end if
> > > you want the call to become visible at a particular point).
> > 
> > Wont work. We are replacing a 32bit call with a nop. That nop must also
> > be 32bits, because we could eventually replace the nop(s) with a 32bit
> > call.
> 
> ... which, if it's misaligned to a 32-bit boundary, which can happen with
> Thumb-2 code, will require the replacement to be done atomically; you will
> need to use stop_machine() to ensure that other CPUs don't try to execute
> the instruction mid-way through modification... as I have already
> explained in my previous mails.

If there's no way to modify a 32bit operation without stop_machine(),
ever with a breakpoint, than we can stop the discussion here. ARM will
forever require stop_machine() for use with tracepoints and ftrace. Too
bad, as ARM was the x86 competitor. Here's something that x86 has a one
up on ARM.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 14:06                           ` Steven Rostedt
@ 2012-12-10 14:07                             ` Russell King - ARM Linux
  2012-12-10 14:46                               ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-10 14:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, Dec 10, 2012 at 09:06:05AM -0500, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 13:57 +0000, Russell King - ARM Linux wrote:
> > On Mon, Dec 10, 2012 at 08:02:17AM -0500, Steven Rostedt wrote:
> > > On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> > > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > > > the intermediate `sync' operations (I guess you might want one at the end if
> > > > you want the call to become visible at a particular point).
> > > 
> > > Wont work. We are replacing a 32bit call with a nop. That nop must also
> > > be 32bits, because we could eventually replace the nop(s) with a 32bit
> > > call.
> > 
> > ... which, if it's misaligned to a 32-bit boundary, which can happen with
> > Thumb-2 code, will require the replacement to be done atomically; you will
> > need to use stop_machine() to ensure that other CPUs don't try to execute
> > the instruction mid-way through modification... as I have already
> > explained in my previous mails.
> 
> If there's no way to modify a 32bit operation without stop_machine(),
> ever with a breakpoint, than we can stop the discussion here. ARM will
> forever require stop_machine() for use with tracepoints and ftrace. Too
> bad, as ARM was the x86 competitor. Here's something that x86 has a one
> up on ARM.

You think that kind of blackmail makes a difference?  Look closely at what
I've written - I didn't say that there's no way to modify any 32-bit
operation without stop_machine().

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 14:07                             ` Russell King - ARM Linux
@ 2012-12-10 14:46                               ` Steven Rostedt
  2012-12-10 15:25                                 ` Russell King - ARM Linux
  0 siblings, 1 reply; 35+ messages in thread
From: Steven Rostedt @ 2012-12-10 14:46 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, 2012-12-10 at 14:07 +0000, Russell King - ARM Linux wrote:
>  
> > If there's no way to modify a 32bit operation without stop_machine(),
> > ever with a breakpoint, than we can stop the discussion here. ARM will
> > forever require stop_machine() for use with tracepoints and ftrace. Too
> > bad, as ARM was the x86 competitor. Here's something that x86 has a one
> > up on ARM.
> 
> You think that kind of blackmail makes a difference?

I'm not trying to blackmail. How does one blackmail to change facts? I
was just showing my disappointment, as I'm starting to grow fond of ARM.
Although I wouldn't mind sending out some blackmail to the ARM HW
vendors (got any pictures of them cheating on their spouses?) to make
them come up with a standardize way to make all this easy :-)

>   Look closely at what
> I've written - I didn't say that there's no way to modify any 32-bit
> operation without stop_machine().

Again, you and I are having a disconnect. I'm not a HW expert. I'm
trying to get a total understanding of what you, Will, Jon and others
are trying to say.

Lets look at what you last said:

> ... which, if it's misaligned to a 32-bit boundary, which can happen with
> Thumb-2 code, will require the replacement to be done atomically; you will
> need to use stop_machine() to ensure that other CPUs don't try to execute
> the instruction mid-way through modification... as I have already
> explained in my previous mails.

I'm confused to what is wrong to "misaligned to a 32-bit boundery".
Isn't it best if it is on a 32-bit boundary? Or do you mean that it's
misaligned across a 32-bit boundary? I guess I just read it wrong.

Either way, I said there's probably no guarantee that the 32-bit calls
to mcount that gcc has inserted (or the tracepoints) are going to be
aligned to 32-bit boundaries. But I'm wondering if that's still a
problem. Let's look at the ways another CPU could get the 32-bit
instruction if it is misaligned, and across two different cache lines,
or even two different pages:

1) the CPU gets the full 32bits as it was on the other CPU, or how it
will be.

2) The CPU gets the first 16bits as it was on the other CPU an the
second 16bits with the update.

3) The CPU gets the first 16bits with the update and the second 16bits
as it use to be.

The first case isn't interesting, so lets jump to the 2 and 3rd cases.

On an update of a 32bit nop to a 16bit breakpoint or branch (jump over
second half).

As we only modify the first half of the 32bit operation, and the second
half still contains the second half of the 32bit nop, the other CPU will
see:

<first 16bits of 32bit nop><second 16bits of 32bit nop>

That doesn't look so bad. Or:

<breakpoint/branch><second 16bits of 32bit nop>

If it sees the breakpoint or branch, then it should just jump over the
second part of the nop, and not execute it. If this is an issue, then we
need stop_machine() otherwise, I'll continue.

Now before changing the second half of the 32bits, we send a sync (IPI
maybe) to all other CPUs, so that we now guarantee that all CPUs sees
the added breakpoint/branch.

As the breakpoint and branch will return pass the second half of the
nop, modifying it shouldn't bother the other CPUs. If it does, then we
have to stay with stop_machine(). If not, let me continue.

We need to send another sync to guarantee that all the other CPUs see
the second half of the jump to ftrace code.

Now the second half matches the second half of the jump to the ftrace
code, and the first half is still the breakpoint/branch. So lets change
that to the first half of the jump to the ftrace code. Now the other
CPUs will see:

<breakpoint/branch><second 16bits of jump to ftrace>

The above should still skip the second half. Or:

<first 16bits of jump to ftrace><second 16bits of jump to ftrace>

The above is what we want CPUs to see.

If my assumptions about the changes above can't be done, then yes we are
stuck with stop_machine(). If they can be done, then we have hope to
remove it.

-- Steve

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 13:40                         ` Jamie Lokier
@ 2012-12-10 14:56                           ` Will Deacon
  0 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2012-12-10 14:56 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Steven Rostedt, Jon Medhurst (Tixy),
	Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Rabin Vincent, Ingo Molnar, linux-arm-kernel

Hi Jamie,

Thanks for summarising the thread so far.

On Mon, Dec 10, 2012 at 01:40:01PM +0000, Jamie Lokier wrote:
> On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> > For ARMv7, there are small subsets of instructions for ARM and Thumb which
> > are guaranteed to be atomic wrt concurrent modification and execution of
> > the instruction stream between different processors:
> >
> > Thumb:      The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> > ARM:        The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.
> 
> Thumb 32-bit ftrace call isn't in the above list.
> 
> Questions: does the above concurrent modification guarantee require
> both the old instruction _and_ the new one to be among those listed,
> or is it enough to be just the new one (for example when setting a
> normal software breakpoint, that would be useful)?  Can it be the old
> one and not the new (for example when removing a software breakpoint,
> that would be useful)?  Does that subset mean replacing any of the
> listed instructions by any of the others is ok, or any of the listed
> with another of the same type?

Yes, the target instruction also has to be listed. However,  I only described
the simple cases above... there are a number of exceptions when it comes to
32-bit Thumb-2 encodings of BL (I hinted at one of them before):

	- The two halfwords of a 32-bit BL instruction can each be replaced
	  with the relevant halfword from another BL instruction. This
	  basically means you can change the immediate fields, as I
	  mentioned earlier.

	- The most-significant halfword of a 32-bit BL instruction can be
	  replaced with B, BKPT or SVC (i.e. not NOP).

	- A 16-bit B, BKPT, or SVC instruction can be replaced with the
	  most-significant halfword of a BL instruction.

The latter points mean we can effectively nop out bl instructions, and put
them back again.

> This is what makes me wonder, if it's safe to replace the 32-bit
> mcount call with a 16-bit short jump:
> 
> > On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote:
> > > So this means for things like kprobes which can modify arbitrary kernel
> > > code we are going to need to continue to always use some form of
> > > stop_the_whole_system() function?
> > >
> > > Also, kprobes currently uses patch_text() which only uses stop_machine
> > > for Thumb2 instructions which straddle a word boundary, so this needs
> > > changing?
> 
> Will Deacon replied:
> > Yes; if you're modifying instructions other than those mentioned above, then
> > you'll need to synchronise the CPUs, update the instructions, perform
> > cache-maintenance on the writing CPU and then execute an isb on the
> > executing core (this last bit isn't needed if you're going to go through an
> > exception return to get back to the new code -- depends on how your
> > stop/resume code works).
> 
> If I've understood that exchange, it implies that using patch_text()
> to replace an instruction not in the list of special ones, with a trap
> or jump, isn't ok?  And so it's ok to replace the NOP with a short
> branch (since 16-bit "B" is in the list), but it's not ok to replace
> 16-bit "B" with the 32-bit ftrace call; and the same going the other way?

Well, these restrictions only apply if you want to avoid synchronising the
CPUs. Hopefully my explanation above answers your questions!

Will

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 14:46                               ` Steven Rostedt
@ 2012-12-10 15:25                                 ` Russell King - ARM Linux
  2012-12-10 16:31                                   ` Steven Rostedt
  0 siblings, 1 reply; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-12-10 15:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, Dec 10, 2012 at 09:46:41AM -0500, Steven Rostedt wrote:
> Again, you and I are having a disconnect. I'm not a HW expert. I'm
> trying to get a total understanding of what you, Will, Jon and others
> are trying to say.

Well, there's people who think that you're intentionally trying to wind
me up (I'm not alone in this opinion; believe me, I checked with someone
else taking part in this thread and they said as much...)

> > ... which, if it's misaligned to a 32-bit boundary, which can happen with
> > Thumb-2 code, will require the replacement to be done atomically; you will
> > need to use stop_machine() to ensure that other CPUs don't try to execute
> > the instruction mid-way through modification... as I have already
> > explained in my previous mails.
> 
> I'm confused to what is wrong to "misaligned to a 32-bit boundery".
> Isn't it best if it is on a 32-bit boundary? Or do you mean that it's
> misaligned across a 32-bit boundary? I guess I just read it wrong.

What I mean is a store of 32-bit size to an address which is not
numerically an integer multiple of four.

To see why this is a problem, take a moment to think about how you'd
update a misaligned 32-bit value on a 32-bit bus with byte enables.
You need to do it as two transactions.

If your bus is 64-bits wide, then the problem potentially becomes one
where there's an issue if it crosses a 64-bit boundary.  Continue for
larger bus widths...

Now add in the effect of caching with its cache line boundaries, and
what the effects are if a write crosses the cache line boundary (which
means it ends up with two separate validity bits etc.)

Lastly, remember that ARM CPUs have a Harvard cache architecture; that
means that the data paths are entirely separate from the instruction
paths - and in some cases that goes all the way to the memory controller,
but that's not relevant.  The relevant point here is that the point in
the pathways where the instruction and data paths unite can be quite
some distance _outside_ of the CPU.

What this all means is that a misaligned 32-bit store can ultimately
appear as two separate 16-bit stores, which may be interleaved by
other bus activity.  Whether that is visible to other CPUs in a SMP
system as two separate 16-bit stores or not isn't well defined.

x86 in this regard is beautiful; it's fully coherent with everything.
It enforces correctness for almost every situation.  It manages this
by using a hell of a lot of logic to do interlocking and ensure
correct ordering.  If you want that from an ARM CPU then you'd probably
need a comparible amount of logic - and power - to be able to do that.

> Either way, I said there's probably no guarantee that the 32-bit calls
> to mcount that gcc has inserted (or the tracepoints) are going to be
> aligned to 32-bit boundaries.

Correct; there is no guarantee of that what so ever when building for
Thumb-2.

> But I'm wondering if that's still a
> problem. Let's look at the ways another CPU could get the 32-bit
> instruction if it is misaligned, and across two different cache lines,
> or even two different pages:
> 
> 
> 1) the CPU gets the full 32bits as it was on the other CPU, or how it
> will be.
> 
> 2) The CPU gets the first 16bits as it was on the other CPU an the
> second 16bits with the update.
> 
> 3) The CPU gets the first 16bits with the update and the second 16bits
> as it use to be.
> 
> 
> The first case isn't interesting, so lets jump to the 2 and 3rd cases.
> 
> On an update of a 32bit nop to a 16bit breakpoint or branch (jump over
> second half).

Err.  Let me remind you what you said in the message which I replied to
earlier today:

   We are replacing a 32bit call with a nop. That nop must also         
                      ^^^^^
   be 32bits, because we could eventually replace the nop(s) with a 32bit
      ^^^^^^          
   call.

Maybe that's sloppy language, but I tend to read what's written and
interpret it as written... so to now say about 16-bit breakpoint or
branch instructions to me sounds like changing the point of discussion.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 15:25                                 ` Russell King - ARM Linux
@ 2012-12-10 16:31                                   ` Steven Rostedt
  0 siblings, 0 replies; 35+ messages in thread
From: Steven Rostedt @ 2012-12-10 16:31 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Will Deacon, Jon Medhurst (Tixy),
	Frederic Weisbecker, linux-kernel, Rabin Vincent, Ingo Molnar,
	linux-arm-kernel

On Mon, 2012-12-10 at 15:25 +0000, Russell King - ARM Linux wrote:
> On Mon, Dec 10, 2012 at 09:46:41AM -0500, Steven Rostedt wrote:
> > Again, you and I are having a disconnect. I'm not a HW expert. I'm
> > trying to get a total understanding of what you, Will, Jon and others
> > are trying to say.
> 
> Well, there's people who think that you're intentionally trying to wind
> me up (I'm not alone in this opinion; believe me, I checked with someone
> else taking part in this thread and they said as much...)

I'm sorry that you and others feel that way. What benefit would I get to
do such a thing? That would be counter productive on my part, and
honestly, that's the last thing I want to do. I think it may just be
that my general personality and my way of writing may be of issue. Ask
others that deal with me. I'm not saying anything differently to you
than I do to anyone else. But that's just me. I'm really not taking any
of this personal nor am I trying to piss you or anyone else off. I'm
just trying to talk technical here.


> 
> > > ... which, if it's misaligned to a 32-bit boundary, which can happen with
> > > Thumb-2 code, will require the replacement to be done atomically; you will
> > > need to use stop_machine() to ensure that other CPUs don't try to execute
> > > the instruction mid-way through modification... as I have already
> > > explained in my previous mails.
> > 
> > I'm confused to what is wrong to "misaligned to a 32-bit boundery".
> > Isn't it best if it is on a 32-bit boundary? Or do you mean that it's
> > misaligned across a 32-bit boundary? I guess I just read it wrong.
> 
> What I mean is a store of 32-bit size to an address which is not
> numerically an integer multiple of four.
> 
> To see why this is a problem, take a moment to think about how you'd
> update a misaligned 32-bit value on a 32-bit bus with byte enables.
> You need to do it as two transactions.
> 
> If your bus is 64-bits wide, then the problem potentially becomes one
> where there's an issue if it crosses a 64-bit boundary.  Continue for
> larger bus widths...
> 
> Now add in the effect of caching with its cache line boundaries, and
> what the effects are if a write crosses the cache line boundary (which
> means it ends up with two separate validity bits etc.)
> 
> Lastly, remember that ARM CPUs have a Harvard cache architecture; that
> means that the data paths are entirely separate from the instruction
> paths - and in some cases that goes all the way to the memory controller,
> but that's not relevant.  The relevant point here is that the point in
> the pathways where the instruction and data paths unite can be quite
> some distance _outside_ of the CPU.
> 
> What this all means is that a misaligned 32-bit store can ultimately
> appear as two separate 16-bit stores, which may be interleaved by
> other bus activity.  Whether that is visible to other CPUs in a SMP
> system as two separate 16-bit stores or not isn't well defined.
> 
> x86 in this regard is beautiful; it's fully coherent with everything.
> It enforces correctness for almost every situation.  It manages this
> by using a hell of a lot of logic to do interlocking and ensure
> correct ordering.  If you want that from an ARM CPU then you'd probably
> need a comparible amount of logic - and power - to be able to do that.

I'm going to be bluntly honest here, and say that I do not fully
understand all the intrinsic operations that you've explained above.
This may be part of the frustration that you have with me. Not that I'm
trying to piss you off, but the fact that I'm not as well versed of all
the details that goes on inside an ARM processor. Yes, I've been spoiled
with x86, but I'm trying hard to understand the differences with ARM. 

I may be asking stupid questions that are very obvious to you, but to
me, this is a new world. Please have some patience with me.

> 
> > Either way, I said there's probably no guarantee that the 32-bit calls
> > to mcount that gcc has inserted (or the tracepoints) are going to be
> > aligned to 32-bit boundaries.
> 
> Correct; there is no guarantee of that what so ever when building for
> Thumb-2.
> 
> > But I'm wondering if that's still a
> > problem. Let's look at the ways another CPU could get the 32-bit
> > instruction if it is misaligned, and across two different cache lines,
> > or even two different pages:
> > 
> > 
> > 1) the CPU gets the full 32bits as it was on the other CPU, or how it
> > will be.
> > 
> > 2) The CPU gets the first 16bits as it was on the other CPU an the
> > second 16bits with the update.
> > 
> > 3) The CPU gets the first 16bits with the update and the second 16bits
> > as it use to be.
> > 
> > 
> > The first case isn't interesting, so lets jump to the 2 and 3rd cases.
> > 
> > On an update of a 32bit nop to a 16bit breakpoint or branch (jump over
> > second half).
> 
> Err.  Let me remind you what you said in the message which I replied to
> earlier today:
> 
>    We are replacing a 32bit call with a nop. That nop must also         
>                       ^^^^^
>    be 32bits, because we could eventually replace the nop(s) with a 32bit
>       ^^^^^^          
>    call.
> 
> Maybe that's sloppy language, but I tend to read what's written and
> interpret it as written... so to now say about 16-bit breakpoint or
> branch instructions to me sounds like changing the point of discussion.


The grand view is to change a 32bit nop to a 32bit branch and link, and
vice-versa. But because the 32bit operation may cross cache-lines or
pages, there's no safe way to to do that directly (I'm assuming from
everything that I've been told). Thus, the idea is to break the
conversion up into steps where we only change half of the instruction.
Changing the first half to be either a breakpoint or a branch that skips
the second half. Considering that no matter how the processor sees the
result, it wont be an issue. This is where there's a lot of assumptions
that I'm trying to understand, and where you may be frustrated with me.

If we can change the first half of the instruction with a 16bit
operation that skips the second half, then it may be possible to change
the entire 32bit op with another 32bit op in a series of steps, as I
explained several times.

-- Steve



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
  2012-12-10 10:04                     ` Will Deacon
  2012-12-10 13:02                       ` Steven Rostedt
@ 2012-12-10 16:45                       ` Jon Medhurst (Tixy)
  1 sibling, 0 replies; 35+ messages in thread
From: Jon Medhurst (Tixy) @ 2012-12-10 16:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Russell King - ARM Linux, Frederic Weisbecker, linux-kernel,
	Steven Rostedt, Rabin Vincent, Ingo Molnar, linux-arm-kernel

On Mon, 2012-12-10 at 10:04 +0000, Will Deacon wrote:
> Hi Jon,
> 
> Back-pedalling a bit here, but I'm confused by one of your points below:
> 
> On Fri, Dec 07, 2012 at 05:45:47PM +0000, Jon Medhurst (Tixy) wrote:
> > On Fri, 2012-12-07 at 12:13 -0500, Steven Rostedt wrote:
> > > I'll make my question more general:
> > > 
> > > If I have a nop, that is a size of a call (branch and link), which is
> > > near the beginning of a function and not part of any conditional, and I
> > > want to convert it into a call (branch and link), would adding a
> > > breakpoint to it, modifying it to the call, and then removing the
> > > breakpoint be possible? Of course it would require syncing in between
> > > steps, but my question is, if the above is possible on a thumb2 ARM
> > > processor?
> > 
> > I believe so. The details are (repeating your earlier explanation) ...
> > 
> > 1. Replace first half of nop with 16bit 'breakpoint' instruction.
> 
> Sort of -- you'd actually need 2x16-bit nops to make this work.
> 
> > 2. Sync.(cache flush to PoU + IPIs to make other cores invalidate the
> > icache for changed part of the nop instruction).
> 
> Why do you need to use IPIs for I-cache invalidation on other cores? For
> ARMv7 SMP (i.e. the multi-processing extensions) doing I-cache invalidation
> by MVA to PoU will be broadcast to the applicable domain for the
> shareability attributes of the address. So if you do icimvau with an
> inner-shareable virtual address, it will be broadcast by the hardware.

That was a clue I was missing, and it means that my patch which spawned
this thread is flawed. The original problem I was trying to cure was
random crashes whilst ftrace_modify_all_code() was going round modifying
kernel functions, and I fixed this by getting all cores to
__flush_icache_all() after the modifications had been made. But if cache
flushes are broadcast across all cores then my reasoning for the fix is
wrong.

As this only seems to surface on TC2 perhaps CCI doesn't do the magic we
want, or we have it misconfigured, or were been hit by cache differences
between A7 and A15? (I've seen comments somewhere which says A7 has VIPT
aliasing I-cache, and A15 is PIPT non-aliasing).

Will need to some more detailed investigation when I get time.

-- 
Tixy


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2012-12-10 16:45 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-06 18:11 [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus Jon Medhurst (Tixy)
2012-12-06 19:19 ` Steven Rostedt
2012-12-07  9:22   ` Jon Medhurst (Tixy)
2012-12-07 14:03     ` Steven Rostedt
2012-12-07 14:55       ` Jon Medhurst (Tixy)
2012-12-07 15:28         ` Steven Rostedt
2012-12-07 15:40           ` Jon Medhurst (Tixy)
2012-12-07 16:09             ` Steven Rostedt
2012-12-07 16:23           ` Russell King - ARM Linux
2012-12-07 16:36             ` Steven Rostedt
2012-12-07 16:45               ` Russell King - ARM Linux
2012-12-07 17:13                 ` Steven Rostedt
2012-12-07 17:45                   ` Jon Medhurst (Tixy)
2012-12-07 18:06                     ` Steven Rostedt
2012-12-07 18:17                       ` Steven Rostedt
2012-12-07 18:18                       ` Jon Medhurst (Tixy)
2012-12-10 10:04                     ` Will Deacon
2012-12-10 13:02                       ` Steven Rostedt
2012-12-10 13:33                         ` Will Deacon
2012-12-10 13:40                         ` Jamie Lokier
2012-12-10 14:56                           ` Will Deacon
2012-12-10 13:57                         ` Russell King - ARM Linux
2012-12-10 14:06                           ` Steven Rostedt
2012-12-10 14:07                             ` Russell King - ARM Linux
2012-12-10 14:46                               ` Steven Rostedt
2012-12-10 15:25                                 ` Russell King - ARM Linux
2012-12-10 16:31                                   ` Steven Rostedt
2012-12-10 16:45                       ` Jon Medhurst (Tixy)
2012-12-07 18:13                   ` Russell King - ARM Linux
2012-12-07 18:43                     ` Steven Rostedt
2012-12-07 19:02                       ` Will Deacon
2012-12-07 20:01                         ` Steven Rostedt
2012-12-10 11:04                         ` Jon Medhurst (Tixy)
2012-12-10 11:24                           ` Will Deacon
2012-12-10 14:02                             ` Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).