* ARM926EJ-S TLB lockdown
@ 2010-09-01 15:19 Johannes Stezenbach
  2010-09-01 20:01 ` Linus Walleij
  0 siblings, 1 reply; 3+ messages in thread
From: Johannes Stezenbach @ 2010-09-01 15:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

this is just an FYI in case someone is interested, but comments
are of course welcome.

ARM926EJ-S has two TLBs, one is 64-entry 2-way set associative,
the other is 8-entry fully associative for lockdown TLB entries.
The lockdown TLB is currently unused in Linux.  I thought maybe
I could get a performance win so I added the following to
the MACHINE_START's .map_io function of my platform:

#define tlb_lockdown(addr) \
	__asm__ volatile ( \
		"  ldr r1, =" #addr "		@ virtual address\n" \
		"  mrc p15,0,r0,c10,c0,0	@ read lockdown register\n" \
		"  orr r0,r0,#1			@ set preserve bit\n" \
		"  mcr p15,0,r0,c10,c0,0	@ write lockdown register\n" \
		"  mcr p15,0,r1,c8,c7,1		@ invalidate TLB single entry\n" \
		"  ldr r1,[r1]			@ cause TLB miss to load TLB entry\n" \
		"  mrc p15,0,r0,c10,c0,0	@ read lockdown register\n" \
		"  bic r0,r0,#1			@ clear preserve bit\n" \
		"  mcr p15,0,r0,c10,c0,0	@ write lockdown register\n" \
		: : : "r0", "r1")

		tlb_lockdown(0xffff0000);	// exception vectors
		tlb_lockdown(0xc0000000);	// kernel code / data
		tlb_lockdown(0xc0100000);	// kernel code / data
		tlb_lockdown(0xc0200000);	// kernel code / data
		tlb_lockdown(0xc0300000);	// kernel code / data
		tlb_lockdown(0xc0400000);	// kernel code / data
		tlb_lockdown(0xc0500000);	// kernel code / data
		tlb_lockdown(0xc0600000);	// kernel code / data
#undef tlb_lockdown

I used a JTAG debugger to dump the TLB to confirm the lockdown entries
are correct and stay in the TLB during run time.
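
To make the register sequence in the macro easier to follow, here is a
minimal host-side sketch that models it in plain C. It is not kernel code
and does not touch CP15. The bit positions (P in bit 0, victim in bits
28:26) are as I read them in the ARM926EJ-S TRM and should be treated as
an assumption, as should the wrap-around of the victim counter. The names
tlb_model, model_victim and model_lockdown are hypothetical.

```c
#include <stdint.h>

/* Assumed layout of the TLB lockdown register (CP15 c10,c0,0). */
#define LOCKDOWN_P_BIT        (1u << 0)  /* preserve: next fill goes to lockdown TLB */
#define LOCKDOWN_VICTIM_SHIFT 26
#define LOCKDOWN_VICTIM_MASK  (0x7u << LOCKDOWN_VICTIM_SHIFT)
#define LOCKDOWN_ENTRIES      8

/* Hypothetical model of the lockdown TLB state. */
struct tlb_model {
	uint32_t lockdown_reg;              /* stands in for CP15 c10,c0,0 */
	uint32_t locked[LOCKDOWN_ENTRIES];  /* addresses held in the lockdown TLB */
};

/* Index of the lockdown entry that the next preserved fill will use. */
static unsigned model_victim(const struct tlb_model *t)
{
	return (t->lockdown_reg & LOCKDOWN_VICTIM_MASK) >> LOCKDOWN_VICTIM_SHIFT;
}

/* Mirrors the macro: set P, take one TLB miss, clear P. */
static void model_lockdown(struct tlb_model *t, uint32_t vaddr)
{
	unsigned victim;

	t->lockdown_reg |= LOCKDOWN_P_BIT;          /* orr r0,r0,#1 */

	/* The "ldr r1,[r1]" miss: the entry lands in the victim slot and
	 * the victim index advances (wrap-around is a model assumption). */
	victim = model_victim(t);
	t->locked[victim] = vaddr;
	victim = (victim + 1) % LOCKDOWN_ENTRIES;
	t->lockdown_reg = (t->lockdown_reg & ~LOCKDOWN_VICTIM_MASK) |
			  ((uint32_t)victim << LOCKDOWN_VICTIM_SHIFT);

	t->lockdown_reg &= ~LOCKDOWN_P_BIT;         /* bic r0,r0,#1 */
}
```

Calling model_lockdown() once per address in the list above fills the
slots in order, which matches what the JTAG dump would be expected to show
if the layout assumption holds.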

Then I compared lmbench results (with init=/bin/sh):

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null      open slct sig  sig  fork exec sh  
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
plain     Linux 2.6.32.  330 1.15 2.72 14.9 21.5 89.7 5.33 12.5 2497 9497 15.K
tlb       Linux 2.6.32.  330 1.11 1.96 14.8 21.1 89.3 3.90 12.4 2461 9392 15.K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
plain     Linux 2.6.32.  139.2  221.6  144.0  237.4  161.3   241.0   162.8
tlb       Linux 2.6.32.  134.3  216.0  139.6  228.2  158.4   234.1   158.6

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                        Create Delete Create Delete Latency Fault  Fault  selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
plain     Linux 2.6.32.   56.0   30.0  262.1   69.6  2764.0 2.817    21.9  43.4
tlb       Linux 2.6.32.   53.7   28.9  266.8   65.7  2806.0 2.500    21.9  44.3

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
plain     Linux 2.6.32. 33.6 36.3 30.6   44.3  115.1   95.5   83.9 113. 212.2
tlb       Linux 2.6.32. 34.0 34.6 30.9   45.7  117.9   95.5   83.9 115. 212.3


It seems syscall-heavy micro benchmarks like "null I/O" and "sig inst"
benefit, but most of the other changes are within the measurement noise.
I also ran an iperf TCP benchmark and got no improvement.
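
To put numbers on the tables above, the relative gain for a
smaller-is-better column can be computed as below. This is only a sketch;
the function name improvement_pct is hypothetical and the inputs are
copied from the lmbench rows.

```c
/* Percentage by which 'tlb' improves on 'plain' for a
 * smaller-is-better metric such as a latency in microseconds. */
static double improvement_pct(double plain, double tlb)
{
	return (plain - tlb) / plain * 100.0;
}
```

With the values from the first table, improvement_pct(2.72, 1.96) gives
roughly 28% for "null I/O" and improvement_pct(5.33, 3.90) roughly 27% for
"sig inst", while "null call" (1.15 vs 1.11) comes out at about 3.5%.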

BTW, I updated elinux.org Wiki page about lmbench.
http://elinux.org/Benchmark_Programs


Cheers,
Johannes

* ARM926EJ-S TLB lockdown
  2010-09-01 15:19 ARM926EJ-S TLB lockdown Johannes Stezenbach
@ 2010-09-01 20:01 ` Linus Walleij
  2010-09-02 14:30   ` Johannes Stezenbach
  0 siblings, 1 reply; 3+ messages in thread
From: Linus Walleij @ 2010-09-01 20:01 UTC (permalink / raw)
  To: linux-arm-kernel

Interesting stuff Johannes!

2010/9/1 Johannes Stezenbach <js@sig21.net>:

> 		tlb_lockdown(0xffff0000);	// exception vectors

This is probably clever to put in the lockdown TLB

> 		tlb_lockdown(0xc0000000);	// kernel code / data
> 		tlb_lockdown(0xc0100000);	// kernel code / data
> 		tlb_lockdown(0xc0200000);	// kernel code / data
> 		tlb_lockdown(0xc0300000);	// kernel code / data
> 		tlb_lockdown(0xc0400000);	// kernel code / data
> 		tlb_lockdown(0xc0500000);	// kernel code / data
> 		tlb_lockdown(0xc0600000);	// kernel code / data

But are these really most relevant to lock down?

Since you have a JTAG debugger, can't you profile what
memory pages are actually accessed most often and lock down
these?

But it can be even more elaborate. Profile out the *functions*
most used.

When I've worked with TCM I played with the idea to be able to
tag functions like this:

#define __hotfunc __attribute__((long_call)) __section(.hot.text) noinline
(...)
int __hotfunc foo();

Then have the linker put the hotfuncs into separate pages and
link that.

You can use the same scheme for locked-down TLBs I believe?
Up to 8 pages of code tagged "hotfunc" will be diverted to these
pages and locked down.

See the stuff in arch/arm/include/asm/tcm.h for the compiler
directives and check the link script in
arch/arm/kernel/vmlinux.lds.S to see how I'm separating the
TCM stuff to separate pages.
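
A userspace analogue of the tagging idea might look like the sketch below.
It is not the kernel macro: HOTFUNC and hot_add are hypothetical names,
the ARM-specific long_call attribute is left out so the code builds on any
ELF host, and the linker-script side (placing .hot.text on its own pages)
is not shown.

```c
/* Hypothetical analogue of __hotfunc for a hosted GCC/Clang build.
 * Falls back to a no-op where section attributes in this form are
 * not available. */
#if defined(__GNUC__) && defined(__ELF__)
#define HOTFUNC __attribute__((noinline, section(".hot.text")))
#else
#define HOTFUNC
#endif

/* Emitted into .hot.text instead of .text on ELF targets, so a linker
 * script could collect all such functions into dedicated pages. */
HOTFUNC int hot_add(int a, int b)
{
	return a + b;
}
```

On an ELF system, "readelf -S" on the resulting object should list a
.hot.text section containing the tagged function.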

Just my €0.01...

Yours,
Linus Walleij

* ARM926EJ-S TLB lockdown
  2010-09-01 20:01 ` Linus Walleij
@ 2010-09-02 14:30   ` Johannes Stezenbach
  0 siblings, 0 replies; 3+ messages in thread
From: Johannes Stezenbach @ 2010-09-02 14:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 01, 2010 at 10:01:14PM +0200, Linus Walleij wrote:
> 2010/9/1 Johannes Stezenbach <js@sig21.net>:
> 
> > 		tlb_lockdown(0xffff0000);	// exception vectors
> 
> This is probably clever to put in the lockdown TLB
> 
> > 		tlb_lockdown(0xc0000000);	// kernel code / data
> > 		tlb_lockdown(0xc0100000);	// kernel code / data
> > 		tlb_lockdown(0xc0200000);	// kernel code / data
> > 		tlb_lockdown(0xc0300000);	// kernel code / data
> > 		tlb_lockdown(0xc0400000);	// kernel code / data
> > 		tlb_lockdown(0xc0500000);	// kernel code / data
> > 		tlb_lockdown(0xc0600000);	// kernel code / data
> 
> But are these really most relevant to lock down?

Well, these were just easy to do.  The kernel uses 1MB sections
(not 4K pages) for these mappings, thus it is possible to cover
all kernel code + data + a bit more with just 7 TLB entries.
Kernel modules and everything else use 4K pages so it
becomes difficult to decide what to lock down.
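
The difference in granularity is easy to check with a little arithmetic.
The sketch below counts how many mappings of a given size a virtual range
touches. The helper name entries_needed and the two size macros are
hypothetical, and it assumes the range does not wrap around the 32-bit
address space.

```c
#include <stdint.h>

#define SECTION_SIZE (1u << 20)  /* 1 MB ARM section */
#define PAGE_SIZE_4K (1u << 12)  /* 4 KB small page */

/* Number of mappings of size 'granule' (a power of two) touched by
 * the range [start, start + len). Assumes no 32-bit wrap-around. */
static uint32_t entries_needed(uint32_t start, uint32_t len, uint32_t granule)
{
	uint32_t first, last;

	if (len == 0)
		return 0;
	first = start / granule;
	last = (start + (len - 1)) / granule;
	return last - first + 1;
}
```

For the 7 MB starting at 0xc0000000 this gives 7 entries with sections,
but 1792 entries if the same range were mapped with 4K pages, far beyond
the 8 lockdown slots.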

But OTOH the kernel's use of 1MB sections also means there is
not much TLB pressure, which I guess explains the small gain.

> Since you have a JTAG debugger, can't you profile what
> memory pages are actually accessed most often and lock down
> these?
> 
> But it can be even more elaborate. Profile out the *functions*
> most used.
> 
> When I've worked with TCM I played with the idea to be able to
> tag functions like this:
> 
> #define __hotfunc __attribute__((long_call)) __section(.hot.text) noinline
> (...)
> int __hotfunc foo();
> 
> Then have the linker put the hotfuncs into separate pages and
> link that.
> 
> You can use the same scheme for locked-down TLBs I believe?
> Up to 8 pages of code tagged "hotfunc" will be diverted to these
> pages and locked down.
> 
> See the stuff in arch/arm/include/asm/tcm.h for the compiler
> directives and check the link script in
> arch/arm/kernel/vmlinux.lds.S to see how I'm separating the
> TCM stuff to separate pages.

That's much more work than I'm willing to invest right now ;-/

I guess the intended use for the lockdown TLB is to minimize
latency for realtime code, e.g. if I had hooked up a FIQ handler
I would lock down its code, stack and data pages.


Thanks
Johannes
