linux-kernel.vger.kernel.org archive mirror
* Re: [RFC, patch] i386: vgetcpu(), take 2
@ 2006-06-21 12:24 Chuck Ebbert
  2006-06-21 17:14 ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Chuck Ebbert @ 2006-06-21 12:24 UTC
  To: Ingo Molnar
  Cc: Jakub Jelinek, Roland McGrath, Ulrich Drepper, Andi Kleen,
	Linus Torvalds, linux-kernel

In-Reply-To: <20060621081539.GA14227@elte.hu>

On Wed, 21 Jun 2006 10:15:39 +0200, Ingo Molnar wrote:

> * Chuck Ebbert <76306.1226@compuserve.com> wrote:
> 
> > Use a GDT entry's limit field to store per-cpu data for fast access 
> > from userspace, and provide a vsyscall to access the current CPU 
> > number stored there.
> 
> very nice idea! I thought of doing sys_get_cpu() too, but my idea was to 
> use the scheduler to keep a writable [and permanently pinned, 
> per-thread] VDSO data page uptodate with the current CPU# [and other 
> interesting data]. Btw., do we know how fast LSL is on modern CPUs?

Now that the GDT is a full page for each CPU there's plenty of space
for all kinds of per-cpu data, even if we waste 75% of it.  LSL seems
pretty fast; I got 13 clocks for the whole lsl/jnz/and sequence on K8
and 21 clocks on PII.  Maybe you can test P4?

/* test how fast lsl/jnz/and runs.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

#define rdtscll(t)	asm volatile ("rdtsc" : "=A" (t))

#ifndef ITERS
#define ITERS	1000000
#endif

int main(int argc, char * const argv[])
{
	unsigned long long tsc1, tsc2;
	int count, cpu, junk;

	rdtscll(tsc1);
	asm volatile (	/* volatile: keep the loop between the two rdtsc reads */
		"	pushl %%ds		\n"
		"	popl %2			\n"
		"1:				\n"
#ifdef DO_TEST
		"	lsl %2,%0		\n"
		"	jnz 2f			\n"
		"	and $0xff,%0		\n"
#endif
		"	dec %1			\n"
		"	jnz 1b			\n"
		"2:				\n"
		: "=&r" (cpu), "=&r" (count), "=&r" (junk)
		: "1" (ITERS), "0" (-1)
	);
	rdtscll(tsc2);

	if (count == 0)
		printf("loops: %d, avg: %llu clocks\n",
			ITERS, (tsc2 - tsc1) / ITERS);
	return 0;
}


> > +__vgetcpu:
> > +.LSTART_vgetcpu:
> > +   movl $-EFAULT,%eax
> > +   movl $((27<<3)|3),%edx
> > +   lsll %edx,%eax
> > +   jnz 1f
> > +   andl $0xff,%eax
> > +1:
> > +   ret
> 
> this needs unwinder annotations as well to make this a proper DSO, so 
> that for example a breakpoint here does not confuse gdb.

I can't write those.

> also, would be nice to do something like this in 64-bit mode too.

Andi has x86_64 patches in his tree and is considering this method for
ia32 support.

-- 
Chuck
 "You can't read a newspaper if you can't read."  --George W. Bush

* Re: [RFC, patch] i386: vgetcpu(), take 2
@ 2006-06-22 12:23 Chuck Ebbert
  2006-06-22 12:44 ` Andi Kleen
  0 siblings, 1 reply; 26+ messages in thread
From: Chuck Ebbert @ 2006-06-22 12:23 UTC
  To: Andi Kleen
  Cc: linux-kernel, Linus Torvalds, Ulrich Drepper, Roland McGrath,
	Jakub Jelinek, Ingo Molnar

In-Reply-To: <200606211914.37137.ak@suse.de>

On Wed, 21 Jun 2006 19:14:37 +0200, Andi Kleen wrote:

>> 
>> /* test how fast lsl/jnz/and runs.
>>  */
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> 
>> #define rdtscll(t)   asm volatile ("rdtsc" : "=A" (t))
>> 
>> #ifndef ITERS
>> #define ITERS        1000000
>> #endif
>> 
>> int main(int argc, char * const argv[])
>> {
>>      unsigned long long tsc1, tsc2;
>>      int count, cpu, junk;
>> 
>>      rdtscll(tsc1);
>>      asm (
>>              "       pushl %%ds              \n"
>>              "       popl %2                 \n"
>>              "1:                             \n"
>> #ifdef DO_TEST
>>              "       lsl %2,%0               \n"
>>              "       jnz 2f                  \n"
>>              "       and $0xff,%0            \n"
>> #endif
>>              "       dec %1                  \n"
>>              "       jnz 1b                  \n"
>>              "2:                             \n"
>>              : "=&r" (cpu), "=&r" (count), "=&r" (junk)
>>              : "1" (ITERS), "0" (-1)
>>      );
>>      rdtscll(tsc2);
>
> Measuring this way is a bad idea because you get far too much 
> noise from the RDTSCs. Usually you need to put a loop of a few thousand 
> iterations inside the RDTSC pair and divide the result by the loop count.

I got tired of people (namely me) forgetting to compile the C code
with optimization, so I did the loop in assembler.  It does 1000000
iterations by default.  Later I added the DO_TEST that lets you test
the empty loop just because I was curious.

A more realistic test with the two 'mov' instructions inside the loops
still only takes 16 clocks, so I'm wondering why you get 60?  Does the
vsyscall add that much overhead?  With this I get 29-30 clocks per loop
on Pentium II:


/* vgetcpu.c: test how fast vgetcpu runs
 * boot kernel with vgetcpu patch first, then build this:
 *  gcc -O3 -o vgetcpu vgetcpu.c <srcpath>/arch/i386/kernel/vsyscall-int80.so
 * (don't forget the optimization (-O3))
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

extern int __vgetcpu(void);

/* volatile keeps gcc from merging or moving the two rdtsc reads */
#define rdtscll(t)      asm volatile("rdtsc" : "=A" (t))

int main(int argc, char * const argv[])
{
        unsigned long long tsc1, tsc2;	/* unsigned to match %llu below */
        int i, iters = 999999;

        rdtscll(tsc1);
        for (i = 0; i < iters; i++)
                __vgetcpu();
        rdtscll(tsc2);

        printf("loops: %d, avg: %llu\n", iters, (tsc2 - tsc1) / iters);

        return 0;
}
-- 
Chuck
 "You can't read a newspaper if you can't read."  --George W. Bush

* [RFC, patch] i386: vgetcpu(), take 2
@ 2006-06-21  7:27 Chuck Ebbert
  2006-06-21  8:15 ` Ingo Molnar
  2006-06-21  9:26 ` Andi Kleen
  0 siblings, 2 replies; 26+ messages in thread
From: Chuck Ebbert @ 2006-06-21  7:27 UTC
  To: linux-kernel; +Cc: Linus Torvalds, Ingo Molnar, Andi Kleen

Use a GDT entry's limit field to store per-cpu data for fast access
from userspace, and provide a vsyscall to access the current CPU
number stored there.

Questions:
 1. Will the vdso relocation patch break this?
 2. Should the version number of the vsyscall .so be incremented?

Test program using the new call:

/* vgetcpu.c: get CPU number we are running on.
 * build kernel with vgetcpu patch first, then:
 *  gcc -o vgetcpu vgetcpu.c <srcpath>/arch/i386/kernel/vsyscall-sysenter.so
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

extern int __vgetcpu(void);

int main(int argc, char * const argv[])
{
	printf("cpu: %d\n", __vgetcpu());	/* int: may be -EFAULT */

	return 0;
}

---
 arch/i386/kernel/cpu/common.c        |    3 +++
 arch/i386/kernel/head.S              |    8 +++++++-
 arch/i386/kernel/vsyscall-getcpu.S   |   25 +++++++++++++++++++++++++
 arch/i386/kernel/vsyscall-int80.S    |    2 ++
 arch/i386/kernel/vsyscall-sysenter.S |    2 ++
 arch/i386/kernel/vsyscall.lds.S      |    1 +
 6 files changed, 40 insertions(+), 1 deletion(-)

--- 2.6.17-32.orig/arch/i386/kernel/cpu/common.c
+++ 2.6.17-32/arch/i386/kernel/cpu/common.c
@@ -642,6 +642,9 @@ void __cpuinit cpu_init(void)
 		((((__u64)stk16_off) << 32) & 0xff00000000000000ULL) |
 		(CPU_16BIT_STACK_SIZE - 1);
 
+	/* Set up GDT entry for per-cpu data */
+ 	*(__u64 *)(&gdt[27]) |= cpu;
+
 	cpu_gdt_descr->size = GDT_SIZE - 1;
  	cpu_gdt_descr->address = (unsigned long)gdt;
 
--- 2.6.17-32.orig/arch/i386/kernel/head.S
+++ 2.6.17-32/arch/i386/kernel/head.S
@@ -525,7 +525,13 @@ ENTRY(cpu_gdt_table)
 	.quad 0x004092000000ffff	/* 0xc8 APM DS    data */
 
 	.quad 0x0000920000000000	/* 0xd0 - ESPFIX 16-bit SS */
-	.quad 0x0000000000000000	/* 0xd8 - unused */
+
+	/*
+	 * Use a GDT entry to store per-cpu data for user space (DPL 3.)
+	 * 32-bit data segment, byte granularity, base 0, limit set at runtime.
+	 */
+	.quad 0x0040f20000000000	/* 0xd8 - for per-cpu user data */
+
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
 	.quad 0x0000000000000000	/* 0xf0 - unused */
--- /dev/null
+++ 2.6.17-32/arch/i386/kernel/vsyscall-getcpu.S
@@ -0,0 +1,25 @@
+/*
+ * vgetcpu
+ * This file is #include'd by vsyscall-*.S to define them after the
+ * vsyscall entry point.  The kernel assumes that the addresses of these
+ * routines are constant for all vsyscall implementations.
+ */
+
+#include <linux/errno.h>
+
+	.text
+	.org __kernel_rt_sigreturn+32,0x90
+	.globl __vgetcpu
+	.type __vgetcpu,@function
+__vgetcpu:
+.LSTART_vgetcpu:
+	movl $-EFAULT,%eax
+	movl $((27<<3)|3),%edx
+	lsll %edx,%eax
+	jnz 1f
+	andl $0xff,%eax
+1:
+	ret
+.LEND_vgetcpu:
+	.size __vgetcpu,.-.LSTART_vgetcpu
+
--- 2.6.17-32.orig/arch/i386/kernel/vsyscall-int80.S
+++ 2.6.17-32/arch/i386/kernel/vsyscall-int80.S
@@ -51,3 +51,5 @@ __kernel_vsyscall:
  * Get the common code for the sigreturn entry points.
  */
 #include "vsyscall-sigreturn.S"
+
+#include "vsyscall-getcpu.S"
--- 2.6.17-32.orig/arch/i386/kernel/vsyscall-sysenter.S
+++ 2.6.17-32/arch/i386/kernel/vsyscall-sysenter.S
@@ -120,3 +120,5 @@ SYSENTER_RETURN:
  * Get the common code for the sigreturn entry points.
  */
 #include "vsyscall-sigreturn.S"
+
+#include "vsyscall-getcpu.S"
--- 2.6.17-32.orig/arch/i386/kernel/vsyscall.lds.S
+++ 2.6.17-32/arch/i386/kernel/vsyscall.lds.S
@@ -57,6 +57,7 @@ VERSION
     	__kernel_vsyscall;
     	__kernel_sigreturn;
     	__kernel_rt_sigreturn;
+	__vgetcpu;
 
     local: *;
   };
-- 
Chuck
 "You can't read a newspaper if you can't read."  --George W. Bush


end of thread, other threads:[~2006-06-29  8:48 UTC | newest]

Thread overview: 26+ messages
2006-06-21 12:24 [RFC, patch] i386: vgetcpu(), take 2 Chuck Ebbert
2006-06-21 17:14 ` Andi Kleen
2006-06-21 17:27   ` Linus Torvalds
2006-06-21 17:50     ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2006-06-22 12:23 Chuck Ebbert
2006-06-22 12:44 ` Andi Kleen
2006-06-21  7:27 Chuck Ebbert
2006-06-21  8:15 ` Ingo Molnar
2006-06-21 17:38   ` Artur Skawina
2006-06-28  5:44   ` Paul Jackson
2006-06-28  8:53     ` Andi Kleen
2006-06-28  9:00       ` Ingo Molnar
2006-06-29  8:47         ` Paul Jackson
2006-06-21  9:26 ` Andi Kleen
2006-06-21  9:35   ` Ingo Molnar
2006-06-21 21:54   ` Rohit Seth
2006-06-21 22:21     ` Andi Kleen
2006-06-21 22:59       ` Rohit Seth
2006-06-21 23:05         ` Andi Kleen
2006-06-21 23:18           ` Rohit Seth
2006-06-21 23:29             ` Andi Kleen
2006-06-22  0:55               ` Rohit Seth
2006-06-22  8:08                 ` Andi Kleen
2006-06-22 21:06                   ` Rohit Seth
2006-06-22 22:14                     ` Andi Kleen
2006-06-22 23:10                       ` Rohit Seth
