linux-kernel.vger.kernel.org archive mirror
* [patch 0/2] sLeAZY FPU feature
@ 2006-07-01 17:11 Arjan van de Ven
  2006-07-01 17:12 ` [patch 1/2] sLeAZY FPU feature - x86_64 support Arjan van de Ven
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Arjan van de Ven @ 2006-07-01 17:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, ak

Hi,

the two patches in this series (the x86-64 one by me, the i386 one by
Chuck Ebbert) change how the lazy fpu feature works. In the current
situation, we are 100% lazy, meaning that after every context switch,
the application takes a trap on the first FPU use, which then restores
the FPU context.

The sLeAZY FPU patch changes this behavior; if a process has used the
FPU for 5 stints in a row, the behavior becomes proactive and the FPU
context is restored during the regular context switch already. This
means we can avoid the trap.

The underlying assumption is that if a process uses the FPU 5 consecutive
times, it's likely to do so the 6th and later times as well (i.e. it's not
one-off behavior).

There is a limit built in; this proactive behavior resets after 256
times (when the 8-bit counter wraps), so that when a long-lived process
changes behavior, it'll still get the right behavior (for performance)
after some time.
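
A minimal user-space sketch of the heuristic described above, modeled on the
hunks in patches 1/2 and 2/2. The names `task_model`, `model_restore`,
`model_switch_out`, `model_timeslice` and `count_traps` are hypothetical and
exist only for this illustration; nothing here is kernel code. Note that the
patches test `fpu_counter > 5`, so in this model it takes six trapping
timeslices, not five, before the eager restore kicks in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task state touched by the patches. */
struct task_model {
	unsigned char fpu_counter;	/* 8 bits: 255 + 1 wraps back to 0 */
	bool used_fpu;			/* models the TS_USEDFPU status bit */
};

#define EAGER_THRESHOLD 5		/* the patches test fpu_counter > 5 */

/* Models math_state_restore(): FPU state is loaded, counter is bumped. */
static void model_restore(struct task_model *t)
{
	t->used_fpu = true;
	t->fpu_counter++;
}

/* Models unlazy_fpu() at switch-out: a slice without FPU use resets the run. */
static void model_switch_out(struct task_model *t)
{
	if (!t->used_fpu)
		t->fpu_counter = 0;
	t->used_fpu = false;		/* state saved; next slice starts lazy */
}

/* Runs one timeslice; returns true if a device-not-available trap was taken. */
static bool model_timeslice(struct task_model *t, bool touches_fpu)
{
	bool trapped = false;

	if (t->fpu_counter > EAGER_THRESHOLD)
		model_restore(t);	/* eager restore during the switch */
	if (touches_fpu && !t->used_fpu) {
		model_restore(t);	/* lazy path: first FPU use traps */
		trapped = true;
	}
	model_switch_out(t);
	return trapped;
}

/* Counts traps over a number of timeslices with uniform FPU behavior. */
static size_t count_traps(struct task_model *t, size_t slices, bool touches_fpu)
{
	size_t traps = 0;

	assert(t != NULL);
	for (size_t i = 0; i < slices; i++)
		traps += model_timeslice(t, touches_fpu);
	return traps;
}
```

In this model a task that touches the FPU in every timeslice takes 6 traps per
256 timeslices, where a purely lazy scheme would take 256. Because the eager
restore itself marks the FPU as used, a task that stops using the FPU keeps
getting eager restores until the counter wraps; that wrap is the built-in
limit.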

Chuck measured a performance gain of roughly 0.4%, and my experiments show a
similar improvement.

Greetings,
   Arjan van de Ven



* [patch 1/2] sLeAZY FPU feature - x86_64 support
  2006-07-01 17:11 [patch 0/2] sLeAZY FPU feature Arjan van de Ven
@ 2006-07-01 17:12 ` Arjan van de Ven
  2006-07-01 21:49   ` Andi Kleen
  2006-07-01 17:13 ` [patch 2/2] sLeAZY FPU feature - i386 support Arjan van de Ven
  2006-07-01 17:40 ` [patch 0/2] sLeAZY FPU feature Nick Piggin
  2 siblings, 1 reply; 7+ messages in thread
From: Arjan van de Ven @ 2006-07-01 17:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: ak, akpm

From: Arjan van de Ven <arjan@linux.intel.com>

right now the kernel on x86-64 has a 100% lazy fpu behavior: after
*every* context switch a trap is taken for the first FPU use to restore
the FPU context lazily. This is of course great for applications that
have very sporadic or no FPU use (since then you avoid doing the
expensive save/restore all the time). However for very frequent FPU
users... you take an extra trap every context switch.

The patch below adds a simple heuristic to this code: After 5
consecutive context switches of FPU use, the lazy behavior is disabled
and the context gets restored every context switch. If the app indeed
uses the FPU, the trap is avoided. (The chance of the 6th time slice
using the FPU after the previous 5 have done so is obviously quite high.)

After 256 switches, this is reset and the lazy behavior returns (until
there are 5 consecutive ones again). The reason for this is to give the
lazy behavior back, after some time, to apps that only do longer bursts
of FPU use.


Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

---
 arch/x86_64/kernel/process.c |   10 ++++++++++
 arch/x86_64/kernel/traps.c   |    1 +
 include/asm-x86_64/i387.h    |    5 ++++-
 include/linux/sched.h        |    9 +++++++++
 4 files changed, 24 insertions(+), 1 deletion(-)

Index: linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
===================================================================
--- linux-2.6.17-sleazyfpu.orig/arch/x86_64/kernel/process.c
+++ linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
@@ -515,6 +515,10 @@ __switch_to(struct task_struct *prev_p, 
 	int cpu = smp_processor_id();  
 	struct tss_struct *tss = &per_cpu(init_tss, cpu);
 
+	/* we're going to use this soon, after a few expensive things */
+	if (next_p->fpu_counter>5)
+		prefetch(&next->i387.fxsave);
+
 	/*
 	 * Reload esp0, LDT and the page table pointer:
 	 */
@@ -618,6 +622,12 @@ __switch_to(struct task_struct *prev_p, 
 		}
 	}
 
+	/* If the task has used fpu the last 5 timeslices, just do a full
+	 * restore of the math state immediately to avoid the trap; the
+	 * chances of needing FPU soon are obviously high now
+	 */
+	if (next_p->fpu_counter>5)
+		math_state_restore();
 	return prev_p;
 }
 
Index: linux-2.6.17-sleazyfpu/arch/x86_64/kernel/traps.c
===================================================================
--- linux-2.6.17-sleazyfpu.orig/arch/x86_64/kernel/traps.c
+++ linux-2.6.17-sleazyfpu/arch/x86_64/kernel/traps.c
@@ -1061,6 +1061,7 @@ asmlinkage void math_state_restore(void)
 		init_fpu(me);
 	restore_fpu_checking(&me->thread.i387.fxsave);
 	task_thread_info(me)->status |= TS_USEDFPU;
+	me->fpu_counter++;
 }
 
 void __init trap_init(void)
Index: linux-2.6.17-sleazyfpu/include/asm-x86_64/i387.h
===================================================================
--- linux-2.6.17-sleazyfpu.orig/include/asm-x86_64/i387.h
+++ linux-2.6.17-sleazyfpu/include/asm-x86_64/i387.h
@@ -24,6 +24,7 @@ extern unsigned int mxcsr_feature_mask;
 extern void mxcsr_feature_mask_init(void);
 extern void init_fpu(struct task_struct *child);
 extern int save_i387(struct _fpstate __user *buf);
+extern asmlinkage void math_state_restore(void);
 
 /*
  * FPU lazy state save handling...
@@ -31,7 +32,9 @@ extern int save_i387(struct _fpstate __u
 
 #define unlazy_fpu(tsk) do { \
 	if (task_thread_info(tsk)->status & TS_USEDFPU) \
-		save_init_fpu(tsk); \
+		save_init_fpu(tsk); 			\
+	else						\
+		tsk->fpu_counter = 0;			\
 } while (0)
 
 /* Ignore delayed exceptions from user space */
Index: linux-2.6.17-sleazyfpu/include/linux/sched.h
===================================================================
--- linux-2.6.17-sleazyfpu.orig/include/linux/sched.h
+++ linux-2.6.17-sleazyfpu/include/linux/sched.h
@@ -1023,6 +1023,15 @@ struct task_struct {
 	spinlock_t delays_lock;
 	struct task_delay_info *delays;
 #endif
+	/*
+	 * fpu_counter contains the number of consecutive context switches
+	 * that the FPU is used. If this is over a threshold, the lazy fpu
+	 * saving becomes unlazy to save the trap. This is an unsigned char
+	 * so that after 256 times the counter wraps and the behavior turns
+	 * lazy again; this to deal with bursty apps that only use FPU for
+	 * a short time
+	 */
+	unsigned char fpu_counter;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)




* [patch 2/2] sLeAZY FPU feature - i386 support
  2006-07-01 17:11 [patch 0/2] sLeAZY FPU feature Arjan van de Ven
  2006-07-01 17:12 ` [patch 1/2] sLeAZY FPU feature - x86_64 support Arjan van de Ven
@ 2006-07-01 17:13 ` Arjan van de Ven
  2006-07-01 17:40 ` [patch 0/2] sLeAZY FPU feature Nick Piggin
  2 siblings, 0 replies; 7+ messages in thread
From: Arjan van de Ven @ 2006-07-01 17:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: ak, akpm

From: Chuck Ebbert <76306.1226@compuserve.com>

i386 port of the sLeAZY FPU feature.
Chuck reports that this gives him an improvement of roughly 0.4% on his
simple benchmark.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

 arch/i386/kernel/process.c |   12 ++++++++++++
 arch/i386/kernel/traps.c   |    3 ++-
 include/asm-i386/i387.h    |    5 ++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6.17-sleazyfpu/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.17-sleazyfpu.orig/arch/i386/kernel/process.c
+++ linux-2.6.17-sleazyfpu/arch/i386/kernel/process.c
@@ -631,6 +631,11 @@ struct task_struct fastcall * __switch_t
 
 	__unlazy_fpu(prev_p);
 
+
+	/* we're going to use this soon, after a few expensive things */
+	if (next_p->fpu_counter > 5)
+		prefetch(&next->i387.fxsave);
+
 	/*
 	 * Reload esp0.
 	 */
@@ -689,6 +694,13 @@ struct task_struct fastcall * __switch_t
 
 	disable_tsc(prev_p, next_p);
 
+	/* If the task has used fpu the last 5 timeslices, just do a full
+	 * restore of the math state immediately to avoid the trap; the
+	 * chances of needing FPU soon are obviously high now
+	 */
+	if (next_p->fpu_counter > 5)
+		math_state_restore();
+
 	return prev_p;
 }
 
Index: linux-2.6.17-sleazyfpu/arch/i386/kernel/traps.c
===================================================================
--- linux-2.6.17-sleazyfpu.orig/arch/i386/kernel/traps.c
+++ linux-2.6.17-sleazyfpu/arch/i386/kernel/traps.c
@@ -1063,7 +1063,7 @@ fastcall unsigned char * fixup_x86_bogus
  * Must be called with kernel preemption disabled (in this case,
  * local interrupts are disabled at the call-site in entry.S).
  */
-asmlinkage void math_state_restore(struct pt_regs regs)
+asmlinkage void math_state_restore(void)
 {
 	struct thread_info *thread = current_thread_info();
 	struct task_struct *tsk = thread->task;
@@ -1073,6 +1073,7 @@ asmlinkage void math_state_restore(struc
 		init_fpu(tsk);
 	restore_fpu(tsk);
 	thread->status |= TS_USEDFPU;	/* So we fnsave on switch_to() */
+	tsk->fpu_counter++;
 }
 
 #ifndef CONFIG_MATH_EMULATION
Index: linux-2.6.17-sleazyfpu/include/asm-i386/i387.h
===================================================================
--- linux-2.6.17-sleazyfpu.orig/include/asm-i386/i387.h
+++ linux-2.6.17-sleazyfpu/include/asm-i386/i387.h
@@ -76,7 +76,9 @@ static inline void __save_init_fpu( stru
 
 #define __unlazy_fpu( tsk ) do { \
 	if (task_thread_info(tsk)->status & TS_USEDFPU) \
-		save_init_fpu( tsk ); \
+		save_init_fpu( tsk ); 			\
+	else						\
+		tsk->fpu_counter = 0;			\
 } while (0)
 
 #define __clear_fpu( tsk )					\
@@ -118,6 +120,7 @@ static inline void save_init_fpu( struct
 extern unsigned short get_fpu_cwd( struct task_struct *tsk );
 extern unsigned short get_fpu_swd( struct task_struct *tsk );
 extern unsigned short get_fpu_mxcsr( struct task_struct *tsk );
+extern asmlinkage void math_state_restore(void);
 
 /*
  * Signal frame handlers...




* Re: [patch 0/2] sLeAZY FPU feature
  2006-07-01 17:11 [patch 0/2] sLeAZY FPU feature Arjan van de Ven
  2006-07-01 17:12 ` [patch 1/2] sLeAZY FPU feature - x86_64 support Arjan van de Ven
  2006-07-01 17:13 ` [patch 2/2] sLeAZY FPU feature - i386 support Arjan van de Ven
@ 2006-07-01 17:40 ` Nick Piggin
  2006-07-01 19:42   ` Arjan van de Ven
  2 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2006-07-01 17:40 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm, ak

Arjan van de Ven wrote:
> Hi,
> 
> the two patches in this series (the x86-64 one by me, the i386 one by
> Chuck Ebbert) change how the lazy fpu feature works. In the current
> situation, we are 100% lazy, meaning that after every context switch,
> the application takes a trap on the first FPU use, which then restores
> the FPU context.
> 
> The sLeAZY FPU patch changes this behavior; if a process has used the
> FPU for 5 stints in a row, the behavior becomes proactive and the FPU
> context is restored during the regular context switch already. This
> means we can avoid the trap.
> 
> The underlying assumption is that if a process uses the FPU 5 consecutive
> times, it's likely to do so the 6th and later times as well (i.e. it's not
> one-off behavior).
> 
> There is a limit built in; this proactive behavior resets after 256
> times (when the 8-bit counter wraps), so that when a long-lived process
> changes behavior, it'll still get the right behavior (for performance)
> after some time.
> 
> Chuck measured a performance gain of roughly 0.4%, and my experiments show a
> similar improvement.

What sort of test? Any idea of the results for a best-case microbenchmark
(something like two threads ping-ponging a couple of futexes between them,
doing a single FPU op in between)?

-- 
SUSE Labs, Novell Inc.


* Re: [patch 0/2] sLeAZY FPU feature
  2006-07-01 17:40 ` [patch 0/2] sLeAZY FPU feature Nick Piggin
@ 2006-07-01 19:42   ` Arjan van de Ven
  0 siblings, 0 replies; 7+ messages in thread
From: Arjan van de Ven @ 2006-07-01 19:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, akpm, ak

On Sun, 2006-07-02 at 03:40 +1000, Nick Piggin wrote:

> What sort of test?

The one I did was a long-running FPU app (calculating pi using the FPU).

> Any idea of the results for a best-case microbenchmark
> (something like two threads ping-ponging a couple of futexes between them,
> doing a single FPU op in between)?

ok I wrote a test scenario for this; the performance gain I get with
this is 8.5% 

the FPU part of the hot loop I used is
                A = 0.3 * (A+B);
with A and B doubles
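
The test program itself was not posted. The following is only a rough sketch
of what such a ping-pong could look like, under stated assumptions: it uses
two processes sharing one futex word (instead of two threads with a couple of
futexes) so that no thread library is needed, and the helper names `futex`,
`wait_for_turn`, `pass_turn`, `player` and `run_pingpong` are hypothetical.
Only the hot-loop expression is taken from the mail above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* glibc has no futex() wrapper, so go through syscall(2). */
static long futex(atomic_int *addr, int op, int val)
{
	return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Sleep until *turn == me; FUTEX_WAIT returns at once if *turn != seen. */
static void wait_for_turn(atomic_int *turn, int me)
{
	int seen;

	while ((seen = atomic_load(turn)) != me)
		futex(turn, FUTEX_WAIT, seen);
}

/* Hand the turn to the other side and wake it if it is sleeping. */
static void pass_turn(atomic_int *turn, int other)
{
	atomic_store(turn, other);
	futex(turn, FUTEX_WAKE, 1);
}

/* One side of the ping-pong: a single FPU op per turn. */
static double player(atomic_int *turn, int me, long rounds)
{
	volatile double a = 1.0, b = 2.0;	/* volatile: keep the FPU op */

	assert(me == 0 || me == 1);
	for (long i = 0; i < rounds; i++) {
		wait_for_turn(turn, me);
		a = 0.3 * (a + b);		/* hot-loop op from the mail */
		pass_turn(turn, 1 - me);
	}
	return a;
}

/* Runs the ping-pong between parent and child; returns 0 on success. */
static int run_pingpong(long rounds)
{
	int status = 0;
	pid_t child;
	atomic_int *turn = mmap(NULL, sizeof(*turn), PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (turn == MAP_FAILED)
		return -1;
	atomic_init(turn, 0);			/* side 0 (the parent) starts */

	child = fork();
	if (child < 0) {
		munmap(turn, sizeof(*turn));
		return -1;
	}
	if (child == 0) {
		player(turn, 1, rounds);
		_exit(0);
	}
	player(turn, 0, rounds);
	if (waitpid(child, &status, 0) < 0)
		status = -1;
	munmap(turn, sizeof(*turn));
	return (status != -1 && WIFEXITED(status) &&
		WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

To make every handoff an actual context switch, both processes would have to
be pinned to the same CPU (for example with taskset), and the run would need
to be timed externally; neither is shown here.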





* Re: [patch 1/2] sLeAZY FPU feature - x86_64 support
  2006-07-01 17:12 ` [patch 1/2] sLeAZY FPU feature - x86_64 support Arjan van de Ven
@ 2006-07-01 21:49   ` Andi Kleen
  2006-07-01 21:56     ` Arjan van de Ven
  0 siblings, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2006-07-01 21:49 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm


> After 256 switches, this is reset and the lazy behavior returns (until
> there are 5 consecutive ones again). The reason for this is to give the
> lazy behavior back, after some time, to apps that only do longer bursts
> of FPU use.

Cool. This has been on my todo list forever.

However I'm not sure 256 is a good number. It seems a bit too high.

> Index: linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
> ===================================================================
> --- linux-2.6.17-sleazyfpu.orig/arch/x86_64/kernel/process.c
> +++ linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
> @@ -515,6 +515,10 @@ __switch_to(struct task_struct *prev_p, 
>  	int cpu = smp_processor_id();  
>  	struct tss_struct *tss = &per_cpu(init_tss, cpu);
>  
> +	/* we're going to use this soon, after a few expensive things */
> +	if (next_p->fpu_counter>5)
> +		prefetch(&next->i387.fxsave);

Did you measure whether this prefetch makes a difference? I would expect it
to be too soon to be really worthwhile (normally you need hundreds of
instructions for prefetches to make sense, and that's probably not the case
here).

>  #endif
> +	/*
> +	 * fpu_counter contains the number of consecutive context switches
> +	 * that the FPU is used. If this is over a threshold, the lazy fpu
> +	 * saving becomes unlazy to save the trap. This is an unsigned char
> +	 * so that after 256 times the counter wraps and the behavior turns
> +	 * lazy again; this to deal with bursty apps that only use FPU for
> +	 * a short time
> +	 */
> +	unsigned char fpu_counter;

Putting it at the end is also not good, because that is where the rarely
used cachelines are. It would probably be better in the thread structure.
-Andi


* Re: [patch 1/2] sLeAZY FPU feature - x86_64 support
  2006-07-01 21:49   ` Andi Kleen
@ 2006-07-01 21:56     ` Arjan van de Ven
  0 siblings, 0 replies; 7+ messages in thread
From: Arjan van de Ven @ 2006-07-01 21:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


> 
> However I'm not sure 256 is a good number. It seems a bit too high.

it's 256 context switches... if you care about context switch cycles
you'll do many, and 256 isn't a lot ;)

(remember that this is after 5 *consecutive* fpu uses, not just 5 uses
total, so you're really a heavy fpu user if you hit that)

> 
> > Index: linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
> > ===================================================================
> > --- linux-2.6.17-sleazyfpu.orig/arch/x86_64/kernel/process.c
> > +++ linux-2.6.17-sleazyfpu/arch/x86_64/kernel/process.c
> > @@ -515,6 +515,10 @@ __switch_to(struct task_struct *prev_p, 
> >  	int cpu = smp_processor_id();  
> >  	struct tss_struct *tss = &per_cpu(init_tss, cpu);
> >  
> > +	/* we're going to use this soon, after a few expensive things */
> > +	if (next_p->fpu_counter>5)
> > +		prefetch(&next->i387.fxsave);
> 
> Did you measure whether this prefetch makes a difference? I would expect it
> to be too soon to be really worthwhile (normally you need hundreds of
> instructions for prefetches to make sense, and that's probably not the case
> here).

s/instructions/cycles/

Well, there are 4 segment loads, a few MSR accesses, a few PDA writes and
optionally even the fxsave of the old task in between the prefetch and
the use of the memory; those do add up *bigtime*...
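
The pattern under discussion, issuing a prefetch early and overlapping the
memory latency with unrelated work, can be sketched in user space as follows.
This is an illustration only: `fx_area`, `consume` and `switch_like` are
hypothetical names, the busy loop is merely a stand-in for the segment loads
and MSR accesses, and `__builtin_prefetch` is the GCC/Clang builtin rather
than the kernel's `prefetch()` helper. Whether the hint pays off depends on
how many cycles separate it from the first use, which is exactly the question
raised above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical save area; 512 bytes is the size of an FXSAVE image. */
struct fx_area {
	unsigned char bytes[512];
};

/* Stand-in for the restore: reads the whole area. */
static unsigned long consume(const struct fx_area *fx)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < sizeof(fx->bytes); i++)
		sum += fx->bytes[i];
	return sum;
}

/* Prefetch first, do unrelated work, then touch the prefetched memory. */
static unsigned long switch_like(const struct fx_area *next_fx,
				 unsigned long work)
{
	volatile unsigned long sink = 0;

	assert(next_fx != NULL);

	/* Hint only: read access (0), keep in all cache levels (3).
	 * A single prefetch covers one cache line, not all 512 bytes. */
	__builtin_prefetch(next_fx, 0, 3);

	/* Unrelated work during which the line can arrive in cache. */
	for (unsigned long i = 0; i < work; i++)
		sink += i;

	return consume(next_fx);
}
```

The prefetch never changes the result, only (potentially) the timing, so a
benchmark with and without the hint is the only way to tell whether it helps.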




