* [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches
@ 2014-08-06 12:55 Mel Gorman
  2014-08-06 12:55 ` [PATCH 1/3] x86, fpu: Do not copy fpu preload state Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Mel Gorman @ 2014-08-06 12:55 UTC (permalink / raw)
  To: H Peter Anvin, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML, Mel Gorman

Eager FPU switching is used on CPUs that support xsave on the grounds
that such CPUs can optimise the switch with xsaveopt and xrstor instead
of paying the cost of updating CR0.TS, which has serialising semantics.

The path for eagerfpu is fatter than it needs to be because it still
maintains the fpu_counter for lazy FPU switches even though the information
is never used. This series splits the paths and optimises the eagerfpu path
a little. The benefit is marginal; it was just noticed when looking at why
integer-only workloads were spending time saving/restoring FPU state.
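
For reference, the decision being trimmed is roughly the one below. This is
a standalone userspace sketch with simplified stand-in types, not the kernel
code itself; use_eager_fpu(), tsk_used_math() and fpu_counter mirror the
names in fpu-internal.h but everything else is made up for illustration.

	#include <stdbool.h>
	#include <stdio.h>

	struct thread_sketch {
		bool used_math;		/* task has ever touched the FPU */
		unsigned fpu_counter;	/* consecutive switches that used the FPU */
	};

	static bool use_eager_fpu(void)
	{
		return true;		/* pretend the CPU is xsave-capable */
	}

	/* Should the incoming task's FPU state be preloaded at switch time? */
	static bool should_preload(const struct thread_sketch *next)
	{
		if (!next->used_math)
			return false;
		if (use_eager_fpu())
			return true;			/* eager: always restore */
		return next->fpu_counter > 5;		/* lazy: frequent users only */
	}

	int main(void)
	{
		struct thread_sketch next = { .used_math = true, .fpu_counter = 2 };

		printf("preload FPU state: %s\n", should_preload(&next) ? "yes" : "no");
		return 0;
	}

On an eagerfpu system the fpu_counter arm of that test is dead weight,
which is what the series removes.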

 arch/x86/include/asm/fpu-internal.h | 46 +++++++++++++++++++++++++++++--------
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 3 files changed, 38 insertions(+), 12 deletions(-)

-- 
1.8.4.5



* [PATCH 1/3] x86, fpu: Do not copy fpu preload state
  2014-08-06 12:55 [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
@ 2014-08-06 12:55 ` Mel Gorman
  2014-08-06 12:55 ` [PATCH 2/3] x86, fpu: Split FPU save state preparation into eagerfpu and !eagerfpu parts Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2014-08-06 12:55 UTC (permalink / raw)
  To: H Peter Anvin, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML, Mel Gorman

__switch_to() declares an fpu_switch_t on the stack and then calls a prepare
function which declares a second one on the stack and returns it by value,
copying the result back. As the functions are inline, the compiler optimises
the copy away, but it looks odd. This patch passes the caller's instance by
pointer so that it is clear a single fpu_switch_t is used.
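
The shape of the change is the usual return-by-value to out-parameter
conversion, roughly as in the toy sketch below (illustration only, with a
made-up struct rather than the real fpu_switch_t):

	#include <stdio.h>

	struct token { int preload; };

	/* Before: the callee builds its own instance and the caller copies it. */
	static struct token prepare_byval(void)
	{
		struct token t = { .preload = 1 };
		return t;
	}

	/* After: the caller's single instance is filled in through a pointer. */
	static void prepare_byptr(struct token *t)
	{
		t->preload = 1;
	}

	int main(void)
	{
		struct token a = prepare_byval();
		struct token b;

		prepare_byptr(&b);
		printf("%d %d\n", a.preload, b.preload);
		return 0;
	}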

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/fpu-internal.h | 14 ++++++--------
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 115e368..b8771c4a 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -416,15 +416,14 @@ static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 		cpu == new->thread.fpu.last_cpu;
 }
 
-static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
+static inline void switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
+						fpu_switch_t *fpu)
 {
-	fpu_switch_t fpu;
-
 	/*
 	 * If the task has used the math, pre-load the FPU on xsave processors
 	 * or if the past 5 consecutive context-switches used math.
 	 */
-	fpu.preload = tsk_used_math(new) && (use_eager_fpu() ||
+	fpu->preload = tsk_used_math(new) && (use_eager_fpu() ||
 					     new->thread.fpu_counter > 5);
 	if (__thread_has_fpu(old)) {
 		if (!__save_init_fpu(old))
@@ -433,7 +432,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
 		old->thread.fpu.has_fpu = 0;	/* But leave fpu_owner_task! */
 
 		/* Don't change CR0.TS if we just switch! */
-		if (fpu.preload) {
+		if (fpu->preload) {
 			new->thread.fpu_counter++;
 			__thread_set_has_fpu(new);
 			prefetch(new->thread.fpu.state);
@@ -442,16 +441,15 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
 	} else {
 		old->thread.fpu_counter = 0;
 		old->thread.fpu.last_cpu = ~0;
-		if (fpu.preload) {
+		if (fpu->preload) {
 			new->thread.fpu_counter++;
 			if (!use_eager_fpu() && fpu_lazy_restore(new, cpu))
-				fpu.preload = 0;
+				fpu->preload = 0;
 			else
 				prefetch(new->thread.fpu.state);
 			__thread_fpu_begin(new);
 		}
 	}
-	return fpu;
 }
 
 /*
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7bc86bb..260ed717 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -257,7 +257,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
 
-	fpu = switch_fpu_prepare(prev_p, next_p, cpu);
+	switch_fpu_prepare(prev_p, next_p, cpu, &fpu);
 
 	/*
 	 * Reload esp0.
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca5b02d..b155462 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -284,7 +284,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	unsigned fsindex, gsindex;
 	fpu_switch_t fpu;
 
-	fpu = switch_fpu_prepare(prev_p, next_p, cpu);
+	switch_fpu_prepare(prev_p, next_p, cpu, &fpu);
 
 	/*
 	 * Reload esp0, LDT and the page table pointer:
-- 
1.8.4.5



* [PATCH 2/3] x86, fpu: Split FPU save state preparation into eagerfpu and !eagerfpu parts
  2014-08-06 12:55 [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
  2014-08-06 12:55 ` [PATCH 1/3] x86, fpu: Do not copy fpu preload state Mel Gorman
@ 2014-08-06 12:55 ` Mel Gorman
  2014-08-06 12:55 ` [PATCH 3/3] x86, fpu: Do not update fpu_counter in the eagerfpu case Mel Gorman
  2014-08-27 16:03 ` [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
  3 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2014-08-06 12:55 UTC (permalink / raw)
  To: H Peter Anvin, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML, Mel Gorman

If the CPU supports xsave then a different, non-lazy (eager) save path is
used for the FPU during context switch. The eager path is heavier than it
needs to be, so this patch splits the two cases in preparation for
optimising the eager one.
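
The end state being aimed for is a thin dispatcher in front of two
specialised helpers. The sketch below shows only the preload decision with
simplified stand-ins; the real helpers also handle CR0.TS, last_cpu tracking
and prefetching, so treat it as an outline rather than the final code.

	#include <stdbool.h>
	#include <stdio.h>

	struct task_sketch {
		bool used_math;
		unsigned fpu_counter;
	};

	static bool use_eager_fpu(void)
	{
		return true;	/* pretend the CPU supports xsave */
	}

	/* Eager path: preload whenever the task has ever used the FPU. */
	static bool eager_preload(const struct task_sketch *next)
	{
		return next->used_math;
	}

	/* Lazy path: keep the fpu_counter heuristic. */
	static bool lazy_preload(const struct task_sketch *next)
	{
		return next->used_math && next->fpu_counter > 5;
	}

	/* Thin dispatcher, mirroring the split made by this patch. */
	static bool switch_preload(const struct task_sketch *next)
	{
		return use_eager_fpu() ? eager_preload(next) : lazy_preload(next);
	}

	int main(void)
	{
		struct task_sketch next = { .used_math = true, .fpu_counter = 1 };

		printf("preload: %s\n", switch_preload(&next) ? "yes" : "no");
		return 0;
	}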

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/fpu-internal.h | 40 ++++++++++++++++++++++++++++++++-----
 1 file changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index b8771c4a..8d92807 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -416,15 +416,36 @@ static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 		cpu == new->thread.fpu.last_cpu;
 }
 
-static inline void switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
+static inline void switch_eagerfpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
+						fpu_switch_t *fpu)
+{
+	fpu->preload = tsk_used_math(new);
+
+	if (__thread_has_fpu(old)) {
+		if (!__save_init_fpu(old))
+			cpu = ~0;
+		old->thread.fpu.last_cpu = cpu;
+		old->thread.fpu.has_fpu = 0;	/* But leave fpu_owner_task! */
+	} else {
+		old->thread.fpu_counter = 0;
+		old->thread.fpu.last_cpu = ~0;
+	}
+
+	if (fpu->preload) {
+		new->thread.fpu_counter++;
+		__thread_set_has_fpu(new);
+		prefetch(new->thread.fpu.state);
+	}
+}
+
+static inline void switch_preloadfpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
 						fpu_switch_t *fpu)
 {
 	/*
 	 * If the task has used the math, pre-load the FPU on xsave processors
 	 * or if the past 5 consecutive context-switches used math.
 	 */
-	fpu->preload = tsk_used_math(new) && (use_eager_fpu() ||
-					     new->thread.fpu_counter > 5);
+	fpu->preload = tsk_used_math(new) && (new->thread.fpu_counter > 5);
 	if (__thread_has_fpu(old)) {
 		if (!__save_init_fpu(old))
 			cpu = ~0;
@@ -436,14 +457,14 @@ static inline void switch_fpu_prepare(struct task_struct *old, struct task_struc
 			new->thread.fpu_counter++;
 			__thread_set_has_fpu(new);
 			prefetch(new->thread.fpu.state);
-		} else if (!use_eager_fpu())
+		} else
 			stts();
 	} else {
 		old->thread.fpu_counter = 0;
 		old->thread.fpu.last_cpu = ~0;
 		if (fpu->preload) {
 			new->thread.fpu_counter++;
-			if (!use_eager_fpu() && fpu_lazy_restore(new, cpu))
+			if (fpu_lazy_restore(new, cpu))
 				fpu->preload = 0;
 			else
 				prefetch(new->thread.fpu.state);
@@ -452,6 +473,15 @@ static inline void switch_fpu_prepare(struct task_struct *old, struct task_struc
 	}
 }
 
+static inline void switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu,
+						fpu_switch_t *fpu)
+{
+	if (use_eager_fpu())
+		switch_eagerfpu_prepare(old, new, cpu, fpu);
+	else
+		switch_preloadfpu_prepare(old, new, cpu, fpu);
+}
+
 /*
  * By the time this gets called, we've already cleared CR0.TS and
  * given the process the FPU if we are going to preload the FPU
-- 
1.8.4.5



* [PATCH 3/3] x86, fpu: Do not update fpu_counter in the eagerfpu case
  2014-08-06 12:55 [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
  2014-08-06 12:55 ` [PATCH 1/3] x86, fpu: Do not copy fpu preload state Mel Gorman
  2014-08-06 12:55 ` [PATCH 2/3] x86, fpu: Split FPU save state preparation into eagerfpu and !eagerfpu parts Mel Gorman
@ 2014-08-06 12:55 ` Mel Gorman
  2014-08-27 16:03 ` [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
  3 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2014-08-06 12:55 UTC (permalink / raw)
  To: H Peter Anvin, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML, Mel Gorman

In the non-lazy FPU restore case there is no need for fpu_counter. This
patch avoids updating the counter unnecessarily.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/include/asm/fpu-internal.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index 8d92807..85fdd6b 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -427,12 +427,10 @@ static inline void switch_eagerfpu_prepare(struct task_struct *old, struct task_
 		old->thread.fpu.last_cpu = cpu;
 		old->thread.fpu.has_fpu = 0;	/* But leave fpu_owner_task! */
 	} else {
-		old->thread.fpu_counter = 0;
 		old->thread.fpu.last_cpu = ~0;
 	}
 
 	if (fpu->preload) {
-		new->thread.fpu_counter++;
 		__thread_set_has_fpu(new);
 		prefetch(new->thread.fpu.state);
 	}
-- 
1.8.4.5



* Re: [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches
  2014-08-06 12:55 [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
                   ` (2 preceding siblings ...)
  2014-08-06 12:55 ` [PATCH 3/3] x86, fpu: Do not update fpu_counter in the eagerfpu case Mel Gorman
@ 2014-08-27 16:03 ` Mel Gorman
  2014-08-27 17:03   ` H. Peter Anvin
  3 siblings, 1 reply; 6+ messages in thread
From: Mel Gorman @ 2014-08-27 16:03 UTC (permalink / raw)
  To: H Peter Anvin, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML

On Wed, Aug 06, 2014 at 01:55:45PM +0100, Mel Gorman wrote:
> Eager FPU switching is used on CPUs that support xsave on the grounds
> that such CPUs can optimise the switch with xsaveopt and xrstor instead
> of paying the cost of updating CR0.TS, which has serialising semantics.
> 
> The path for eagerfpu is fatter than it needs to be because it still
> maintains the fpu_counter for lazy FPU switches even though the information
> is never used. This series splits the paths and optimises the eagerfpu path
> a little. The benefit is marginal; it was just noticed when looking at why
> integer-only workloads were spending time saving/restoring FPU state.
> 

This was initially sent when it would have collided with the merge window,
which was stupid timing. Nothing has actually changed since, but has anyone
had a chance to take a look at these patches?

Thanks

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches
  2014-08-27 16:03 ` [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
@ 2014-08-27 17:03   ` H. Peter Anvin
  0 siblings, 0 replies; 6+ messages in thread
From: H. Peter Anvin @ 2014-08-27 17:03 UTC (permalink / raw)
  To: Mel Gorman, Suresh Siddha; +Cc: Mike Galbraith, Linux-X86, LKML

On 08/27/2014 09:03 AM, Mel Gorman wrote:
> On Wed, Aug 06, 2014 at 01:55:45PM +0100, Mel Gorman wrote:
>> Eager FPU switching is used on CPUs that support xsave on the grounds
>> that such CPUs can optimise the switch with xsaveopt and xrstor instead
>> of paying the cost of updating CR0.TS, which has serialising semantics.
>>
>> The path for eagerfpu is fatter than it needs to be because it still
>> maintains the fpu_counter for lazy FPU switches even though the information
>> is never used. This series splits the paths and optimises the eagerfpu path
>> a little. The benefit is marginal; it was just noticed when looking at why
>> integer-only workloads were spending time saving/restoring FPU state.
>>
> 
> This was initially sent when it would have collided with the merge window,
> which was stupid timing. Nothing has actually changed since, but has anyone
> had a chance to take a look at these patches?
> 

Looking at it now.

	-hpa




end of thread, other threads:[~2014-08-27 17:03 UTC | newest]

Thread overview: 6+ messages
-- links below jump to the message on this page --
2014-08-06 12:55 [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
2014-08-06 12:55 ` [PATCH 1/3] x86, fpu: Do not copy fpu preload state Mel Gorman
2014-08-06 12:55 ` [PATCH 2/3] x86, fpu: Split FPU save state preparation into eagerfpu and !eagerfpu parts Mel Gorman
2014-08-06 12:55 ` [PATCH 3/3] x86, fpu: Do not update fpu_counter in the eagerfpu case Mel Gorman
2014-08-27 16:03 ` [PATCH 0/3] Reduce length of the eagerfpu path during x86 context switches Mel Gorman
2014-08-27 17:03   ` H. Peter Anvin
