* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 17:03 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

Gilles,

do you remember if this core-3.4 change was a performance optimization
or a necessary fix? Also, I don't yet understand why we need all the
#ifdefs except for the first one, which forces fpu.preload to 0.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Gilles Chanteperdrix @ 2013-02-06 17:09 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 06:03 PM, Jan Kiszka wrote:

> Gilles,
> 
> do you remember if this core-3.4 change was a performance optimization
> or a necessary fix? Also, I don't yet understand why we need all the
> #ifdefs except for the first one, which forces fpu.preload to 0.


It is a performance optimization: without it, we systematically hit the
maximum latency whenever the timer ticks during a context switch that
restores the FPU. Note that if you change that, you will probably break
-forge.
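To make the trade-off concrete, here is a minimal user-space model. It assumes the 3.4-era lazy-switching heuristic in which the incoming task's FPU state is preloaded during the switch once the task has used the FPU on more than five consecutive switches (roughly `tsk_used_math(new) && new->fpu_counter > 5`). The struct and both functions are hypothetical. Forcing preload to 0 keeps the restore out of the context switch itself; the task instead takes a #NM trap on its next FPU instruction.

```c
#include <stdbool.h>

/* Hypothetical per-task FPU bookkeeping; not the kernel's task_struct. */
struct task {
    bool used_math;        /* task has touched the FPU at least once */
    unsigned fpu_counter;  /* consecutive switches in which it used the FPU */
};

/* Vanilla-style decision: restore at switch time (preload) only for tasks
 * that keep using the FPU switch after switch. */
bool fpu_preload_linux(const struct task *next)
{
    return next->used_math && next->fpu_counter > 5;
}

/* I-pipe-style decision as described in the thread: never preload, so no
 * FPU restore happens inside the context switch; it is deferred to the
 * #NM trap raised on the task's next FPU instruction. */
bool fpu_preload_ipipe(const struct task *next)
{
    (void)next;
    return false;
}
```

For instance, a task with `fpu_counter == 10` is preloaded under the first rule but never under the second.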

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 17:33 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
> It is a performance optimization: without it, we systematically hit the
> maximum latency whenever the timer ticks during a context switch that
> restores the FPU. Note that if you change that, you will probably break
> -forge.

According to the Intel folks who introduced eagerfpu, xsave, or at least
xsaveopt (which I haven't implemented yet), is now faster than the
serializing clts/stts pair. On the other hand, the worst case is a full
SSE + AVX restore while the target RT task is not depending on the FPU.

Jan




* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Gilles Chanteperdrix @ 2013-02-06 17:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 06:33 PM, Jan Kiszka wrote:

> According to the Intel folks who introduced eagerfpu, xsave, or at least
> xsaveopt (which I haven't implemented yet), is now faster than the
> serializing clts/stts pair. On the other hand, the worst case is a full
> SSE + AVX restore while the target RT task is not depending on the FPU.


Without xsave, we never restore the FPU if the RT task never used it.
Does this change with xsave?

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 17:40 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
> Without xsave, we never restore the FPU if the RT task never used it.
> Does this change with xsave?

This would change with eagerfpu which depends on xsave. The kernel
sticks with lazy switching in the absence of xsaveopt.

From the log message of the related commit:

    Reasons driving this model change [Jan: eagerfpu] are:
    
    i. Newer processors support optimized state save/restore using xsaveopt and
    xrstor by tracking the INIT state and MODIFIED state during context-switch.
    This is faster than modifying the cr0.TS bit which has serializing semantics.
    
    ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
    With certain workloads (like boot, kernel-compilation etc), application
    completes its work with in the first 5 task switches, thus taking upto 5 #DNA
    traps with the kernel not getting a chance to apply the above mentioned
    pre-load heuristic.
    
    iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
    and thus will not work correctly in the presence of lazy restore. Non-lazy
    state restore is needed for enabling such features.
    
    Some data on a two socket SNB system:
     * Saved 20K DNA exceptions during boot on a two socket SNB system.
     * Saved 50K DNA exceptions during kernel-compilation workload.
     * Improved throughput of the AVX based checksumming function inside the
       kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts
       pair.

For a first 3.8 version, I guess I will simply force eagerfpu off at the
I-pipe level. We should then benchmark the current code against an
eagerfpu+xsaveopt-enabled version to decide.
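A sketch of the mode selection being discussed, assuming modes named after the on/off/auto choices of the kernel's `eagerfpu=` boot parameter and the rules stated above: eagerfpu depends on xsave, and by default the kernel only leaves lazy switching when xsaveopt is available. `use_eager_fpu` and its `ipipe_force_off` flag are hypothetical; the flag stands for the proposed I-pipe override, not for any existing option.

```c
#include <stdbool.h>

/* Modes named after the on/off/auto choices of the eagerfpu= parameter. */
enum eagerfpu_mode { EAGERFPU_AUTO, EAGERFPU_ON, EAGERFPU_OFF };

/* Returns true if eager FPU switching would be used under this model.
 * ipipe_force_off is a hypothetical flag for the proposed I-pipe override. */
bool use_eager_fpu(enum eagerfpu_mode mode, bool has_xsave,
                   bool has_xsaveopt, bool ipipe_force_off)
{
    if (ipipe_force_off || !has_xsave)
        return false;               /* override wins; eagerfpu needs xsave */

    switch (mode) {
    case EAGERFPU_ON:
        return true;
    case EAGERFPU_OFF:
        return false;
    case EAGERFPU_AUTO:
    default:
        return has_xsaveopt;        /* without xsaveopt, stay lazy */
    }
}
```

Putting the override first keeps the I-pipe decision independent of whatever the user passed on the command line.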

Jan




* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Gilles Chanteperdrix @ 2013-02-06 17:44 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 06:40 PM, Jan Kiszka wrote:

> This would change with eagerfpu which depends on xsave. The kernel
> sticks with lazy switching in the absence of xsaveopt.


I am not sure you understood what I meant, so let me rephrase.
Without xsave, Linux uses lazy FPU restore, and Xenomai uses eager FPU
restore. But Xenomai's eager FPU restore is a no-op if the RT task has
never used the FPU since its creation (and none of the parents it was
cloned from used the FPU either). Does Linux eager switching mean the
same thing?

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 17:47 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
> I am not sure you understood what I meant, so let me rephrase.
> Without xsave, Linux uses lazy FPU restore, and Xenomai uses eager FPU
> restore. But Xenomai's eager FPU restore is a no-op if the RT task has
> never used the FPU since its creation (and none of the parents it was
> cloned from used the FPU either). Does Linux eager switching mean the
> same thing?

eagerfpu means: always call xsaveopt/xrstor; the instructions themselves
optimize the case where the FPU was unused by the source/destination
task. And no more fiddling with TS, at any time.
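A toy model contrasting the two meanings of "eager" in this exchange. The bitmask, struct and functions are hypothetical; the only behaviour encoded is what is stated here: Xenomai skips the restore entirely for a task that never used the FPU, while Linux eagerfpu always issues xrstor and relies on the hardware tracking which state components are outside their init state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state-component bits used only by this model. */
#define XC_X87 (1u << 0)
#define XC_SSE (1u << 1)
#define XC_AVX (1u << 2)

/* Hypothetical per-task FPU view; not a kernel structure. */
struct task {
    bool fpu_used;   /* task, or an ancestor it was cloned from, used the FPU */
    uint32_t in_use; /* components currently outside their init state */
};

/* Xenomai-style eager switch: no FPU instruction at all is issued for a
 * task that has never used the FPU. */
bool xeno_issues_restore(const struct task *next)
{
    return next->fpu_used;
}

/* Linux eagerfpu: xrstor is always issued. Only components outside their
 * init state carry data to load; the others are just reset to init.
 * Returns the number of components that carry data. */
unsigned eagerfpu_components_loaded(const struct task *next)
{
    uint32_t m = next->in_use & (XC_X87 | XC_SSE | XC_AVX);
    unsigned n = 0;

    for (; m != 0; m &= m - 1) /* clear the lowest set bit per iteration */
        n++;
    return n;
}
```

For an FPU-free task, the first function reports no restore at all, while the second still implies an xrstor, just one with nothing to load.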

Jan




* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Gilles Chanteperdrix @ 2013-02-06 17:51 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 06:47 PM, Jan Kiszka wrote:

> eagerfpu means: always call xsaveopt/xrstor; the instructions themselves
> optimize the case where the FPU was unused by the source/destination
> task. And no more fiddling with TS, at any time.


I still do not understand this sentence then: "the worst case is a full
SSE + AVX restore while the target RT task is not depending on the FPU."
If the RT task does not depend on the FPU, why would xsaveopt/xrstor
restore SSE and AVX context?

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 18:26 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
> I still do not understand this sentence then: "the worst case is a full
> SSE + AVX restore while the target RT task is not depending on the FPU."
> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
> restore SSE and AVX context?

Switching between two tasks that both use the full state space defines
the maximum latency of the FPU save/restore step. We cannot interrupt
xsave or xrstor instructions, but we couldn't interrupt fxsave either.

What we can do, though, is ensure that there is at least a preemption
point between the two. Do we have such a thing so far, i.e. a chance to
handle a Xenomai IRQ between the FPU save for Linux task A and the FPU
restore for the following task B? If not, the discussion is moot and we
are just shifting the probabilities of the very same worst case.

Jan




* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Gilles Chanteperdrix @ 2013-02-06 18:31 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 07:26 PM, Jan Kiszka wrote:

> Switching between two tasks that both use the full state space defines
> the maximum latency of the FPU save/restore step. We cannot interrupt
> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>
> What we can do, though, is ensure that there is at least a preemption
> point between the two. Do we have such a thing so far, i.e. a chance to
> handle a Xenomai IRQ between the FPU save for Linux task A and the FPU
> restore for the following task B? If not, the discussion is moot and we
> are just shifting the probabilities of the very same worst case.


We can implement unlocked context switch support on x86, as we do on
other platforms. I actually tried that on Atom, and it did not really
improve latencies. You did not answer my question, though: why would
xsave/xrstor do anything if the RT thread has not used the FPU (and none
of its parents has used it either)?

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 18:35 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
> We can implement unlocked context switch support on x86, as we do on
> other platforms. I actually tried that on Atom, and it did not really
> improve latencies. You did not answer my question, though: why would
> xsave/xrstor do anything if the RT thread has not used the FPU (and none
> of its parents has used it either)?

First of all, we would have to wait for the unrelated switch between
those two Linux tasks to complete before we could handle the IRQ and
switch to the FPU-free RT task. __switch_to is atomic, also for
Linux->Linux switches, isn't it?

Jan



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 18:35                   ` Jan Kiszka
@ 2013-02-06 18:40                     ` Gilles Chanteperdrix
  2013-02-06 19:22                       ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Gilles Chanteperdrix @ 2013-02-06 18:40 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 07:35 PM, Jan Kiszka wrote:

> First of all, we would have to wait for the unrelated switch between
> those two Linux tasks to complete before we could handle the IRQ and
> switch to the FPU-free RT task. __switch_to is atomic, also for
> Linux->Linux switches, isn't it?


Only the *IP and *SP switch needs to be atomic; the whole __switch_to
can be split into several atomic sections, which is what I tested on
Atom. But as I said, it did not lead to any latency improvement.
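A back-of-the-envelope model of what splitting __switch_to buys, assuming a pending real-time IRQ can be taken at any boundary between atomic sections: the worst-case wait drops from the sum of all the switch work to the longest single section. `worst_irq_wait` is hypothetical and costs are arbitrary units; as noted above, the tighter bound did not translate into measured latency gains on Atom.

```c
#include <stddef.h>

/* Worst-case wait for a pending real-time IRQ when a context switch runs as
 * a sequence of non-interruptible sections with preemption points between
 * them: the IRQ is taken at the next boundary, so the bound is the longest
 * single section rather than the sum of all sections. Costs are in
 * arbitrary units; the IRQ handler's own run time is not modelled. */
unsigned worst_irq_wait(const unsigned *section_cost, size_t n)
{
    unsigned worst = 0;

    for (size_t i = 0; i < n; i++)
        if (section_cost[i] > worst)
            worst = section_cost[i];
    return worst;
}
```

With a monolithic switch costing 10 (FPU save for A) + 1 (IP/SP switch) + 30 (FPU restore for B) units, the bound is 41; split into `{ 10, 1, 30 }` it becomes 30, which is still dominated by the full restore.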

-- 
                                                                Gilles.



* Re: [Xenomai] ipipe/x86: do not restore during context switch
From: Jan Kiszka @ 2013-02-06 19:22 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 19:40, Gilles Chanteperdrix wrote:
> On 02/06/2013 07:35 PM, Jan Kiszka wrote:
> 
>> On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
>>> On 02/06/2013 07:26 PM, Jan Kiszka wrote:
>>>
>>>> On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
>>>>> On 02/06/2013 06:47 PM, Jan Kiszka wrote:
>>>>>
>>>>>> On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
>>>>>>> On 02/06/2013 06:40 PM, Jan Kiszka wrote:
>>>>>>>
>>>>>>>> On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
>>>>>>>>> On 02/06/2013 06:33 PM, Jan Kiszka wrote:
>>>>>>>>>
>>>>>>>>>> On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 02/06/2013 06:03 PM, Jan Kiszka wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>
>>>>>>>>>>>> do you remember if this core-3.4 change was a performance optimization
>>>>>>>>>>>> or a necessary fix? Also, I'm not yet understanding why we need all the
>>>>>>>>>>>> #ifdefs except for the first one which forces fpu.preload to 0.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It is a performance optimization, without it, we systematically hit the
>>>>>>>>>>> maximum latency when the timer would tick during a context switch which
>>>>>>>>>>> restores the FPU. Note that if you change that, you will probably break
>>>>>>>>>>> -forge.
>>>>>>>>>>
>>>>>>>>>> According to the Intel folks who introduced eagerfpu, xsave, or at least
>>>>>>>>>> xsaveopt (which I didn't implement yet) is now faster than serializing
>>>>>>>>>> clts/stts. On the other hand, the worst case is a full SSE + AVX restore
>>>>>>>>>> while the target RT task is not depending on the FPU.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without xsave, we never restore fpu if the RT task never used it. This
>>>>>>>>> changes with xsave?
>>>>>>>>
>>>>>>>> This would change with eagerfpu which depends on xsave. The kernel
>>>>>>>> sticks with lazy switching in the absence of xsaveopt.
>>>>>>>
>>>>>>>
>>>>>>> I am not sure you understand what I mean, so, I am going to reformulate.
>>>>>>> Without xsave, Linux uses lazy fpu restore, and Xenomai uses eager fpu
>>>>>>> restore. But Xenomai eager fpu restore is a nop if the RT task never
>>>>>>> used FPU since its inception (and all the parents from which it is
>>>>>>> cloned never used FPU either). Does Linux eager switching mean the same
>>>>>>> thing?
>>>>>>
>>>>>> eagerfpu means: always call xsaveopt/xrstor, it will optimize the case
>>>>>> that the FPU was unused by the source/destination. And no fiddling with
>>>>>> TS anymore, at no time.
>>>>>
>>>>>
>>>>> I still do not understand this sentence then: "the worst case is a full
>>>>> SSE + AVX restore while the target RT task is not depending on the FPU."
>>>>> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
>>>>> restore SSE and AVX context?
>>>>
>>>> Switching between two tasks that both use the full state space defines
>>>> the maximum latency of the FPU save/restore step. We cannot interrupt
>>>> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>>>>
>>>> What we can do, though, is to ensure that we have at least a preemption
>>>> point between both. Do we have such a thing so far, a chance to handle a
>>>> Xenomai IRQ between some FPU save for Linux task A and an FPU restore for
>>>> the following task B? If not, the discussion is moot and we are just
>>>> shifting probabilities of the very same worst case.
>>>
>>>
>>> We can implement unlocked context switch support on x86 as we do on
>>> other platforms. I tried that on atom actually and it did not really
>>> improve latencies. You do not answer my question though, why would
>>> xsave/xrstor do anything if the RT thread has not used FPU (and all its
>>> parents have not used fpu) ?
>>
>> We first of all would have to wait for the unrelated switch between
>> those two Linux tasks before we could handle the IRQ and switch to the
>> FPU-free RT task. __switch_to is atomic, also for Linux->Linux, no?
> 
> 
> Only the *IP and *SP switch need to be atomic, the whole __switch_to can
> be split in several atomic sections, this is what I tested on atom. But
> as I said, it did not lead to any latency improvement.

Ok, so back to the patch about which this discussion started: It
enforced that Linux only saves the FPU state on switches, never directly
restores it but enforces lazy restoring, right? To ensure that
save+restore for Linux tasks is always interruptible in the middle.
However, that sounds pretty expensive when applying FPU/SSE/etc. load on
Linux.

Instead of always doing stts for the new task, we could do the restore
later, after the hard_local_irq_enable of __ipipe_switch_tail. That
should allow the eager model for Linux as well without making
save+restore of Linux-Linux switches atomic.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 19:22                       ` Jan Kiszka
@ 2013-02-06 19:30                         ` Gilles Chanteperdrix
  2013-02-06 19:55                           ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Gilles Chanteperdrix @ 2013-02-06 19:30 UTC
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 08:22 PM, Jan Kiszka wrote:

> On 2013-02-06 19:40, Gilles Chanteperdrix wrote:
>> On 02/06/2013 07:35 PM, Jan Kiszka wrote:
>>
>>> On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
>>>> On 02/06/2013 07:26 PM, Jan Kiszka wrote:
>>>>
>>>>> On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
>>>>>> On 02/06/2013 06:47 PM, Jan Kiszka wrote:
>>>>>>
>>>>>>> On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
>>>>>>>> On 02/06/2013 06:40 PM, Jan Kiszka wrote:
>>>>>>>>
>>>>>>>>> On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
>>>>>>>>>> On 02/06/2013 06:33 PM, Jan Kiszka wrote:
>>>>>>>>>>
>>>>>>>>>>> On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
>>>>>>>>>>>> On 02/06/2013 06:03 PM, Jan Kiszka wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>
>>>>>>>>>>>>> do you remember if this core-3.4 change was a performance optimization
>>>>>>>>>>>>> or a necessary fix? Also, I'm not yet understanding why we need all the
>>>>>>>>>>>>> #ifdefs except for the first one which forces fpu.preload to 0.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> It is a performance optimization, without it, we systematically hit the
>>>>>>>>>>>> maximum latency when the timer would tick during a context switch which
>>>>>>>>>>>> restores the FPU. Note that if you change that, you will probably break
>>>>>>>>>>>> -forge.
>>>>>>>>>>>
>>>>>>>>>>> According to the Intel folks who introduced eagerfpu, xsave, or at least
>>>>>>>>>>> xsaveopt (which I didn't implement yet) is now faster than serializing
>>>>>>>>>>> clts/stts. On the other hand, the worst case is a full SSE + AVX restore
>>>>>>>>>>> while the target RT task is not depending on the FPU.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Without xsave, we never restore fpu if the RT task never used it. This
>>>>>>>>>> changes with xsave?
>>>>>>>>>
>>>>>>>>> This would change with eagerfpu which depends on xsave. The kernel
>>>>>>>>> sticks with lazy switching in the absence of xsaveopt.
>>>>>>>>
>>>>>>>>
>>>>>>>> I am not sure you understand what I mean, so, I am going to reformulate.
>>>>>>>> Without xsave, Linux uses lazy fpu restore, and Xenomai uses eager fpu
>>>>>>>> restore. But Xenomai eager fpu restore is a nop if the RT task never
>>>>>>>> used FPU since its inception (and all the parents from which it is
>>>>>>>> cloned never used FPU either). Does Linux eager switching mean the same
>>>>>>>> thing?
>>>>>>>
>>>>>>> eagerfpu means: always call xsaveopt/xrstor, it will optimize the case
>>>>>>> that the FPU was unused by the source/destination. And no fiddling with
>>>>>>> TS anymore, at no time.
>>>>>>
>>>>>>
>>>>>> I still do not understand this sentence then: "the worst case is a full
>>>>>> SSE + AVX restore while the target RT task is not depending on the FPU."
>>>>>> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
>>>>>> restore SSE and AVX context?
>>>>>
>>>>> Switching between two tasks that both use the full state space defines
>>>>> the maximum latency of the FPU save/restore step. We cannot interrupt
>>>>> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>>>>>
>>>>> What we can do, though, is to ensure that we have at least a preemption
>>>>> point between both. Do we have such a thing so far, a chance to handle a
>>>>> Xenomai IRQ between some FPU save for Linux task A and an FPU restore for
>>>>> the following task B? If not, the discussion is moot and we are just
>>>>> shifting probabilities of the very same worst case.
>>>>
>>>>
>>>> We can implement unlocked context switch support on x86 as we do on
>>>> other platforms. I tried that on atom actually and it did not really
>>>> improve latencies. You do not answer my question though, why would
>>>> xsave/xrstor do anything if the RT thread has not used FPU (and all its
>>>> parents have not used fpu) ?
>>>
>>> We first of all would have to wait for the unrelated switch between
>>> those two Linux tasks before we could handle the IRQ and switch to the
>>> FPU-free RT task. __switch_to is atomic, also for Linux->Linux, no?
>>
>>
>> Only the *IP and *SP switch need to be atomic, the whole __switch_to can
>> be split in several atomic sections, this is what I tested on atom. But
>> as I said, it did not lead to any latency improvement.
> 
> Ok, so back to the patch about which this discussion started: It
> enforced that Linux only saves the FPU state on switches, never directly
> restores it but enforces lazy restoring, right? To ensure that
> save+restore for Linux tasks is always interruptible in the middle.
> However, that sounds pretty expensive when applying FPU/SSE/etc. load on
> Linux.


On the contrary, the overhead is the cost of the fault (with the
user/kernel and kernel/user switches), so, the larger the context
switch, the smaller the overhead in proportion.

> 
> Instead of always doing stts for the new task, we could do the restore
> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
> should allow the eager model for Linux as well without making
> save+restore of Linux-Linux switches atomic.


That could be done, but it is probably simpler to implement unlocked
context switch, and split __switch_to into several atomic sections.
Anyway, any change in this area will probably break the work done for
kthreads on -forge, so, can't we postpone this?

-- 
                                                                Gilles.


* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 19:30                         ` Gilles Chanteperdrix
@ 2013-02-06 19:55                           ` Jan Kiszka
  2013-02-06 20:03                             ` Gilles Chanteperdrix
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2013-02-06 19:55 UTC
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 20:30, Gilles Chanteperdrix wrote:
> On 02/06/2013 08:22 PM, Jan Kiszka wrote:
> 
>> On 2013-02-06 19:40, Gilles Chanteperdrix wrote:
>>> On 02/06/2013 07:35 PM, Jan Kiszka wrote:
>>>
>>>> On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
>>>>> On 02/06/2013 07:26 PM, Jan Kiszka wrote:
>>>>>
>>>>>> On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
>>>>>>> On 02/06/2013 06:47 PM, Jan Kiszka wrote:
>>>>>>>
>>>>>>>> On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
>>>>>>>>> On 02/06/2013 06:40 PM, Jan Kiszka wrote:
>>>>>>>>>
>>>>>>>>>> On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 02/06/2013 06:33 PM, Jan Kiszka wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>> On 02/06/2013 06:03 PM, Jan Kiszka wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> do you remember if this core-3.4 change was a performance optimization
>>>>>>>>>>>>>> or a necessary fix? Also, I'm not yet understanding why we need all the
>>>>>>>>>>>>>> #ifdefs except for the first one which forces fpu.preload to 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is a performance optimization, without it, we systematically hit the
>>>>>>>>>>>>> maximum latency when the timer would tick during a context switch which
>>>>>>>>>>>>> restores the FPU. Note that if you change that, you will probably break
>>>>>>>>>>>>> -forge.
>>>>>>>>>>>>
>>>>>>>>>>>> According to the Intel folks who introduced eagerfpu, xsave, or at least
>>>>>>>>>>>> xsaveopt (which I didn't implement yet) is now faster than serializing
>>>>>>>>>>>> clts/stts. On the other hand, the worst case is a full SSE + AVX restore
>>>>>>>>>>>> while the target RT task is not depending on the FPU.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Without xsave, we never restore fpu if the RT task never used it. This
>>>>>>>>>>> changes with xsave?
>>>>>>>>>>
>>>>>>>>>> This would change with eagerfpu which depends on xsave. The kernel
>>>>>>>>>> sticks with lazy switching in the absence of xsaveopt.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure you understand what I mean, so, I am going to reformulate.
>>>>>>>>> Without xsave, Linux uses lazy fpu restore, and Xenomai uses eager fpu
>>>>>>>>> restore. But Xenomai eager fpu restore is a nop if the RT task never
>>>>>>>>> used FPU since its inception (and all the parents from which it is
>>>>>>>>> cloned never used FPU either). Does Linux eager switching mean the same
>>>>>>>>> thing?
>>>>>>>>
>>>>>>>> eagerfpu means: always call xsaveopt/xrstor, it will optimize the case
>>>>>>>> that the FPU was unused by the source/destination. And no fiddling with
>>>>>>>> TS anymore, at no time.
>>>>>>>
>>>>>>>
>>>>>>> I still do not understand this sentence then: "the worst case is a full
>>>>>>> SSE + AVX restore while the target RT task is not depending on the FPU."
>>>>>>> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
>>>>>>> restore SSE and AVX context?
>>>>>>
>>>>>> Switching between two tasks that both use the full state space defines
>>>>>> the maximum latency of the FPU save/restore step. We cannot interrupt
>>>>>> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>>>>>>
>>>>>> What we can do, though, is to ensure that we have at least a preemption
>>>>>> point between both. Do we have such a thing so far, a chance to handle a
>>>>>> Xenomai IRQ between some FPU save for Linux task A and an FPU restore for
>>>>>> the following task B? If not, the discussion is moot and we are just
>>>>>> shifting probabilities of the very same worst case.
>>>>>
>>>>>
>>>>> We can implement unlocked context switch support on x86 as we do on
>>>>> other platforms. I tried that on atom actually and it did not really
>>>>> improve latencies. You do not answer my question though, why would
>>>>> xsave/xrstor do anything if the RT thread has not used FPU (and all its
>>>>> parents have not used fpu) ?
>>>>
>>>> We first of all would have to wait for the unrelated switch between
>>>> those two Linux tasks before we could handle the IRQ and switch to the
>>>> FPU-free RT task. __switch_to is atomic, also for Linux->Linux, no?
>>>
>>>
>>> Only the *IP and *SP switch need to be atomic, the whole __switch_to can
>>> be split in several atomic sections, this is what I tested on atom. But
>>> as I said, it did not lead to any latency improvement.
>>
>> Ok, so back to the patch about which this discussion started: It
>> enforced that Linux only saves the FPU state on switches, never directly
>> restores it but enforces lazy restoring, right? To ensure that
>> save+restore for Linux tasks is always interruptible in the middle.
>> However, that sounds pretty expensive when applying FPU/SSE/etc. load on
>> Linux.
> 
> 
> On the contrary, the overhead is the cost of the fault (with the
> user/kernel and kernel/user switches), so, the larger the context
> switch, the smaller the overhead in proportion.

Yes, continuously faulting in FPU states of heavy Linux users is the
problem. That must be changed.

> 
>>
>> Instead of always doing stts for the new task, we could do the restore
>> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
>> should allow the eager model for Linux as well without making
>> save+restore of Linux-Linux switches atomic.
> 
> 
> That could be done, but it is probably simpler to implement unlocked
> context switch, and split __switch_to into several atomic sections.

Yep, indeed.

> Anyway, any change in this area will probably break the work done for
> kthreads on -forge, so, can't we postpone this?

For how long? What are the dependencies? I thought unlocked context
switches already exist for other archs.

At least I will need to look into this internally - we are using less
than 10% of our CPUs for RT, the rest wants high performance.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 19:55                           ` Jan Kiszka
@ 2013-02-06 20:03                             ` Gilles Chanteperdrix
  2013-02-06 20:17                               ` Jan Kiszka
  0 siblings, 1 reply; 18+ messages in thread
From: Gilles Chanteperdrix @ 2013-02-06 20:03 UTC
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 08:55 PM, Jan Kiszka wrote:

>> On the contrary, the overhead is the cost of the fault (with the
>> user/kernel and kernel/user switches), so, the larger the context
>> switch, the smaller the overhead in proportion.
> 
> Yes, continuously faulting in FPU states of heavy Linux users is the
> problem. That must be changed.


We are talking about x86 here, so the cost of the FPU fault is not that heavy.

>>> Instead of always doing stts for the new task, we could do the restore
>>> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
>>> should allow the eager model for Linux as well without making
>>> save+restore of Linux-Linux switches atomic.
>>
>>
>> That could be done, but it is probably simpler to implement unlocked
>> context switch, and split __switch_to into several atomic sections.
> 
> Yep, indeed.
> 
>> Anyway, any change in this area will probably break the work done for
>> kthreads on -forge, so, can't we postpone this?
> 
> For how long? What are the dependencies?


For the time it takes to validate FPU on kthreads with -forge.

> I thought unlocked context
> switches already exist for other archs.


That is not an issue, indeed.

> 
> At least I will need to look into this internally - we are using less
> than 10% of our CPUs for RT, the rest wants high performance.


Are you sure this is not a priori optimization of something which is not
really an issue?

-- 
                                                                Gilles.


* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 20:03                             ` Gilles Chanteperdrix
@ 2013-02-06 20:17                               ` Jan Kiszka
  2013-02-06 20:20                                 ` Gilles Chanteperdrix
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kiszka @ 2013-02-06 20:17 UTC
  To: Gilles Chanteperdrix; +Cc: Xenomai

On 2013-02-06 21:03, Gilles Chanteperdrix wrote:
> On 02/06/2013 08:55 PM, Jan Kiszka wrote:
> 
>>> On the contrary, the overhead is the cost of the fault (with the
>>> user/kernel and kernel/user switches), so, the larger the context
>>> switch, the smaller the overhead in proportion.
>>
>> Yes, continuously faulting in FPU states of heavy Linux users is the
>> problem. That must be changed.
> 
> 
> We are talking about x86 here, so the cost of the FPU fault is not that heavy.
> 
>>>> Instead of always doing stts for the new task, we could do the restore
>>>> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
>>>> should allow the eager model for Linux as well without making
>>>> save+restore of Linux-Linux switches atomic.
>>>
>>>
>>> That could be done, but it is probably simpler to implement unlocked
>>> context switch, and split __switch_to into several atomic sections.
>>
>> Yep, indeed.
>>
>>> Anyway, any change in this area will probably break the work done for
>>> kthreads on -forge, so, can't we postpone this?
>>
>> For how long? What are the dependencies?
> 
> 
> For the time it takes to validate FPU on kthreads with -forge.
> 
>> I thought unlocked context
>> switches already exist for other archs.
> 
> 
> That is not an issue, indeed.
> 
>>
>> At least I will need to look into this internally - we are using less
>> than 10% of our CPUs for RT, the rest wants high performance.
> 
> 
> Are you sure this is not a priori optimization of something which is not
> really an issue?

Of course, it depends on the context switch rate. We have to measure,
but I wouldn't be surprised to see high numbers. And the FPU is used by
everything today.

My colleague recently measured that (on an older standard kernel)
accelerated disk encryption was slightly slower than the unaccelerated one -
due to the overhead of FPU context switches. I suppose we have a nice
benchmark in that scenario (dd on an encrypted disk)...

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


* Re: [Xenomai] ipipe/x86: do not restore during context switch
  2013-02-06 20:17                               ` Jan Kiszka
@ 2013-02-06 20:20                                 ` Gilles Chanteperdrix
  0 siblings, 0 replies; 18+ messages in thread
From: Gilles Chanteperdrix @ 2013-02-06 20:20 UTC
  To: Jan Kiszka; +Cc: Xenomai

On 02/06/2013 09:17 PM, Jan Kiszka wrote:

> On 2013-02-06 21:03, Gilles Chanteperdrix wrote:
>> On 02/06/2013 08:55 PM, Jan Kiszka wrote:
>>
>>>> On the contrary, the overhead is the cost of the fault (with the
>>>> user/kernel and kernel/user switches), so, the larger the context
>>>> switch, the smaller the overhead in proportion.
>>>
>>> Yes, continuously faulting in FPU states of heavy Linux users is the
>>> problem. That must be changed.
>>
>>
>> We are talking about x86 here, so the cost of the FPU fault is not that heavy.
>>
>>>>> Instead of always doing stts for the new task, we could do the restore
>>>>> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
>>>>> should allow the eager model for Linux as well without making
>>>>> save+restore of Linux-Linux switches atomic.
>>>>
>>>>
>>>> That could be done, but it is probably simpler to implement unlocked
>>>> context switch, and split __switch_to into several atomic sections.
>>>
>>> Yep, indeed.
>>>
>>>> Anyway, any change in this area will probably break the work done for
>>>> kthreads on -forge, so, can't we postpone this?
>>>
>>> For how long? What are the dependencies?
>>
>>
>> For the time it takes to validate FPU on kthreads with -forge.
>>
>>> I thought unlocked context
>>> switches already exist for other archs.
>>
>>
>> That is not an issue, indeed.
>>
>>>
>>> At least I will need to look into this internally - we are using less
>>> than 10% of our CPUs for RT, the rest wants high performance.
>>
>>
>> Are you sure this is not a priori optimization of something which is not
>> really an issue?
> 
> Of course, it depends on the context switch rate. We have to measure,
> but I wouldn't be surprised to see high numbers. And the FPU is used by
> everything today.
> 
> My colleague recently measured that (on an older standard kernel)
> accelerated disk encryption was slightly slower than the unaccelerated one -
> due to the overhead of FPU context switches. I suppose we have a nice
> benchmark in that scenario (dd on an encrypted disk)...


You are talking about kernel_fpu_begin()/kernel_fpu_end() here; this is a
completely different issue. If Xenomai interrupts Linux in the middle of
such a section, the FPU context is restored eagerly when switching back
to Linux, so there is absolutely nothing more we can do in this area.

-- 
                                                                Gilles.


end of thread

Thread overview: 18+ messages
2013-02-06 17:03 [Xenomai] ipipe/x86: do not restore during context switch Jan Kiszka
2013-02-06 17:09 ` Gilles Chanteperdrix
2013-02-06 17:33   ` Jan Kiszka
2013-02-06 17:35     ` Gilles Chanteperdrix
2013-02-06 17:40       ` Jan Kiszka
2013-02-06 17:44         ` Gilles Chanteperdrix
2013-02-06 17:47           ` Jan Kiszka
2013-02-06 17:51             ` Gilles Chanteperdrix
2013-02-06 18:26               ` Jan Kiszka
2013-02-06 18:31                 ` Gilles Chanteperdrix
2013-02-06 18:35                   ` Jan Kiszka
2013-02-06 18:40                     ` Gilles Chanteperdrix
2013-02-06 19:22                       ` Jan Kiszka
2013-02-06 19:30                         ` Gilles Chanteperdrix
2013-02-06 19:55                           ` Jan Kiszka
2013-02-06 20:03                             ` Gilles Chanteperdrix
2013-02-06 20:17                               ` Jan Kiszka
2013-02-06 20:20                                 ` Gilles Chanteperdrix
