* Dovetail <-> PREEMPT_RT hybridization
@ 2020-07-20 20:47 Philippe Gerum
  2020-07-20 22:44 ` Paul
  2020-07-21  5:26 ` Meng, Fino
  0 siblings, 2 replies; 13+ messages in thread
From: Philippe Gerum @ 2020-07-20 20:47 UTC (permalink / raw)
  To: Evl; +Cc: Xenomai


FWIW, I'm investigating the opportunity of rebasing Dovetail - and therefore
the EVL core - on the PREEMPT_RT code base, ahead of the final integration of
the latter into the mainline kernel tree. At the same time, the goal would be
to leverage the improvements brought by native preemption with respect to
fine-grained interrupt protection, while keeping the alternate scheduling [1]
feature, which still exhibits significantly shorter preemption times and much
lower jitter than what is - at least currently - achievable with a plain
PREEMPT_RT kernel under meaningful stress.

With such hybridization, the Dovetail implementation should be even simpler.
Companion cores based on it could run on the out-of-band execution stage
unimpeded by other forms of preemption disabling in the in-band kernel (e.g.
locks). This would preserve the most significant advantage of the pipelining
model when it comes to reliable response times for applications at a modest
processing cost by a lightweight real-time infrastructure.

This work entails porting the latest Dovetail code base I have been working
on from 5.8 back to 5.6-rt, since this is the most recent public release of
PREEMPT_RT so far. In addition, interrupt-free sections deemed too long for
the companion core on top to cope with need to be identified in the target
PREEMPT_RT release, so that they can be mitigated (see below for an
explanation of how this could be done). In the future, a way to automate this
search should be found, since spotting these sections is likely going to be
the boring task to carry out each time this new Dovetail implementation is
ported to the next PREEMPT_RT release. Add to this the truckload of other
tricky issues I may have overlooked.

If anyone is interested in participating in this work, let me know. I cannot
guarantee success, but the data I have collected over time with both the dual
kernel and native preemption models leaves me optimistic about the outcome if
they are combined the right way.

-- Nitty-gritty details about why and how to do this

Those acquainted with the interrupt pipelining technique Dovetail implements
[2] may already know that decoupling the interrupt mask flag as perceived by
the CPU from the one perceived by the kernel induces a number of tricky
issues. We want interrupts to be unmasked in the CPU as long as possible while
the kernel runs; to this end, the local_irq_*() helpers are switched to a
software-based implementation which virtualizes the interrupt mask as
perceived by the kernel (aka the "stall bit"), while leaving interrupts
enabled in the CPU, postponing the delivery of IRQs blocked by the virtual
mask until the kernel accepts them again. This is a plain simple
log-if-blocked-then-replay-when-unblocked game [3].
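
For illustration, the virtualized helpers boil down to something like this -
a rough sketch with made-up names, not the actual Dovetail code:

#include <linux/percpu.h>
#include <linux/irq.h>

static DEFINE_PER_CPU(int, virq_stalled);		/* the "stall bit" */
static DEFINE_PER_CPU(unsigned long, virq_pending);	/* deferred IRQ log */

static inline void virt_local_irq_disable(void)
{
	/* The kernel now believes IRQs are off; the CPU keeps taking them. */
	raw_cpu_write(virq_stalled, 1);
}

static inline void virt_local_irq_enable(void)
{
	raw_cpu_write(virq_stalled, 0);
	replay_pending_irqs();	/* hypothetical: deliver the logged IRQs */
}

/* Real IRQ entry path: log if blocked, deliver otherwise. */
static void pipeline_inband_irq(unsigned int irq)
{
	if (raw_cpu_read(virq_stalled))
		__set_bit(irq, raw_cpu_ptr(&virq_pending));
	else
		generic_handle_irq(irq);
}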

However, we also have to synchronize the hardware and software interrupt
masks in a few specific places of the kernel in order to keep some hardware
and software logic happy. Two examples come to mind; there are more:

- hardware-wise, we want updates to some registers to remain fully atomic
despite the fact that interrupt pipelining is in effect. For arm64, we have to
ensure that updates to the translation table registers (TTBRs) cannot be
preempted; likewise for updates to the CR4 register on x86, which is notably
used during TLB management. In both cases, we have to locally revert/override
the changes Dovetail implicitly made, by re-introducing CPU-based forms of
interrupt disabling instead of the software-based one (see the sketch after
this list).

- software-wise, keeping the LOCKDEP logic usable in a pipelined system
requires fixing up the virtual interrupt mask at the boundaries between
kernel and user mode, so that it properly reflects what the locking validation
engine expects at all times. This has been the most time-consuming work in a
number of Dovetail upgrades to recent kernel releases, 5.8-rc included.
Besides, I'm still not happy with the way this is done, which looks like
playing whack-a-mole to some extent.
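
For instance, on arm64, reverting to CPU-based masking around a TTBR update
could be sketched this way, assuming hard_local_irq_save/restore denote the
CPU-level counterparts of the virtualized helpers:

static void set_ttbr0(unsigned long new_ttbr0)
{
	unsigned long flags;

	flags = hard_local_irq_save();	/* really mask IRQs in the CPU */
	write_sysreg(new_ttbr0, ttbr0_el1); /* atomic wrt any pipelined IRQ */
	isb();				/* make the update visible */
	hard_local_irq_restore(flags);
}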

Many of these issues are hard to identify, some may not be trivial to address
(LOCKDEP support can become really ugly in this respect). Several other
sub-systems like CPU idleness and power management have similar requirements
for particular code paths.

Now, we may have another option for gaining fine-grained interrupt
protection, which would build on the relentless work the PREEMPT_RT folks
have done to shrink the interrupt-free sections in the kernel code to the
bare minimum acceptable for native preemption, mainly by threading IRQs and
introducing sleeping locks.

Instead of systematizing the virtualization of the local_irq_*() helpers, we
could switch them back to their original - hardware-based - behavior, manually
adding controlled mask-breaking statements to any remaining problematic code
path. Such a statement would enable interrupts in the CPU while blocking them
for the in-band kernel, using a local, non-pervasive variant of the current
interrupt pipeline.

Within those long interrupt-free sections created by the in-band code, the
companion core would nevertheless be allowed to process pending interrupts
immediately, while maintaining the interrupt protection for the in-band
kernel. Identifying these sections, so as to enable the out-of-band code to
preempt locally, should be a matter of properly using the irqsoff tracer,
provided the trace_hardirqs* instrumentation is correct.

e.g. roughly sketching a possible use case:

__schedule()
lock(rq) /* hard irqs off */
...
context_switch()
	switch_mm()
	switch_to()
...
unlock(rq) /* hard irqs on */

The interrupt-free section above could amount to tens of microseconds on
armv7 under significant pressure (especially with a sluggish L2 outer cache)
and would prevent the out-of-band (companion) core from preempting in the
meantime. To address this, switching the virtual interrupt state could be
done manually by some dedicated service, say "oob_synchronize()", which would
first stall the in-band stage to keep the code interrupt-free in-band wise,
then allow any pending hard IRQ to be taken by toggling the CPU mask flag,
some of which the companion core would possibly handle. Other IRQs meant for
the in-band code would have to wait in a deferred interrupt log until hard
IRQs are generally re-enabled later on, which is what happens today with the
common pipelining technique on a broader scope.

__schedule()
lock(rq) /* hard irqs off */
...
context_switch()
	switch_mm()
	cond_sync_oob(); /* pending IRQs are synchronized for oob only */
	switch_to()
...
unlock(rq) /* hard irqs on */
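
The semantics of such a service could be sketched as follows - hypothetical
helper names, only to illustrate the intended logic:

static inline void cond_sync_oob(void)
{
	/*
	 * Keep the section interrupt-free in-band wise: the in-band
	 * stage is stalled before the CPU is re-opened.
	 */
	stall_inband();
	hard_local_irq_enable();	/* pending oob IRQs are taken here;
					   in-band ones only reach the log */
	hard_local_irq_disable();
	/*
	 * The deferred in-band log is played later on, when the outer
	 * section unstalls the stage and re-enables hard IRQs.
	 */
}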

Ideally, switch_mm() should allow out-of-band IRQs to flow normally while
changing the memory context for in-band tasks - we once had that for armv4/5
in the early days of the I-pipe, but doing this properly in current kernels
would require non-trivial magic. So maybe later, once all the rest is
functional.

Congrats if you read up to there. Comments welcome as usual.

[1] https://evlproject.org/dovetail/altsched/
[2]
https://www.usenix.org/legacy/publications/library/proceedings/micro93/full_papers/stodolsky.txt
[3] https://evlproject.org/dovetail/pipeline/#virtual-i-flag

-- 
Philippe.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-20 20:47 Dovetail <-> PREEMPT_RT hybridization Philippe Gerum
@ 2020-07-20 22:44 ` Paul
  2020-07-21  8:18   ` Philippe Gerum
  2020-07-21  5:26 ` Meng, Fino
  1 sibling, 1 reply; 13+ messages in thread
From: Paul @ 2020-07-20 22:44 UTC (permalink / raw)
  To: Philippe Gerum via Xenomai; +Cc: Philippe Gerum, Evl

On Mon, 20 Jul 2020 22:47:29 +0200
Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:

> 
> Congrats if you read up to there. Comments welcome as usual.
> 
> [1] https://evlproject.org/dovetail/altsched/

Getting a "Forbidden, You don't have permission to access this
resource" error for that URL...



* RE: Dovetail <-> PREEMPT_RT hybridization
  2020-07-20 20:47 Dovetail <-> PREEMPT_RT hybridization Philippe Gerum
  2020-07-20 22:44 ` Paul
@ 2020-07-21  5:26 ` Meng, Fino
  2020-07-21 17:18   ` Philippe Gerum
  1 sibling, 1 reply; 13+ messages in thread
From: Meng, Fino @ 2020-07-21  5:26 UTC (permalink / raw)
  To: Philippe Gerum, Evl; +Cc: Xenomai (xenomai@xenomai.org)


>Sent: Tuesday, July 21, 2020 4:47 AM
>
>FWIW, I'm investigating the opportunity for rebasing Dovetail - and therefore the EVL core - on the PREEMPT_RT code base,
>ahead of the final integration of the latter into the mainline kernel tree. In the same move, the goal would be to leverage the
>improvements brought by native preemption with respect to fine-grained interrupt protection, while keeping the alternate
>scheduling [1] feature, which still exhibits significantly shorter preemption times and much cleaner jitter compared to what
>is - at least currently - achievable with a plain PREEMPT_RT kernel under meaningful stress.
>
>With such hybridization, the Dovetail implementation should be even simpler.
>Companion cores based on it could run on the out-of-band execution stage unimpeded by other forms of preemption
>disabling in the in-band kernel (e.g.
>locks). This would preserve the most significant advantage of the pipelining model when it comes to reliable response times
>for applications at a modest processing cost by a lightweight real-time infrastructure.
>
>This work entails porting the latest Dovetail code base I have been working on lately from 5.8 back to 5.6-rt, since this is the
>most recent public release of PREEMPT_RT so far. In addition, interrupt-free sections which are deemed too long for the
>companion core on top to cope with, need to be identified in the target PREEMPT_RT release so that they could be mitigated
>(see below for an explanation about how it could be done). In the future, a way to automate such research should be
>looked for, since finding these spots is likely going to be the boring task to carry out each time this new Dovetail
>implementation is ported to the next PREEMPT_RT release. Plus, the truckload of other tricky issues I may have overlooked.
>
>If anyone is interested in participating in this work, let me know. I cannot guarantee success, but the data I have collected
>over time with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are
>combined the right way.

Hi Philippe,

I would like to participate. One of the motivations is that the TSN stack is
now part of Preempt-RT Linux. Some time ago we discussed a similar idea with
Jan: patch I-pipe/Xenomai onto the Preempt-RT kernel instead of the vanilla
kernel, then pin Cobalt threads and Preempt-RT's RT threads to different
cores.

BR / Fino (孟祥夫)
Intel – IOTG Developer Enabling


* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-20 22:44 ` Paul
@ 2020-07-21  8:18   ` Philippe Gerum
  2020-07-21  8:39     ` Paul
  0 siblings, 1 reply; 13+ messages in thread
From: Philippe Gerum @ 2020-07-21  8:18 UTC (permalink / raw)
  To: Paul, Philippe Gerum via Xenomai; +Cc: Evl

On 7/21/20 12:44 AM, Paul wrote:
> On Mon, 20 Jul 2020 22:47:29 +0200
> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:
> 
>>
>> Congrats if you read up to there. Comments welcome as usual.
>>
>> [1] https://evlproject.org/dovetail/altsched/
> 
> Getting a "Forbidden, You don't have permission to access this
> resource" error for that URL...
> 

Is this a general issue with the site or can you access the top page at
https://evlproject.org for instance?

-- 
Philippe.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  8:18   ` Philippe Gerum
@ 2020-07-21  8:39     ` Paul
  2020-07-21  9:25       ` Philippe Gerum
  0 siblings, 1 reply; 13+ messages in thread
From: Paul @ 2020-07-21  8:39 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Philippe Gerum via Xenomai, Evl

On Tue, 21 Jul 2020 10:18:01 +0200
Philippe Gerum <rpm@xenomai.org> wrote:

> On 7/21/20 12:44 AM, Paul wrote:
> > On Mon, 20 Jul 2020 22:47:29 +0200
> > Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:
> > 
> >>
> >> Congrats if you read up to there. Comments welcome as usual.
> >>
> >> [1] https://evlproject.org/dovetail/altsched/
> > 
> > Getting a "Forbidden, You don't have permission to access this
> > resource" error for that URL...
> > 
> 
> Is this a general issue with the site or can you access the top page
> at https://evlproject.org for instance?
> 

I can access the rest of the site including much of the Dovetail docs.
Just the altsched page is out of bounds.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  8:39     ` Paul
@ 2020-07-21  9:25       ` Philippe Gerum
  2020-07-21  9:43         ` Paul
  0 siblings, 1 reply; 13+ messages in thread
From: Philippe Gerum @ 2020-07-21  9:25 UTC (permalink / raw)
  To: Paul; +Cc: Philippe Gerum via Xenomai, Evl

On 7/21/20 10:39 AM, Paul wrote:
> On Tue, 21 Jul 2020 10:18:01 +0200
> Philippe Gerum <rpm@xenomai.org> wrote:
> 
>> On 7/21/20 12:44 AM, Paul wrote:
>>> On Mon, 20 Jul 2020 22:47:29 +0200
>>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:
>>>
>>>>
>>>> Congrats if you read up to there. Comments welcome as usual.
>>>>
>>>> [1] https://evlproject.org/dovetail/altsched/
>>>
>>> Getting a "Forbidden, You don't have permission to access this
>>> resource" error for that URL...
>>>
>>
>> Is this a general issue with the site or can you access the top page
>> at https://evlproject.org for instance?
>>
> 
> I can access the rest of the site including much of the Dovetail docs.
> Just the altsched page is out of bounds.
> 

Can you try again, forcing a cache reload in your browser?

Thanks,

-- 
Philippe.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  9:25       ` Philippe Gerum
@ 2020-07-21  9:43         ` Paul
  2020-07-21  9:46           ` Philippe Gerum
  0 siblings, 1 reply; 13+ messages in thread
From: Paul @ 2020-07-21  9:43 UTC (permalink / raw)
  To: Philippe Gerum; +Cc: Philippe Gerum via Xenomai, Evl

On Tue, 21 Jul 2020 11:25:43 +0200
Philippe Gerum <rpm@xenomai.org> wrote:

> On 7/21/20 10:39 AM, Paul wrote:
> > On Tue, 21 Jul 2020 10:18:01 +0200
> > Philippe Gerum <rpm@xenomai.org> wrote:
> > 
> >> On 7/21/20 12:44 AM, Paul wrote:
> >>> On Mon, 20 Jul 2020 22:47:29 +0200
> >>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:
> >>>
> >>>>
> >>>> Congrats if you read up to there. Comments welcome as usual.
> >>>>
> >>>> [1] https://evlproject.org/dovetail/altsched/
> >>>
> >>> Getting a "Forbidden, You don't have permission to access this
> >>> resource" error for that URL...
> >>>
> >>
> >> Is this a general issue with the site or can you access the top
> >> page at https://evlproject.org for instance?
> >>
> > 
> > I can access the rest of the site including much of the Dovetail
> > docs. Just the altsched page is out of bounds.
> > 
> 
> Can you try again, forcing a cache reload in your browser?

That's fixed it ;-)




* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  9:43         ` Paul
@ 2020-07-21  9:46           ` Philippe Gerum
  0 siblings, 0 replies; 13+ messages in thread
From: Philippe Gerum @ 2020-07-21  9:46 UTC (permalink / raw)
  To: Paul; +Cc: Philippe Gerum via Xenomai, Evl

On 7/21/20 11:43 AM, Paul wrote:
> On Tue, 21 Jul 2020 11:25:43 +0200
> Philippe Gerum <rpm@xenomai.org> wrote:
> 
>> On 7/21/20 10:39 AM, Paul wrote:
>>> On Tue, 21 Jul 2020 10:18:01 +0200
>>> Philippe Gerum <rpm@xenomai.org> wrote:
>>>
>>>> On 7/21/20 12:44 AM, Paul wrote:
>>>>> On Mon, 20 Jul 2020 22:47:29 +0200
>>>>> Philippe Gerum via Xenomai <xenomai@xenomai.org> wrote:
>>>>>
>>>>>>
>>>>>> Congrats if you read up to there. Comments welcome as usual.
>>>>>>
>>>>>> [1] https://evlproject.org/dovetail/altsched/
>>>>>
>>>>> Getting a "Forbidden, You don't have permission to access this
>>>>> resource" error for that URL...
>>>>>
>>>>
>>>> Is this a general issue with the site or can you access the top
>>>> page at https://evlproject.org for instance?
>>>>
>>>
>>> I can access the rest of the site including much of the Dovetail
>>> docs. Just the altsched page is out of bounds.
>>>
>>
>> Can you try again, forcing a cache reload in your browser?
> 
> That's fixed it ;-)
> 

Ok, thanks.

-- 
Philippe.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21  5:26 ` Meng, Fino
@ 2020-07-21 17:18   ` Philippe Gerum
  2020-07-22 12:26     ` Meng, Fino
  2020-07-23 13:09     ` Steven Seeger
  0 siblings, 2 replies; 13+ messages in thread
From: Philippe Gerum @ 2020-07-21 17:18 UTC (permalink / raw)
  To: Meng, Fino, Evl; +Cc: Xenomai (xenomai@xenomai.org)

On 7/21/20 7:26 AM, Meng, Fino wrote:
> 
>> Sent: Tuesday, July 21, 2020 4:47 AM
>>
>> FWIW, I'm investigating the opportunity for rebasing Dovetail - and therefore the EVL core - on the PREEMPT_RT code base,
>> ahead of the final integration of the latter into the mainline kernel tree. In the same move, the goal would be to leverage the
>> improvements brought by native preemption with respect to fine-grained interrupt protection, while keeping the alternate
>> scheduling [1] feature, which still exhibits significantly shorter preemption times and much cleaner jitter compared to what
>> is - at least currently - achievable with a plain PREEMPT_RT kernel under meaningful stress.
>>
>> With such hybridization, the Dovetail implementation should be even simpler.
>> Companion cores based on it could run on the out-of-band execution stage unimpeded by other forms of preemption
>> disabling in the in-band kernel (e.g.
>> locks). This would preserve the most significant advantage of the pipelining model when it comes to reliable response times
>> for applications at a modest processing cost by a lightweight real-time infrastructure.
>>
>> This work entails porting the latest Dovetail code base I have been working on lately from 5.8 back to 5.6-rt, since this is the
>> most recent public release of PREEMPT_RT so far. In addition, interrupt-free sections which are deemed too long for the
>> companion core on top to cope with, need to be identified in the target PREEMPT_RT release so that they could be mitigated
>> (see below for an explanation about how it could be done). In the future, a way to automate such research should be
>> looked for, since finding these spots is likely going to be the boring task to carry out each time this new Dovetail
>> implementation is ported to the next PREEMPT_RT release. Plus, the truckload of other tricky issues I may have overlooked.
>>
>> If anyone is interested in participating in this work, let me know. I cannot guarantee success, but the data I have collected
>> over time with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are
>> combined the right way.
> 
> Hi Philippe,
> 
> I would like to participate. One the of motivation is the TSN stack is now within Preempt-RT Linux. 
> Some time ago we have discussed with Jan about similar idea, patch Ipipe/Xenomai onto Preempt-RT kernel but not vanilla kernel, 
> then separate Cobalt thread and Preempt-RT's RT thread to different cores.
> 

Ok. As far as I'm concerned, I'm only scratching an itch: I find some
interest in looking for ways to downsize the hardware needed to run
applications with demanding response time requirements, without necessarily
resorting to a plain RTOS.

Back to the initial point, this work should involve, roughly:

- implementing a Dovetail variant in the native preemption kernel. This is
actually not a direct port: the new implementation would depart from the
current Dovetail code in significant ways, although the basics would be the
same, only used differently. I plan to work on this, although it would be much
better if other folks joined me in the implementation once the thing is
bootstrapped.

- identifying and quantifying the longest interrupt-free sections in the
target preempt-rt kernel under meaningful stress load, with the irqsoff
tracer. I wrote down some information [1] about the stress workloads which
actually make a difference when benchmarking, as far as I can tell. At any
rate, the results we would get there would be crucial in order to figure out
where to add the out-of-band synchronization points, and likely of some
interest upstream too. I'm primarily targeting armv7 and armv8; it would be
great if you could help with x86.

- the two previous points are obviously part of an iterative process centered
on testing the implementation with a real-time core. I'm going to use the EVL
core for this, since it is sitting on Dovetail already.

[1] https://evlproject.org/core/benchmarks/#stress-load

-- 
Philippe.



* RE: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21 17:18   ` Philippe Gerum
@ 2020-07-22 12:26     ` Meng, Fino
  2020-07-23 13:09     ` Steven Seeger
  1 sibling, 0 replies; 13+ messages in thread
From: Meng, Fino @ 2020-07-22 12:26 UTC (permalink / raw)
  To: Philippe Gerum, Evl; +Cc: Xenomai (xenomai@xenomai.org)


>On 7/21/20 7:26 AM, Meng, Fino wrote:
>>
>>> Sent: Tuesday, July 21, 2020 4:47 AM
>>>
>>> FWIW, I'm investigating the opportunity for rebasing Dovetail - and
>>> therefore the EVL core - on the PREEMPT_RT code base, ahead of the
>>> final integration of the latter into the mainline kernel tree. In the
>>> same move, the goal would be to leverage the improvements brought by
>>> native preemption with respect to fine-grained interrupt protection, while keeping the alternate scheduling [1] feature,
>which still exhibits significantly shorter preemption times and much cleaner jitter compared to what is - at least currently -
>achievable with a plain PREEMPT_RT kernel under meaningful stress.
>>>
>>> With such hybridization, the Dovetail implementation should be even simpler.
>>> Companion cores based on it could run on the out-of-band execution
>>> stage unimpeded by other forms of preemption disabling in the in-band kernel (e.g.
>>> locks). This would preserve the most significant advantage of the
>>> pipelining model when it comes to reliable response times for applications at a modest processing cost by a lightweight
>real-time infrastructure.
>>>
>>> This work entails porting the latest Dovetail code base I have been
>>> working on lately from 5.8 back to 5.6-rt, since this is the most
>>> recent public release of PREEMPT_RT so far. In addition,
>>> interrupt-free sections which are deemed too long for the companion
>>> core on top to cope with, need to be identified in the target
>>> PREEMPT_RT release so that they could be mitigated (see below for an explanation about how it could be done). In the
>future, a way to automate such research should be looked for, since finding these spots is likely going to be the boring task
>to carry out each time this new Dovetail implementation is ported to the next PREEMPT_RT release. Plus, the truckload of
>other tricky issues I may have overlooked.
>>>
>>> If anyone is interested in participating in this work, let me know. I
>>> cannot guarantee success, but the data I have collected over time
>>> with both the dual kernel and native preemption models leaves me optimistic about the outcome if they are combined
>the right way.
>>
>> Hi Philippe,
>>
>> I would like to participate. One the of motivation is the TSN stack is now within Preempt-RT Linux.
>> Some time ago we have discussed with Jan about similar idea, patch
>> Ipipe/Xenomai onto Preempt-RT kernel but not vanilla kernel, then separate Cobalt thread and Preempt-RT's RT thread to
>different cores.
>>
>
>Ok. As far as I'm concerned, I'm only scratching an itch, I find some interest in looking for ways to downsize the hardware
>for running applications with demanding response time requirements, without necessarily resorting to a plain rtos.
>
>Back to the initial point, this work should involve, roughly:
>
>- implementing a Dovetail variant in the native preemption kernel. This is actually not a direct port, the new implementation
>would depart from the current Dovetail code in significant ways, although the basics would be the same, only used
>differently. I plan to work on this, although it would be much better if other folks would join me in the implementation once
>the thing is bootstrapped.
>
>- identifying and quantifying the longest interrupt-free sections in the target preempt-rt kernel under meaningful stress load,
>with the irqoff tracer.
>I wrote down some information [1] about the stress workloads which actually make a difference when benchmarking as far
>as I can tell. At any rate, the results we would get there would be crucial in order to figure out where to add the out-of-band
>synchronization points, and likely of some interest upstream too. I'm primarily targeting armv7 and armv8, it would be great
>if you could help with x86.
>
>- the two previous points are obviously part of an iterative process centered on testing the implementation with a real-time
>core. I'm going to use the EVL core for this, since it is sitting on Dovetail already.
>
>[1] https://evlproject.org/core/benchmarks/#stress-load

>

I will use an UP Xtreme (WHL8565U) board for testing, since it is easy to buy
for global developers:
https://up-shop.org/up-xtreme-series.html

The Intel IOTG kernel team maintains a Preempt-RT kernel, well tested for
x86, but only up to 5.4:
https://github.com/intel/linux-intel-lts/tree/5.4/preempt-rt
I don't know if it would help in this case.

BR / Fino



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-21 17:18   ` Philippe Gerum
  2020-07-22 12:26     ` Meng, Fino
@ 2020-07-23 13:09     ` Steven Seeger
  2020-07-23 16:23       ` Philippe Gerum
  1 sibling, 1 reply; 13+ messages in thread
From: Steven Seeger @ 2020-07-23 13:09 UTC (permalink / raw)
  To: Meng, Fino, Evl, xenomai; +Cc: Philippe Gerum

On Tuesday, July 21, 2020 1:18:21 PM EDT Philippe Gerum wrote:
> 
> - identifying and quantifying the longest interrupt-free sections in the
> target preempt-rt kernel under meaningful stress load, with the irqoff
> tracer. I wrote down some information [1] about the stress workloads which
> actually make a difference when benchmarking as far as I can tell. At any
> rate, the results we would get there would be crucial in order to figure
> out where to add the out-of-band synchronization points, and likely of some
> interest upstream too. I'm primarily targeting armv7 and armv8, it would be
> great if you could help with x86.

So from my perspective, one of the beauties of Xenomai with the traditional
I-pipe is that you can analyze the fast interrupt path and see that by design
you have an upper bound on latency. You can even calculate it. It's based on
the number of CPU cycles at IRQ entry multiplied by the total number of IRQs
that could happen at the same time. Depending on your hardware, maybe you
know the priority of handling the interrupt in question.
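
For instance - made-up numbers - with a 200-cycle worst-case entry path, 8
IRQ sources which could fire back to back and a 1 GHz clock, you would bound
the entry latency by 8 * 200 cycles = 1.6 us.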

The point was that the system was analyzable by design.

When you start talking about looking for long critical sections and adding
sync points to them, I think you take away the by-design guarantees for
latency. This might make it less suitable for hard realtime systems.

IMHO this is not any better than Preempt-RT. But maybe I am missing something. 
:)

Steven






* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-23 13:09     ` Steven Seeger
@ 2020-07-23 16:23       ` Philippe Gerum
  2020-07-23 21:53         ` Steven Seeger
  0 siblings, 1 reply; 13+ messages in thread
From: Philippe Gerum @ 2020-07-23 16:23 UTC (permalink / raw)
  To: Steven Seeger, Meng, Fino, Evl, xenomai

On 7/23/20 3:09 PM, Steven Seeger wrote:
> On Tuesday, July 21, 2020 1:18:21 PM EDT Philippe Gerum wrote:
>>
>> - identifying and quantifying the longest interrupt-free sections in the
>> target preempt-rt kernel under meaningful stress load, with the irqoff
>> tracer. I wrote down some information [1] about the stress workloads which
>> actually make a difference when benchmarking as far as I can tell. At any
>> rate, the results we would get there would be crucial in order to figure
>> out where to add the out-of-band synchronization points, and likely of some
>> interest upstream too. I'm primarily targeting armv7 and armv8, it would be
>> great if you could help with x86.
> 
> So from my perspective, one of the beauties of Xenomai with traditional IPIPE 
> is you can analyze the fast interrupt path and see that by design you have an 
> upper bound on latency. You can even calculate it. It's based on the number of 
> cpu cycles at irq entry multiplied by the total numbers of IRQs that could 
> happen at the same time. Depending on your hardware, maybe you know the 
> priority of handling the interrupt in question.
> 
> The point was the system was analyzable by design.
>

Two misunderstandings it seems:

- this work is all about evolving Dovetail, not Xenomai. If such work does
bring the upsides I'm expecting, then I would surely switch EVL to it. In
parallel, you would still have the opportunity to keep the current Dovetail
implementation - currently under validation on top of 5.8 - and maintain it
for Xenomai, once the latter is rebased over the former. You could also stick
to the I-pipe for Xenomai, so no issue.

- you seem to be assuming that every code path of the kernel is interruptible
with the I-pipe/Dovetail; this is not the case, by far. Some key portions run
with hard irqs off, just because 1) there is no other way to share some code
paths between the regular kernel and the real-time core, or 2) the hardware
may require it (as hinted in my introductory post). Some of those sections
may take ages under cache pressure (switch_to comes to mind) - tens of
microseconds - happening mostly randomly from the standpoint of the external
observer (i.e. you, me). So much for quantifying timings by design.

We can only figure out a worst-case value by submitting the system to a
reckless stress workload, for long enough. This game of sharing the very same
hardware between GPOS and RTOS activities has been based on a probabilistic
approach so far, which can be summarized as: do your best to keep interrupts
enabled as long as possible, ensure fine-grained preemption of tasks, make
sure to give the result hell to detect issues, and hope for the hardware not
to rain on the parade.

Back to the initial point: virtualizing the effect of the local_irq helpers
you refer to is required when their use is front and center in serializing
kernel activities. However, in a preempt-rt kernel, most interrupt handlers
are threaded and regular spinlocks are blocking mutexes in disguise, so what
remains is:

- sections covered by the raw_spin_lock API, which is primarily a problem
because we would spin with hard irqs off attempting to acquire the lock.
There is a proven technical solution to this based on an application of
interrupt pipelining (see the sketch after this list).

- a few remaining local_irq-disabled sections which may run for too long, but
could be relaxed enough for the real-time core to preempt without prejudice.
This is where pro-actively tracing the kernel under stress comes into play.
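
To illustrate the first point, the pipelined way out of the hard spinning
issue could be sketched like this - hypothetical code, briefly re-opening the
CPU for out-of-band IRQs while busy-waiting instead of keeping it closed for
the whole acquisition:

static inline unsigned long pipelined_spin_lock_irqsave(raw_spinlock_t *lock)
{
	unsigned long flags = hard_local_irq_save();

	while (!do_raw_spin_trylock(lock)) {
		/*
		 * The lock is held by in-band code which may keep it
		 * for a while: let out-of-band IRQs preempt meanwhile.
		 */
		hard_local_irq_restore(flags);
		cpu_relax();
		flags = hard_local_irq_save();
	}

	return flags;
}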

Working on these aspects specifically does not bring fewer guarantees than
hoping for no assembly code to create long uninterruptible sections
(therefore not covered by the local_irq_* helpers), no driver talking to a
GPU killing latency with CPU stalls, no shared cache architecture causing all
sorts of insane traffic between cache levels, degrading memory access speed
and overall performance.

Again, the key issue there is about running two competing workloads, GPOS and
RTOS, on the same hardware. They tend not to get along that well. Each time
the latter pauses, the former may resume and happily trash some hardware
sub-system both rely on. So for your calculation to be right, you would not
only have to involve the RTOS code, but also what the GPOS is up to, and how
this might change the timings.

In this respect, no I-pipe - and for that matter no current or future Dovetail
implementation - can provide any guarantee. In short, I'm pretty convinced
that any calculation you would try would be wrong by design, missing quite a
few significant variables in the equation, at least for precise timing.

However, it is true that relying on native preemption creates more
opportunities for some hog code to cause havoc in the latency chart, because
of brain-damaged locking for instance, which may have been solved in a
serendipitous way with the current I-pipe/Dovetail model, due to the
systematized virtualization of the local_irq helpers. This said,
spinlock-wise for preempt-rt, this would only be an issue with rogue code
contributions explicitly using raw spinlocks mindlessly, which would be yet
another step toward a nomination of their author for the Darwin Awards
(especially if they get caught by some upstream reviewer on the LKML).

At this point, there are two options:

- consider that direct or indirect local_irq* usage is the only factor of
increased interrupt latency, which is provably wrong.

- assume that future work aimed at statically detecting misbehaving code
(rt-wise) in the kernel may succeed, which may be optimistic but at least not
fundamentally flawed. So I'll go for the optimistic view.

> When you start talking about looking for long critical sections and adding 
> sync points in it, I think you take away the by-design guarantees for latency. 
> This might make it less-suitable for hard realtime systems.
> 
> IMHO this is not any better than Preempt-RT. But maybe I am missing something. 
> :)
> 

You may be missing the reasoning behind the alternate scheduling Dovetail
implements, which is a generalization of what the I-pipe and Xenomai do to
schedule tasks regardless of the preemption status of the regular kernel.

IMHO, the issue shouldn't be about decreasing interrupt masking throughout
the kernel, which should already be fine-grained enough to only require
manual fixups for addressing the remaining problems - this assumption
triggered the idea of a Dovetail overhaul. The other granularity, the one
that still matters today, relates to task preemption: how snappy the kernel
can be made in order to direct the CPU to running a different task when some
urgent event has to be handled asap. Dovetail still brings the alternate
scheduling feature for expedited tasks, which preempt-rt, by essence, has no
reason to provide.

This task preemption issue is a harder nut to crack, because techniques which
make the granularity finer may also increase the overall cost of maintaining
the infrastructure (complex priority inheritance for mutexes, a threaded IRQ
model and sleeping spinlocks which increase the context switching rate,
preemptible RCU and so on). That cost, which is clearly visible in latency
analysis, is by design not applicable to tasks scheduled by a companion core,
with much simpler locking semantics which are not shared with the regular
kernel. In that sense specifically, I would definitely agree that estimating
a WCET based on the scheduling behavior of Xenomai or EVL is way simpler than
mathematizing what might happen in the Linux kernel.

-- 
Philippe.



* Re: Dovetail <-> PREEMPT_RT hybridization
  2020-07-23 16:23       ` Philippe Gerum
@ 2020-07-23 21:53         ` Steven Seeger
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Seeger @ 2020-07-23 21:53 UTC (permalink / raw)
  To: Meng, Fino, Evl, xenomai, Philippe Gerum

On Thursday, July 23, 2020 12:23:53 PM EDT Philippe Gerum wrote:
> Two misunderstandings it seems:
> 
> - this work is all about evolving Dovetail, not Xenomai. If such work does
> bring the upsides I'm expecting, then I would surely switch EVL to it. In
> parallel, you would still have the opportunity to keep the current Dovetail
> implementation - currently under validation on top of 5.8 - and maintain it
> for Xenomai, once the latter is rebased over the former. You could also
> stick to the I-pipe for Xenomai, so no issue.

That may be my misunderstanding. I thought Dovetail's ultimate goal was to at
least match the performance of the I-pipe while being simpler to maintain.

> - you seem to be assuming that every code paths of the kernel is
> interruptible with the I-pipe/Dovetail, this is not the case, by far. Some
> keys portions run with hard irqs off, just because there is no other way to
> 1) share some code paths between the regular kernel and the real-time core,
> 2) the hardware may require it (as hinted in my introductory post). Some of
> those sections may take ages under cache pressure (switch_to comes to
> mind), tenths of micro-seconds, happening mostly randomly from the
> standpoint of the external observer (i.e. you, me). So much for quantifying
> timings by design.

So with switch_to having hard irqs off, the cache pressure should be
deterministic, because there's an upper bound on cache lines and on the
number of memory pages that need to be accessed, and the code path is pretty
straightforward if memory serves. I would think that this being well bounded
supports my initial point.

> 
> We can only figure out a worst-case value by submitting the system to a
> reckless stress workload, for long enough. This game of sharing the very
> same hardware between GPOS and a RTOS activities has been based on a
> probabilistic approach so far, which can be summarized as: do your best to
> keep the interrupts enabled as long as possible, ensure fine-grained
> preemption of tasks, make sure to give the result hell to detect issues,
> and hope for the hardware not to rain on the parade.

I agree that in practice, a reckless stress workload is necessary to quantify
system latency. However, relying on this is a problem when it comes time to
convince managers who would rather spend tons of money on expensive and
proven OS solutions than use the fun and cool stuff we do. ;)

At some point, if possible, someone should try to actually prove the system
given the bounds:

1) There's only so many pages of memory
2) There's only so much cache and so many cache lines
3) There's only so many sources of interrupts
4) There's only so many sources of CPU stalls, and the number of stalls
should have a limit in hardware.

I can't really think of anything else, but I don't know why there'd be any 
sort of randomness on top of this.

One thing we might not be on the same page about is that typically
(especially on single processor systems) when I talk about timing-by-design
calculations, I am referring to one single high-priority thing. That could be
a timer interrupt to the first instruction running in that timer interrupt
handler, or it could be to the point where the highest priority thread in the
system resumes.

> 
> Back to the initial point: virtualizing the effect of the local_irq helpers
> you refer to is required when their use is front and center in serializing
> kernel activities. However, in a preempt-rt kernel, most interrupt handlers
> are threaded, regular spinlocks are blocking mutexes in disguise, so what
> remains is:

Yes, but this depends on a cooperative model. Other drivers can mess you up,
as you describe below.

> 
> - sections covered by the raw_spin_lock API, which is primarily a problem
> because we would spin with hard irqs off attempting to acquire the lock.
> There is a proven technical solution to this based on a application of
> interrupt pipelining.

Yes.
 
> - few remaining local_irq disabled sections which may run for too long, but
> could be relaxed enough in order for the real-time core to preempt without
> prejudice. This is where pro-actively tracing the kernel under stress comes
> into play.

This is my problem with preempt-rt. The I-pipe forces this preemption by
changing what the macros Linux devs think turn interrupts off actually do. We
never need to worry about this in the RTOS domain.
 
> Working on these three aspects specifically does not bring less guarantees
> than hoping for no assembly code to create long uninterruptible section
> (therefore not covered by local_irq_* helpers), no driver talking to a GPU
> killing latency with CPU stalls, no shared cache architecture causing all
> sort of insane traffic between cache levels, causing memory access speed to
> sink and overall performances to degrade.

I haven't had a chance to work with these sorts of systems, but we are doing
more with ARM processors with multi-level MMUs, and I'm very curious about
how this will affect performance when you're trying to do AMP with an RTOS
and a GPOS on the same chip.

> 
> Again, the key issue there is about running two competing workloads on the
> same hardware, GPOS and RTOS. They tend to not get along that much. Each
> time the latter pauses, the former may resume and happily trash some
> hardware sub-system both rely on. So for your calculation to be right, you
> would have not only to involve the RTOS code, but also what the GPOS is up
> to, and how this might change the timings.

That is true. This could possibly be handled, though, by a hypervisor that
actually touches the hardware.

> 
> In this respect, no I-pipe - and for that matter no current or future
> Dovetail implementation - can provide any guarantee. In short, I'm pretty
> convinced that any calculation you would try would be wrong by design,
> missing quite a few significant variables in the equation, at least for
> precise timing.

There's only so many ways to break the system, so we just need to make sure we 
find all the variables. ;) I do think what I am saying is true for single 
processor systems, but you have a point for multi.

> However, it is true that relying on native preemption creates more
> opportunities for some hog code to cause havoc in the latency chart because
> of braindamage locking for instance, which may have been solved in a
> serendipitous way with the current I-pipe/Dovetail model, due to the
> systematized virtualization of the local_irq helpers. This said,
> spinlock-wise for preempt-rt, this would only be an issue with rogue code
> contributions explicitly using raw spinlocks mindlessly, which would be yet
> another step toward a nomination of their author for the Darwin awards
> (especially if they get caught by some upstream reviewer on the lkml).

There are people who insist that the system be safe and robust against rogue 
code.

> 
> At this point, there are two options:
> 
> - consider that direct or indirect local_irq* usage is the only factor of
> increased interrupt latency, which is provably wrong.

I would think this to be the largest contributor to it, though.

> 
> - assume that future work aimed at statically detecting misbehaving code
> (rt-wise) in the kernel may succeed, which may be optimistic but at least
> not fundamentally flawed. So I'll go for the optimistic view.

This is a good point. But I think a system where you depend on millions of
lines of code being good, instead of a few thousand, is asking for trouble.
 
> 
> This task preemption issue is a harder nut to crack, because techniques
> which make the granularity finer may also increase the overall cost induced
> in maintaining the infrastructure (complex priority inheritance for
> mutexes, threaded irq model and sleeping spinlocks which increase the
> context switching rate, preemptible RCU and so on). That cost, which is
> clearly visible in latency analysis, is by design not applicable to tasks
> scheduled by a companion core, with much simpler locking semantics which
> are not shared with the regular kernel. In that sense specifically, I would
> definitely agree that estimating a WCET based on the scheduling behavior of
> Xenomai or EVL is way simpler than mathematizing what might happen in the
> linux kernel.

So in your opinion, what is more important: lowest possible latency on a
best-case or average basis, or determinism? What I mean is, the things you
mention that come with a higher cost in terms of latency also come with more
determinism (less jitter, more predictable results, etc.). So what is the
goal you are working towards?

I've known you for like 20 years and probably never won an argument, so 
history says I will be headed to defeat here. ;)

Steven




