* [Xenomai] Reading /proc/xenomai/stat causes high latencies
From: Jeroen Van den Keybus @ 2014-04-22 16:02 UTC (permalink / raw)
  To: xenomai

Using a 3.10.18 kernel with Xenomai 2.6.3, reading /proc/xenomai/stat
causes high latencies in RT tasks. I've found a report on a similar
issue at
https://sites.google.com/site/manisbutareed/linuxcnc-2-5/xenomai-user-threads.
We also saw this occurring on a 3.8.13 kernel.

A typical latency test run looks like:

RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     -2.382|     -2.357|     -1.945|       0|     0|     -2.382|     -1.945
RTD|     -2.577|     -2.360|     -1.749|       0|     0|     -2.577|     -1.749
RTD|     -2.380|     -2.360|     -1.865|       0|     0|     -2.577|     -1.749
RTD|     -2.568|     -2.361|     -1.530|       0|     0|     -2.577|     -1.530
RTD|     -2.379|     -2.359|     -1.732|       0|     0|     -2.577|     -1.530
RTD|     -2.381|     -2.361|     -2.008|       0|     0|     -2.577|     -1.530
RTD|     -2.381|     -2.360|     -2.085|       0|     0|     -2.577|     -1.530
RTD|     -2.699|     -2.359|      2.566|       0|     0|     -2.699|      2.566
RTD|     -2.380|     -2.320|     -1.876|       0|     0|     -2.699|      2.566
RTD|     -2.381|     -2.359|      2.528|       0|     0|     -2.699|      2.566
RTD|     -2.380|     -2.360|     -1.805|       0|     0|     -2.699|      2.566
RTD|     -2.579|     -2.311|     -0.045|       0|     0|     -2.699|      2.566
RTD|     -2.380|     -2.359|     -2.072|       0|     0|     -2.699|      2.566
RTD|     -2.575|     -2.360|      2.065|       0|     0|     -2.699|      2.566
RTD|     -2.381|     19.028|   3043.067|      31|     0|     -2.699|   3043.067
RTD|     -2.566|     26.488|    105.823|      32|     0|     -2.699|   3043.067
RTD|     -2.443|     -2.276|      0.597|      32|     0|     -2.699|   3043.067
RTD|     -2.584|     -2.306|      2.032|      32|     0|     -2.699|   3043.067
RTD|     -2.377|     -2.242|      4.106|      32|     0|     -2.699|   3043.067
RTD|     -2.537|     -2.291|      4.394|      32|     0|     -2.699|   3043.067

It is obvious where I issued the 'cat /proc/xenomai/stat' command.
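
A run along these lines should reproduce the spike (a sketch; the
latency binary ships with Xenomai, and /usr/xenomai is only the
default install prefix):

    # terminal 1: periodic RT task, 100 us period, as in the run above
    /usr/xenomai/bin/latency -p 100

    # terminal 2: trigger the spike
    cat /proc/xenomai/stat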

I will try to create an I-trace now. (The config is the same as in my
previous post, 'Slow execution of RT task'; I'm still looking into
that issue as well.)


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
From: Jeroen Van den Keybus @ 2014-04-23  9:14 UTC (permalink / raw)
  To: xenomai

Curious. If I enable the I-pipe tracer, the problem goes away. If I
disable it, it reliably returns. In the latter case, however, loading
xeno_native is either slow (several seconds) or does not complete
until I hit a key (I used <SHIFT L>) on an attached keyboard.
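
With CONFIG_IPIPE_TRACE=y, a worst-case trace can then be captured
through the tracer's proc interface, roughly like this (a sketch; the
exact /proc/ipipe/trace layout may vary between I-pipe versions):

    echo 0 > /proc/ipipe/trace/max    # reset the worst-case snapshot
    modprobe xeno_native              # exercise the slow path
    cat /proc/ipipe/trace/max > trace.txt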

(The dmesg log contains the same rcutree warnings in both cases, as
mentioned in the 'Slow execution of RT task' post.)

Jeroen.


2014-04-22 18:02 GMT+02:00 Jeroen Van den Keybus
<jeroen.vandenkeybus@gmail.com>:
> Using a 3.10.18 kernel with Xenomai 2.6.3, reading /proc/xenomai/stat
> causes high latencies in RT tasks. I've found a report on a similar
> issue at
> https://sites.google.com/site/manisbutareed/linuxcnc-2-5/xenomai-user-threads.
> We also saw this occurring on a 3.8.13 kernel.
>
> A typical latency test run looks like:
>
> [latency test output snipped]
>
> It is obvious where I issued the 'cat /proc/xenomai/stat' command.
>
> I will try to create an I-trace now. (The config is the same as in my
> previous post, 'Slow execution of RT task'; I'm still looking into
> that issue as well.)
>
>
> Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
From: Jeroen Van den Keybus @ 2014-04-23 13:45 UTC (permalink / raw)
  To: xenomai

I've attached an I-trace of what happens when 'modprobe xeno_native'
stalls. I could use some hints as to where to start looking into this
issue. Right now, my plan would be to compare which code paths are
traversed with CONFIG_IPIPE_TRACE unset and set, respectively.
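
For a first overview of the attached trace, a rough tally of the
function column works (a sketch, assuming the trace was saved as
trace.txt; the function name is the second-to-last field of each
event line, whose last field is the parenthesized parent):

    awk '$NF ~ /^\(/ { print $(NF-1) }' trace.txt |
        sort | uniq -c | sort -rn | head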


Jeroen.


2014-04-23 11:14 GMT+02:00 Jeroen Van den Keybus
<jeroen.vandenkeybus@gmail.com>:
> Curious. If I enable the I-pipe tracer, the problem goes away. If I
> disable it, it reliably returns. In the latter case, however, loading
> xeno_native is either slow (several seconds) or does not complete
> until I hit a key (I used <SHIFT L>) on an attached keyboard.
>
> (The dmesg log contains the same rcutree warnings in both cases, as
> mentioned in the 'Slow execution of RT task' post.)
>
> Jeroen.
>
>
> 2014-04-22 18:02 GMT+02:00 Jeroen Van den Keybus
> <jeroen.vandenkeybus@gmail.com>:
>> Using a 3.10.18 kernel with Xenomai 2.6.3, reading /proc/xenomai/stat
>> causes high latencies in RT tasks. I've found a report on a similar
>> issue at
>> https://sites.google.com/site/manisbutareed/linuxcnc-2-5/xenomai-user-threads.
>> We also saw this occurring on a 3.8.13 kernel.
>>
>> A typical latency test run looks like:
>>
>> [latency test output snipped]
>>
>> It is obvious where I issued the 'cat /proc/xenomai/stat' command.
>>
>> I will try to create an I-trace now. (The config is the same as in my
>> previous post, 'Slow execution of RT task'; I'm still looking into
>> that issue as well.)
>>
>>
>> Jeroen.
-------------- next part --------------
I-pipe worst-case tracing service on 3.10.18-ipipe/ipipe release #1
-------------------------------------------------------------
CPU: 0, Begin: 382257313528 cycles, Trace Points: 7 (-2048/+1), Length: 19678412 us
Calibrated minimum trace-point overhead: 0.060 us

 +----- Hard IRQs ('|': locked)
 |+-- Xenomai
 ||+- Linux ('*': domain stalled, '+': current, '#': current+stalled)
 |||			  +---------- Delay flag ('+': > 1 us, '!': > 10 us)
 |||			  |	   +- NMI noise ('N')
 |||			  |	   |
	  Type	  User Val.   Time    Delay  Function (Parent)
 |   +end     0x80000000 -10026	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -10026	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -10026	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -10026	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -10026	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -10026	  0.095  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -10026	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -10026	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -10026	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -10025	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -10025	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -10025	  0.091  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -10025	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -10025	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -10025	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -10025	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -10025	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -10025	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -10024	  0.092  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -10024	  0.089  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -10024	  0.123  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -10024	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -10024	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -10024	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -10024	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -10024	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -10024	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -10024	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -10024	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -10023	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -10023	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -10023	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -10023	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -10023	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
     [... the same __vmalloc_node_range -> __alloc_pages_nodemask ->
      get_page_from_freelist -> ipipe_restore_root sequence repeats
      unchanged through the rest of the capture, timestamps advancing
      from -10023 to -9971 ...]
     #func               -9971	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9971	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9971	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9971	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9970	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9970	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9970	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9970	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9970	  0.094  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9970	  0.091  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9970	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9970	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9970	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9970	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9969	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9969	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9969	  0.118  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9969	  0.088  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9969	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9969	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9969	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9969	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9969	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9969	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9968	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9968	  0.090  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9968	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9968	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9968	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9968	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9968	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9968	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9968	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9968	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9967	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9967	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9967	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9967	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9967	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9967	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9967	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9967	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9967	  0.092  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9967	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9966	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9966	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9966	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9966	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9966	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9966	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9966	  0.118  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9966	  0.088  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9966	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9966	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9965	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9965	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9965	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9965	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9965	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9965	  0.090  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9965	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9965	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9965	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9965	  0.095  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9964	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9964	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9964	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9964	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9964	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9964	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9964	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9964	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9964	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9964	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9963	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9963	  0.098  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9963	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9963	  0.089  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9963	  0.123  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9963	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9963	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9963	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9963	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9963	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9962	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9962	  0.087  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9962	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9962	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9962	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9962	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9962	  0.094  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9962	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9962	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9962	  0.109  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9961	  0.094  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9961	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9961	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9961	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9961	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9961	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9961	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9961	  0.118  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9961	  0.088  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9961	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9960	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9960	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9960	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9960	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9960	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9960	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9960	  0.090  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9960	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9960	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9959	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9959	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9959	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9959	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9959	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9959	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9959	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9959	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9959	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9959	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9959	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9958	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9958	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9958	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9958	  0.092  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9958	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9958	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9958	  0.097  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9958	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9958	  0.089  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9958	  0.094  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9957	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9957	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9957	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9957	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9957	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9957	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9957	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9957	  0.090  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9957	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9956	  0.094  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9956	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9956	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9956	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9956	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9956	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9956	  0.092  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9956	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9956	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9956	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9955	  0.090  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9955	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9955	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9955	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9955	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9955	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9955	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9955	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9955	  0.092  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9955	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9955	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9954	  0.097  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9954	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9954	  0.089  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9954	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9954	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9954	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9954	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9954	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9954	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9954	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9953	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9953	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9953	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9953	  0.096  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9953	  0.090  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9953	  0.089  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9953	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9953	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9953	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9952	  0.092  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9952	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9952	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9952	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9952	  0.087  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9952	  0.090  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9952	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9952	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9952	  0.090  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9952	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9952	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9951	  0.097  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9951	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9951	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9951	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9951	  0.097  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9951	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9951	  0.089  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9951	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9951	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9951	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9950	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9950	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9950	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9950	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9950	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9950	  0.090  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9950	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9950	  0.094  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9950	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9949	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9949	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9949	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9949	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9949	  0.092  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9949	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9949	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9949	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9949	  0.087  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9949	  0.090  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9949	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9948	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9948	  0.090  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9948	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9948	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9948	  0.097  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9948	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9948	  0.092  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9948	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9948	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9947	  0.095  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9947	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9947	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9947	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9947	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9947	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9947	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9947	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9947	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9947	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9947	  0.090  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9946	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9946	  0.094  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9946	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9946	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9946	  0.122  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9946	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9946	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9946	  0.092  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9946	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9946	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9945	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9945	  0.087  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9945	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9945	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9945	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9945	  0.090  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9945	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9945	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9945	  0.097  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9944	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9944	  0.092  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9944	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9944	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9944	  0.092  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9944	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9944	  0.092  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9944	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9944	  0.120  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9944	  0.087  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9943	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9943	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9943	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9943	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9943	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9943	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9943	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9943	  0.091  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9943	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9943	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9942	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9942	  0.094  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9942	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9942	  0.090  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9942	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9942	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9942	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9942	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9942	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9942	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9941	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9941	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9941	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9941	  0.095  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9941	  0.092  __alloc_pages_nodemask+0x0 (__vmalloc_node_range+0xf3)
     +func               -9941	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9941	  0.121  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9941	  0.112  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9941	  0.096  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9941	  0.091  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9940	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9940	  0.095  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9940	  0.121  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     #func               -9940	  0.085  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9940	  0.091  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9940	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9940	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9940	  0.090  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9940	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9940	  0.143  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9939	  0.122  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9939	  0.107  map_vm_area+0x0 (__vmalloc_node_range+0x11e)
     +func               -9939	  0.594  vmap_page_range_noflush+0x0 (map_vm_area+0x2e)
     +func               -9939	  0.096  __ipipe_pin_range_globally+0x0 (vmap_page_range_noflush+0x22a)
     +func               -9939	  5.627  _raw_spin_lock+0x0 (__ipipe_pin_range_globally+0x8e)
     +func               -9933	  0.377  xnheap_init+0x0 [xeno_nucleus] (xnpod_init+0x2ad [xeno_nucleus])
     +func               -9933	  6.681  init_extent+0x0 [xeno_nucleus] (xnheap_init+0x1f1 [xeno_nucleus])
 |   +begin   0x80000000 -9926	  0.370  xnheap_init+0x239 [xeno_nucleus] (xnpod_init+0x2ad [xeno_nucleus])
 |  *+func               -9925	  0.122  __ipipe_restore_head+0x0 (xnheap_init+0x360 [xeno_nucleus])
 |   +end     0x80000000 -9925	  0.191  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9925	  0.096  xnheap_set_label+0x0 [xeno_nucleus] (xnpod_init+0x2c8 [xeno_nucleus])
 |   +begin   0x80000000 -9925	  0.270  xnheap_set_label+0x5d [xeno_nucleus] (xnpod_init+0x2c8 [xeno_nucleus])
 |  *+func               -9925	  0.112  __ipipe_restore_head+0x0 (xnheap_set_label+0x168 [xeno_nucleus])
 |   +end     0x80000000 -9925	  0.089  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9925	  0.109  kmalloc_order_trace+0x0 (xnpod_init+0x2dc [xeno_nucleus])
     +func               -9924	  0.090  __get_free_pages+0x0 (kmalloc_order_trace+0x2e)
     +func               -9924	  0.094  __alloc_pages_nodemask+0x0 (__get_free_pages+0x17)
     +func               -9924	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9924	  0.123  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9924	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9924	  0.102  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9924	  0.090  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9924	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9924	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9924	  0.151  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     +func               -9923	  0.111  _raw_spin_lock_irqsave+0x0 (get_page_from_freelist+0x5de)
     #func               -9923	  0.131  __rmqueue+0x0 (get_page_from_freelist+0x5ef)
     #func               -9923	  0.156  get_pageblock_flags_group+0x0 (get_page_from_freelist+0x60b)
     #func               -9923	  0.103  __mod_zone_page_state+0x0 (get_page_from_freelist+0x622)
     #func               -9923	  0.088  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9923	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9923	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9923	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9923	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9922	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9922	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9922	  0.261  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9922	  0.192  xnheap_init+0x0 [xeno_nucleus] (xnpod_init+0x2fe [xeno_nucleus])
     +func               -9922	  3.767  init_extent+0x0 [xeno_nucleus] (xnheap_init+0x1f1 [xeno_nucleus])
 |   +begin   0x80000000 -9918	  0.161  xnheap_init+0x239 [xeno_nucleus] (xnpod_init+0x2fe [xeno_nucleus])
 |  *+func               -9918	  0.102  __ipipe_restore_head+0x0 (xnheap_init+0x360 [xeno_nucleus])
 |   +end     0x80000000 -9918	  0.103  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9918	  0.103  xnheap_set_label+0x0 [xeno_nucleus] (xnpod_init+0x31b [xeno_nucleus])
 |   +begin   0x80000000 -9918	  0.198  xnheap_set_label+0x5d [xeno_nucleus] (xnpod_init+0x31b [xeno_nucleus])
 |  *+func               -9917	  0.111  __ipipe_restore_head+0x0 (xnheap_set_label+0x168 [xeno_nucleus])
 |   +end     0x80000000 -9917	  0.274  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9917	  0.271  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9917	  0.425  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9916	  0.243  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9916	  0.200  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9916	  0.110  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9916	  0.229  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9915	  0.105  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9915	  0.231  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9915	  0.110  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9915	  0.175  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9915	  0.302  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9915	  0.165  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9914	  0.161  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9914	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9914	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9914	  0.165  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9914	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9914	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9914	  0.258  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9913	  0.308  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9913	  0.176  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9913	  0.161  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9913	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9913	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9912	  0.098  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9912	  0.163  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9912	  0.112  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9912	  0.163  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9912	  0.088  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9912	  0.230  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9912	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9911	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9911	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9911	  0.155  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9911	  0.094  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9911	  0.171  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9911	  0.112  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9911	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9911	  0.174  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9910	  0.169  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9910	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9910	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9910	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9910	  0.156  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9910	  0.163  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9909	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9909	  0.185  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9909	  0.132  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9909	  0.165  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9909	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9909	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9909	  0.091  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9908	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9908	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9908	  0.114  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9908	  0.095  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9908	  0.090  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9908	  0.197  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9908	  0.160  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9908	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9907	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9907	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9907	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9907	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9907	  0.104  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9907	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9907	  0.249  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9906	  0.162  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9906	  0.162  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9906	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9906	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9906	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9906	  0.163  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9906	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9905	  0.192  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9905	  0.161  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9905	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9905	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9905	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9905	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9905	  0.091  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9904	  0.163  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9904	  0.102  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9904	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9904	  0.089  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9904	  0.201  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9904	  0.158  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9904	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9904	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9903	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9903	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9903	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9903	  0.103  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9903	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9903	  0.265  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9903	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9902	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9902	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9902	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9902	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9902	  0.163  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9902	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9902	  0.185  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9901	  0.185  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9901	  0.162  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9901	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9901	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9901	  0.091  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9901	  0.089  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9901	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9900	  0.104  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9900	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9900	  0.091  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9900	  0.200  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9900	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9900	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9900	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9899	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9899	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9899	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9899	  0.104  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9899	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9899	  0.272  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9899	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9898	  0.163  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9898	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9898	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9898	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9898	  0.163  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9898	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9898	  0.185  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9897	  0.110  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9897	  0.162  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9897	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9897	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9897	  0.091  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9897	  0.091  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9897	  0.183  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9896	  0.103  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9896	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9896	  0.091  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9896	  0.194  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9896	  0.160  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9896	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9896	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9896	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9895	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9895	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9895	  0.105  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9895	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9895	  0.290  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9895	  0.160  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9895	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9894	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9894	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9894	  0.158  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9894	  0.161  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9894	  0.104  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9894	  0.207  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9893	  0.188  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9893	  0.162  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9893	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9893	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9893	  0.091  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9893	  0.091  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9893	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9892	  0.103  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9892	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9892	  0.089  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9892	  0.194  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9892	  0.160  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9892	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9892	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9892	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9891	  0.090  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9891	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9891	  0.103  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9891	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9891	  0.174  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9891	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9891	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9890	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9890	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9890	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9890	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9890	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9890	  0.176  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9890	  0.111  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9890	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9889	  0.164  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9889	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9889	  0.091  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9889	  0.089  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9889	  0.161  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9889	  0.102  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9889	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9889	  0.091  xnsched_init+0x0 [xeno_nucleus] (xnpod_init+0x33a [xeno_nucleus])
     +func               -9888	  0.196  xnsched_rt_init+0x0 [xeno_nucleus] (xnsched_init+0x4e [xeno_nucleus])
     +func               -9888	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |   +begin   0x80000000 -9888	  0.162  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0xbe [xeno_nucleus])
 |  *+func               -9888	  0.102  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9888	  0.154  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9888	  0.089  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |   +begin   0x80000000 -9888	  0.162  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0xf3 [xeno_nucleus])
 |  *+func               -9887	  0.101  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9887	  0.094  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9887	  0.254  xnthread_init+0x0 [xeno_nucleus] (xnsched_init+0x174 [xeno_nucleus])
     +func               -9887	  0.158  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |   +begin   0x80000000 -9887	  0.162  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x15f [xeno_nucleus])
 |  *+func               -9887	  0.103  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9887	  0.152  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9886	  0.157  __xntimer_init+0x0 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |   +begin   0x80000000 -9886	  0.161  __xntimer_init+0xe9 [xeno_nucleus] (xnthread_init+0x19f [xeno_nucleus])
 |  *+func               -9886	  0.121  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9886	  0.200  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9886	  0.161  xnsched_set_policy+0x0 [xeno_nucleus] (xnthread_init+0x3ac [xeno_nucleus])
     +func               -9886	  0.161  __xntimer_init+0x0 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |   +begin   0x80000000 -9885	  0.165  __xntimer_init+0xe9 [xeno_nucleus] (xnsched_init+0x26a [xeno_nucleus])
 |  *+func               -9885	  0.101  __ipipe_restore_head+0x0 (__xntimer_init+0x21b [xeno_nucleus])
 |   +end     0x80000000 -9885	  0.092  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9885	  0.091  xntimer_migrate+0x0 [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |   +begin   0x80000000 -9885	  0.162  xntimer_migrate+0x2f [xeno_nucleus] (xnsched_init+0x2c4 [xeno_nucleus])
 |  *+func               -9885	  0.100  __ipipe_restore_head+0x0 (xntimer_migrate+0x140 [xeno_nucleus])
 |   +end     0x80000000 -9885	  0.103  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9885	  0.217  ipipe_virtualize_irq+0x0 (xnpod_init+0x3b1 [xeno_nucleus])
     +func               -9884	  0.112  ipipe_request_irq+0x0 (ipipe_virtualize_irq+0x13)
     +func               -9884	  0.098  __ipipe_spin_lock_irqsave+0x0 (ipipe_request_irq+0x48)
 |   +begin   0x80000001 -9884	  0.281  __ipipe_spin_lock_irqsave+0x93 (ipipe_request_irq+0x48)
 |   #func               -9884	  0.170  __ipipe_spin_unlock_irqrestore+0x0 (ipipe_request_irq+0x6e)
 |   +end     0x80000001 -9884	  0.183  ipipe_trace_end+0x19 (__ipipe_spin_unlock_irqrestore+0x39)
     +func               -9884	  0.091  xnregistry_init+0x0 [xeno_nucleus] (xnpod_init+0x3b6 [xeno_nucleus])
     +func               -9883	  0.091  kmalloc_order_trace+0x0 (xnregistry_init+0x1e [xeno_nucleus])
     +func               -9883	  0.089  __get_free_pages+0x0 (kmalloc_order_trace+0x2e)
     +func               -9883	  0.092  __alloc_pages_nodemask+0x0 (__get_free_pages+0x17)
     +func               -9883	  0.090  ipipe_root_only+0x0 (__alloc_pages_nodemask+0x1e5)
 |   +begin   0x80000001 -9883	  0.123  ipipe_root_only+0xa3 (__alloc_pages_nodemask+0x1e5)
 |   +end     0x80000001 -9883	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9883	  0.101  _cond_resched+0x0 (__alloc_pages_nodemask+0x1ea)
     +func               -9883	  0.089  next_zones_zonelist+0x0 (__alloc_pages_nodemask+0x115)
     +func               -9883	  0.091  get_page_from_freelist+0x0 (__alloc_pages_nodemask+0x148)
     +func               -9883	  0.096  next_zones_zonelist+0x0 (get_page_from_freelist+0x68)
     +func               -9883	  0.114  __zone_watermark_ok+0x0 (get_page_from_freelist+0x12c)
     +func               -9882	  0.121  _raw_spin_lock_irqsave+0x0 (get_page_from_freelist+0x5de)
     #func               -9882	  0.142  __rmqueue+0x0 (get_page_from_freelist+0x5ef)
     #func               -9882	  0.112  get_pageblock_flags_group+0x0 (get_page_from_freelist+0x60b)
     #func               -9882	  0.092  __mod_zone_page_state+0x0 (get_page_from_freelist+0x622)
     #func               -9882	  0.088  ipipe_restore_root+0x0 (get_page_from_freelist+0x42f)
     #func               -9882	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9882	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9882	  0.094  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9882	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9881	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9881	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9881	  0.437  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9881	  0.162  xnvfile_init_dir+0x0 [xeno_nucleus] (xnregistry_init+0x48 [xeno_nucleus])
     +func               -9881	  0.154  proc_mkdir+0x0 (xnvfile_init_dir+0x23 [xeno_nucleus])
     +func               -9880	  0.156  proc_mkdir_data+0x0 (proc_mkdir+0x15)
     +func               -9880	  0.162  __proc_create+0x0 (proc_mkdir_data+0x3a)
     +func               -9880	  0.190  _raw_spin_lock+0x0 (__proc_create+0x3a)
     +func               -9880	  0.190  __xlate_proc_name+0x0 (__proc_create+0x49)
     +func               -9880	  0.097  __kmalloc+0x0 (__proc_create+0x9b)
     +func               -9880	  0.107  kmalloc_slab+0x0 (__kmalloc+0x2e)
     +func               -9880	  0.091  ipipe_root_only+0x0 (__kmalloc+0x55)
 |   +begin   0x80000001 -9879	  0.120  ipipe_root_only+0xa3 (__kmalloc+0x55)
 |   +end     0x80000001 -9879	  0.107  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9879	  0.145  _cond_resched+0x0 (__kmalloc+0x5a)
     +func               -9879	  0.337  __slab_alloc.constprop.69+0x0 (__kmalloc+0x133)
     #func               -9879	  0.098  ipipe_restore_root+0x0 (__slab_alloc.constprop.69+0x4ef)
     #func               -9879	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9879	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9878	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9878	  0.089  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9878	  0.090  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9878	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9878	  0.130  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9878	  0.091  proc_register+0x0 (proc_mkdir_data+0x52)
     +func               -9878	  0.302  proc_alloc_inum+0x0 (proc_register+0x20)
     +func               -9878	  0.109  kmem_cache_alloc+0x0 (__idr_pre_get+0x74)
     +func               -9877	  0.094  ipipe_root_only+0x0 (kmem_cache_alloc+0x31)
 |   +begin   0x80000001 -9877	  0.121  ipipe_root_only+0xa3 (kmem_cache_alloc+0x31)
 |   +end     0x80000001 -9877	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9877	  0.383  _cond_resched+0x0 (kmem_cache_alloc+0x36)
     +func               -9877	  0.123  _raw_spin_lock_irqsave+0x0 (__idr_pre_get+0x38)
     #func               -9877	  0.096  __ipipe_spin_unlock_debug+0x0 (__idr_pre_get+0x54)
     #func               -9876	  0.089  _raw_spin_unlock_irqrestore+0x0 (__idr_pre_get+0x5f)
     #func               -9876	  0.087  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x1c)
     #func               -9876	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9876	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9876	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9876	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9876	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9876	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9876	  0.114  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9876	  0.430  _raw_spin_lock_irq+0x0 (proc_alloc_inum+0x24)
     #func               -9875	  0.109  _raw_spin_lock_irqsave+0x0 (get_from_free_list+0x1a)
     #func               -9875	  0.109  __ipipe_spin_unlock_debug+0x0 (get_from_free_list+0x44)
     #func               -9875	  0.100  _raw_spin_unlock_irqrestore+0x0 (get_from_free_list+0x4f)
     #func               -9875	  0.088  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9875	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9875	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9875	  0.131  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9874	  0.202  kmem_cache_free+0x0 (ida_get_new_above+0x228)
     #func               -9874	  0.097  ipipe_unstall_root+0x0 (proc_alloc_inum+0x45)
 |   #begin   0x80000000 -9874	  0.091  ipipe_unstall_root+0x1c (proc_alloc_inum+0x45)
 |   #func               -9874	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9874	  0.177  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9874	  1.247  _raw_spin_lock+0x0 (proc_register+0x6e)
     +func               -9872	  0.171  xnvfile_init_regular+0x0 [xeno_nucleus] (xnregistry_init+0x6a [xeno_nucleus])
     +func               -9872	  0.090  proc_create_data+0x0 (xnvfile_init_regular+0x41 [xeno_nucleus])
     +func               -9872	  0.100  __proc_create+0x0 (proc_create_data+0x4d)
     +func               -9872	  0.104  _raw_spin_lock+0x0 (__proc_create+0x3a)
     +func               -9872	  0.138  __xlate_proc_name+0x0 (__proc_create+0x49)
     +func               -9872	  0.101  __kmalloc+0x0 (__proc_create+0x9b)
     +func               -9872	  0.090  kmalloc_slab+0x0 (__kmalloc+0x2e)
     +func               -9872	  0.089  ipipe_root_only+0x0 (__kmalloc+0x55)
 |   +begin   0x80000001 -9872	  0.122  ipipe_root_only+0xa3 (__kmalloc+0x55)
 |   +end     0x80000001 -9871	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9871	  0.144  _cond_resched+0x0 (__kmalloc+0x5a)
     +func               -9871	  0.101  proc_register+0x0 (proc_create_data+0x69)
     +func               -9871	  0.091  proc_alloc_inum+0x0 (proc_register+0x20)
     +func               -9871	  0.090  kmem_cache_alloc+0x0 (__idr_pre_get+0x74)
     +func               -9871	  0.092  ipipe_root_only+0x0 (kmem_cache_alloc+0x31)
 |   +begin   0x80000001 -9871	  0.121  ipipe_root_only+0xa3 (kmem_cache_alloc+0x31)
 |   +end     0x80000001 -9871	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9871	  0.123  _cond_resched+0x0 (kmem_cache_alloc+0x36)
     +func               -9870	  0.111  _raw_spin_lock_irqsave+0x0 (__idr_pre_get+0x38)
     #func               -9870	  0.101  __ipipe_spin_unlock_debug+0x0 (__idr_pre_get+0x54)
     #func               -9870	  0.096  _raw_spin_unlock_irqrestore+0x0 (__idr_pre_get+0x5f)
     #func               -9870	  0.089  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x1c)
     #func               -9870	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9870	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9870	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9870	  0.090  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9870	  0.092  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9870	  0.142  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9869	  0.090  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9869	  0.141  _raw_spin_lock_irq+0x0 (proc_alloc_inum+0x24)
     #func               -9869	  0.111  _raw_spin_lock_irqsave+0x0 (get_from_free_list+0x1a)
     #func               -9869	  0.096  __ipipe_spin_unlock_debug+0x0 (get_from_free_list+0x44)
     #func               -9869	  0.097  _raw_spin_unlock_irqrestore+0x0 (get_from_free_list+0x4f)
     #func               -9869	  0.087  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9869	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9869	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9869	  0.112  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9868	  0.094  kmem_cache_free+0x0 (ida_get_new_above+0x228)
     #func               -9868	  0.087  ipipe_unstall_root+0x0 (proc_alloc_inum+0x45)
 |   #begin   0x80000000 -9868	  0.091  ipipe_unstall_root+0x1c (proc_alloc_inum+0x45)
 |   #func               -9868	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9868	  0.109  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9868	  0.191  _raw_spin_lock+0x0 (proc_register+0x6e)
     +func               -9868	  0.095  rthal_apc_alloc+0x0 (xnregistry_init+0x89 [xeno_nucleus])
 |   +begin   0x80000000 -9868	  0.357  rthal_apc_alloc+0x3b (xnregistry_init+0x89 [xeno_nucleus])
 |  *+func               -9867	  0.115  __ipipe_restore_head+0x0 (rthal_apc_alloc+0x123)
 |   +end     0x80000000 -9867	  5.592  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -9862	  0.185  kmem_cache_alloc_trace+0x0 (xnregistry_init+0x1c5 [xeno_nucleus])
     +func               -9861	  0.090  ipipe_root_only+0x0 (kmem_cache_alloc_trace+0x38)
 |   +begin   0x80000001 -9861	  0.130  ipipe_root_only+0xa3 (kmem_cache_alloc_trace+0x38)
 |   +end     0x80000001 -9861	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     +func               -9861	  0.211  _cond_resched+0x0 (kmem_cache_alloc_trace+0x3d)
     +func               -9861	  0.322  __slab_alloc.constprop.69+0x0 (kmem_cache_alloc_trace+0xf3)
     #func               -9861	  0.096  ipipe_restore_root+0x0 (__slab_alloc.constprop.69+0x4ef)
     #func               -9860	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9860	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9860	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9860	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9860	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9860	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9860	  0.604  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9859	  0.191  xnsynch_init+0x0 [xeno_nucleus] (xnregistry_init+0x200 [xeno_nucleus])
     +func               -9859	  0.277  rthal_smi_init+0x0 (xnpod_init+0x3c6 [xeno_nucleus])
     +func               -9859	  0.170  pci_get_class+0x0 (rthal_smi_init+0x24)
     +func               -9859	  0.176  pci_get_dev_by_id+0x0 (pci_get_class+0x44)
     +func               -9858	  0.560  bus_find_device+0x0 (pci_get_dev_by_id+0x4e)
     +func               -9858	  0.249  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9858	  0.169  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9857	  0.201  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9857	  0.165  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9857	  0.261  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9857	  0.167  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9857	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9857	  0.168  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9856	  0.108  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9856	  0.164  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9856	  0.104  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9856	  0.157  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9856	  0.107  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9856	  0.165  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9856	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9855	  0.157  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9855	  0.107  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9855	  0.169  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9855	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9855	  0.268  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9855	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9855	  0.165  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9854	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9854	  0.156  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9854	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9854	  0.168  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9854	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9854	  0.165  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9854	  0.107  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9853	  0.156  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9853	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9853	  0.167  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9853	  0.105  _raw_spin_lock+0x0 (klist_next+0x20)
     +func               -9853	  0.255  match_pci_dev_by_id+0x0 (bus_find_device+0x62)
     +func               -9853	  0.196  get_device+0x0 (bus_find_device+0x98)
     +func               -9852	  0.269  _raw_spin_lock+0x0 (klist_put+0x25)
     +func               -9852	  0.115  printk+0x0 (rthal_smi_init+0xe6)
 |   +begin   0x80000001 -9852	  0.158  printk+0x66 (rthal_smi_init+0xe6)
 |   +end     0x80000001 -9852	  0.181  ipipe_trace_end+0x19 (printk+0x15c)
     +func               -9852	  0.298  vprintk_emit+0x0 (printk+0x178)
     #func               -9851	  0.378  _raw_spin_lock+0x0 (vprintk_emit+0x10f)
     #func               -9851	  0.422  log_store+0x0 (vprintk_emit+0x1cb)
     #func               -9851	  0.172  console_trylock+0x0 (vprintk_emit+0x1d6)
     #func               -9850	  0.088  down_trylock+0x0 (console_trylock+0x19)
     #func               -9850	  0.118  _raw_spin_lock_irqsave+0x0 (down_trylock+0x16)
     #func               -9850	  0.101  __ipipe_spin_unlock_debug+0x0 (down_trylock+0x2f)
     #func               -9850	  0.100  _raw_spin_unlock_irqrestore+0x0 (down_trylock+0x3a)
     #func               -9850	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9850	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9850	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9850	  0.310  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9849	  0.087  console_unlock+0x0 (vprintk_emit+0x285)
     #func               -9849	  0.127  _raw_spin_lock_irqsave+0x0 (console_unlock+0x34)
     #func               -9849	  0.089  __ipipe_spin_unlock_debug+0x0 (console_unlock+0x6f)
     #func               -9849	  0.098  _raw_spin_unlock_irqrestore+0x0 (console_unlock+0x7e)
     #func               -9849	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9849	  0.091  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9849	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9849	  0.111  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9849	  0.318  _raw_spin_lock_irqsave+0x0 (console_unlock+0x94)
     #func               -9848	  0.327  msg_print_text+0x0 (console_unlock+0x38b)
     #func               -9848	  0.109  print_prefix+0x0 (msg_print_text+0xc0)
     #func               -9848	  0.584  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9847	  0.108  print_prefix+0x0 (msg_print_text+0x160)
     #func               -9847	  0.600  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9847	  0.097  print_prefix+0x0 (msg_print_text+0xc0)
     #func               -9847	  0.177  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9846	  0.097  print_prefix+0x0 (msg_print_text+0x160)
     #func               -9846	  0.292  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9846	  0.109  print_prefix+0x0 (msg_print_text+0xc0)
     #func               -9846	  0.155  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9846	  0.089  print_prefix+0x0 (msg_print_text+0x160)
     #func               -9846	  0.245  print_time.part.4+0x0 (print_prefix+0x6f)
     #func               -9845	  0.089  call_console_drivers.constprop.15+0x0 (console_unlock+0x3d9)
     #func               -9845	  0.100  ipipe_restore_root+0x0 (console_unlock+0x3e8)
     #func               -9845	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9845	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9845	  0.111  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9845	  0.141  _raw_spin_lock_irqsave+0x0 (console_unlock+0x94)
     #func               -9845	  0.098  up+0x0 (console_unlock+0x1cf)
     #func               -9845	  0.121  _raw_spin_lock_irqsave+0x0 (up+0x14)
     #func               -9844	  0.090  __ipipe_spin_unlock_debug+0x0 (up+0x30)
     #func               -9844	  0.088  _raw_spin_unlock_irqrestore+0x0 (up+0x3b)
     #func               -9844	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9844	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9844	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9844	  0.114  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9844	  0.107  _raw_spin_lock+0x0 (console_unlock+0x1db)
     #func               -9844	  0.098  __ipipe_spin_unlock_debug+0x0 (console_unlock+0x1f1)
     #func               -9844	  0.087  _raw_spin_unlock_irqrestore+0x0 (console_unlock+0x200)
     #func               -9844	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -9844	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9843	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9843	  0.177  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9843	  0.225  wake_up_klogd+0x0 (console_unlock+0x21c)
     #func               -9843	  0.088  ipipe_restore_root+0x0 (vprintk_emit+0x216)
     #func               -9843	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -9843	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -9843	  0.102  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -9843	  0.088  ipipe_unstall_root+0x0 (ipipe_restore_root+0x3d)
 |   #begin   0x80000000 -9842	  0.091  ipipe_unstall_root+0x1c (ipipe_restore_root+0x3d)
 |   #func               -9842	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -9842	  0.097  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -9842	  0.115  pci_dev_put+0x0 (rthal_smi_init+0xee)
     +func               -9842	  0.115  put_device+0x0 (pci_dev_put+0x1a)
     +func               -9842	  0.264  rthal_smi_disable+0x0 (xnpod_init+0x3cb [xeno_nucleus])
     +func               -9842	  0.100  xnshadow_grab_events+0x0 [xeno_nucleus] (xnpod_init+0x3d0 [xeno_nucleus])
     +func               -9842	  0.290  ipipe_catch_event+0x0 (xnshadow_grab_events+0x21 [xeno_nucleus])
     +func               -9841	  0.164  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -9841	  0.144  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -9841	  0.311  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -9841	  0.170  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -9840	  3.507  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -9837	  0.927  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -9836	  0.258  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func               -9836	  0.171  ipipe_catch_event+0x0 (xnshadow_grab_events+0x39 [xeno_nucleus])
     +func               -9836	  0.117  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -9835	  0.110  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -9835	  0.162  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -9835	  0.120  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -9835	  3.310  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -9832	  0.885  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -9831	  0.207  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func               -9831	  0.178  ipipe_catch_event+0x0 (xnshadow_grab_events+0x51 [xeno_nucleus])
     +func               -9830	  0.108  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -9830	  0.107  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -9830	  0.167  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -9830	  0.123  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -9830	1782.548  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -8047	  1.057  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -8046	 27.121  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
 |   +func               -8019	  0.201  __ipipe_handle_irq+0x0 (apic_timer_interrupt+0x7c)
 |   +func               -8019	  0.137  __ipipe_dispatch_irq+0x0 (__ipipe_handle_irq+0x8d)
 |   +func               -8019	  0.155  __ipipe_ack_apic+0x0 (__ipipe_dispatch_irq+0x357)
 |   +func               -8019	  0.152  __ipipe_set_irq_pending+0x0 (__ipipe_dispatch_irq+0x216)
 |   +func               -8019	  0.151  __ipipe_do_sync_pipeline+0x0 (__ipipe_dispatch_irq+0x2a1)
 |   +func               -8018	  0.127  __ipipe_do_sync_stage+0x0 (__ipipe_do_sync_pipeline+0x115)
 |   #end     0x80000000 -8018	  0.097  ipipe_trace_end+0x19 (__ipipe_do_sync_stage+0xe8)
     #func               -8018	  0.096  __ipipe_do_IRQ+0x0 (__ipipe_do_sync_stage+0x1c8)
     #func               -8018	  0.120  __ipipe_get_ioapic_irq_vector+0x0 (__ipipe_do_IRQ+0x24)
     #func               -8018	  0.090  smp_apic_timer_interrupt+0x0 (__ipipe_do_IRQ+0x79)
     #func               -8018	  0.088  irq_enter+0x0 (smp_apic_timer_interrupt+0x2a)
     #func               -8018	  0.135  rcu_irq_enter+0x0 (irq_enter+0x17)
     #func               -8018	  0.089  ipipe_restore_root+0x0 (rcu_irq_enter+0x92)
     #func               -8018	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8018	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8017	  0.147  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8017	  0.100  exit_idle+0x0 (smp_apic_timer_interrupt+0x2f)
     #func               -8017	  0.104  hrtimer_interrupt+0x0 (smp_apic_timer_interrupt+0x55)
     #func               -8017	  0.092  _raw_spin_lock+0x0 (hrtimer_interrupt+0x4f)
     #func               -8017	  0.123  ktime_get_update_offsets+0x0 (hrtimer_interrupt+0x81)
     #func               -8017	  0.110  __run_hrtimer+0x0 (hrtimer_interrupt+0xf7)
     #func               -8017	  0.150  __remove_hrtimer+0x0 (__run_hrtimer+0x67)
     #func               -8017	  0.089  tick_sched_timer+0x0 (__run_hrtimer+0x91)
     #func               -8016	  0.121  ktime_get+0x0 (tick_sched_timer+0x1f)
     #func               -8016	  0.128  _raw_spin_lock+0x0 (tick_sched_timer+0x8f)
     #func               -8016	  0.097  do_timer+0x0 (tick_sched_timer+0xcb)
     #func               -8016	  0.150  _raw_spin_lock_irqsave+0x0 (do_timer+0x2c)
     #func               -8016	  0.110  ntp_tick_length+0x0 (do_timer+0x83)
     #func               -8016	  0.148  ntp_tick_length+0x0 (do_timer+0x2ec)
     #func               -8016	  0.098  timekeeping_update.constprop.8+0x0 (do_timer+0x1e1)
     #func               -8016	  0.111  update_vsyscall+0x0 (timekeeping_update.constprop.8+0x1d)
     #func               -8015	  0.102  set_normalized_timespec+0x0 (update_vsyscall+0xf7)
     #func               -8015	  0.100  ipipe_update_hostrt+0x0 (update_vsyscall+0x12c)
     #func               -8015	  0.088  ipipe_root_only+0x0 (ipipe_update_hostrt+0x28)
 |   #begin   0x80000001 -8015	  0.117  ipipe_root_only+0xa3 (ipipe_update_hostrt+0x28)
 |   #end     0x80000001 -8015	  0.103  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8015	  0.095  __ipipe_notify_kevent+0x0 (ipipe_update_hostrt+0x7d)
     #func               -8015	  0.088  ipipe_root_only+0x0 (__ipipe_notify_kevent+0x21)
 |   #begin   0x80000001 -8015	  0.117  ipipe_root_only+0xa3 (__ipipe_notify_kevent+0x21)
 |   #end     0x80000001 -8015	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   #begin   0x80000001 -8015	  0.116  __ipipe_notify_kevent+0xcb (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001 -8014	  0.097  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xf2)
     #func               -8014	  0.096  ipipe_kevent_hook+0x0 (__ipipe_notify_kevent+0x81)
     #func               -8014	  0.091  hostrt_event+0x0 [xeno_nucleus] (ipipe_kevent_hook+0x1f)
     #func               -8014	  0.089  __ipipe_spin_lock_irqsave+0x0 (hostrt_event+0x19 [xeno_nucleus])
 |   #begin   0x80000001 -8014	  0.167  __ipipe_spin_lock_irqsave+0x93 (hostrt_event+0x19 [xeno_nucleus])
 |   #func               -8014	  0.098  __ipipe_spin_unlock_irqrestore+0x0 (hostrt_event+0x87 [xeno_nucleus])
 |   #end     0x80000001 -8014	  0.095  ipipe_trace_end+0x19 (__ipipe_spin_unlock_irqrestore+0x39)
 |   #begin   0x80000001 -8014	  0.103  __ipipe_notify_kevent+0xde (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001 -8014	  0.095  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xaa)
     #func               -8014	  0.087  raw_notifier_call_chain+0x0 (timekeeping_update.constprop.8+0x32)
     #func               -8013	  0.088  notifier_call_chain+0x0 (raw_notifier_call_chain+0x16)
     #func               -8013	  0.107  __ipipe_spin_unlock_debug+0x0 (do_timer+0x1f5)
     #func               -8013	  0.088  _raw_spin_unlock_irqrestore+0x0 (do_timer+0x204)
     #func               -8013	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -8013	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8013	  0.116  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8013	  0.108  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8013	  0.109  calc_global_load+0x0 (do_timer+0x20c)
     #func               -8013	  0.100  update_root_process_times+0x0 (tick_sched_timer+0x3f)
     #func               -8013	  0.089  update_process_times+0x0 (update_root_process_times+0x57)
     #func               -8012	  0.108  account_process_tick+0x0 (update_process_times+0x2d)
     #func               -8012	  0.191  account_system_time+0x0 (account_process_tick+0x3d)
     #func               -8012	  0.185  cpuacct_account_field+0x0 (account_system_time+0xc6)
     #func               -8012	  0.091  acct_account_cputime+0x0 (account_system_time+0xce)
     #func               -8012	  0.128  __acct_update_integrals+0x0 (acct_account_cputime+0x1c)
     #func               -8012	  0.135  jiffies_to_timeval+0x0 (__acct_update_integrals+0x73)
     #func               -8012	  0.084  ipipe_restore_root+0x0 (__acct_update_integrals+0x93)
     #func               -8012	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8011	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8011	  0.107  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8011	  0.092  hrtimer_run_queues+0x0 (update_process_times+0x32)
     #func               -8011	  0.109  raise_softirq+0x0 (update_process_times+0x3c)
     #func               -8011	  0.088  ipipe_restore_root+0x0 (raise_softirq+0xda)
     #func               -8011	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8011	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8011	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8011	  0.148  rcu_check_callbacks+0x0 (update_process_times+0x47)
     #func               -8010	  0.152  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func               -8010	  0.214  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func               -8010	  0.120  wake_up_klogd_work_func+0x0 (__irq_work_run+0x8f)
     #func               -8010	  0.089  __wake_up+0x0 (wake_up_klogd_work_func+0x42)
     #func               -8010	  0.121  _raw_spin_lock_irqsave+0x0 (__wake_up+0x23)
     #func               -8010	  0.092  __wake_up_common+0x0 (__wake_up+0x39)
     #func               -8010	  0.087  ipipe_root_only+0x0 (__wake_up_common+0x2f)
 |   #begin   0x80000001 -8010	  0.118  ipipe_root_only+0xa3 (__wake_up_common+0x2f)
 |   #end     0x80000001 -8009	  0.415  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8009	  0.101  autoremove_wake_function+0x0 (__wake_up_common+0x60)
     #func               -8009	  0.096  default_wake_function+0x0 (autoremove_wake_function+0x12)
     #func               -8009	  0.101  try_to_wake_up+0x0 (default_wake_function+0x12)
     #func               -8009	  0.355  _raw_spin_lock_irqsave+0x0 (try_to_wake_up+0x31)
     #func               -8008	  0.241  task_waking_fair+0x0 (try_to_wake_up+0xb4)
     #func               -8008	  0.172  select_task_rq_fair+0x0 (try_to_wake_up+0xc6)
     #func               -8008	  0.120  source_load+0x0 (select_task_rq_fair+0x3ba)
     #func               -8008	  0.274  target_load+0x0 (select_task_rq_fair+0x3c9)
     #func               -8008	  0.335  effective_load.isra.35+0x0 (select_task_rq_fair+0x442)
     #func               -8007	  0.164  effective_load.isra.35+0x0 (select_task_rq_fair+0x49a)
     #func               -8007	  0.148  idle_cpu+0x0 (select_task_rq_fair+0x50f)
     #func               -8007	  0.114  _raw_spin_lock+0x0 (try_to_wake_up+0x1c0)
     #func               -8007	  0.090  ttwu_do_activate.constprop.82+0x0 (try_to_wake_up+0x1cb)
     #func               -8007	  0.095  activate_task+0x0 (ttwu_do_activate.constprop.82+0x33)
     #func               -8007	  0.096  enqueue_task+0x0 (activate_task+0x23)
     #func               -8007	  0.180  update_rq_clock.part.71+0x0 (enqueue_task+0x6c)
     #func               -8006	  0.121  enqueue_task_fair+0x0 (enqueue_task+0x51)
     #func               -8006	  0.229  update_curr+0x0 (enqueue_task_fair+0x44c)
     #func               -8006	  0.121  __compute_runnable_contrib.part.48+0x0 (enqueue_task_fair+0xf61)
     #func               -8006	  0.140  update_cfs_rq_blocked_load+0x0 (enqueue_task_fair+0x28e)
     #func               -8006	  0.115  account_entity_enqueue+0x0 (enqueue_task_fair+0x299)
     #func               -8006	  0.124  update_cfs_shares+0x0 (enqueue_task_fair+0x2a1)
     #func               -8006	  0.222  place_entity+0x0 (enqueue_task_fair+0x2ae)
     #func               -8005	  0.134  __enqueue_entity+0x0 (enqueue_task_fair+0x3ec)
     #func               -8005	  0.208  update_curr+0x0 (enqueue_task_fair+0x44c)
     #func               -8005	  0.129  update_cfs_rq_blocked_load+0x0 (enqueue_task_fair+0x28e)
     #func               -8005	  0.124  account_entity_enqueue+0x0 (enqueue_task_fair+0x299)
     #func               -8005	  0.120  update_cfs_shares+0x0 (enqueue_task_fair+0x2a1)
     #func               -8005	  0.191  place_entity+0x0 (enqueue_task_fair+0x2ae)
     #func               -8004	  0.165  __enqueue_entity+0x0 (enqueue_task_fair+0x3ec)
     #func               -8004	  0.110  hrtick_update+0x0 (enqueue_task_fair+0xb1e)
     #func               -8004	  0.101  ttwu_do_wakeup+0x0 (ttwu_do_activate.constprop.82+0x5d)
     #func               -8004	  0.160  check_preempt_curr+0x0 (ttwu_do_wakeup+0x19)
     #func               -8004	  0.211  resched_task+0x0 (check_preempt_curr+0x75)
     #func               -8004	  0.122  native_smp_send_reschedule+0x0 (resched_task+0x64)
     #func               -8004	  0.098  flat_send_IPI_mask+0x0 (native_smp_send_reschedule+0x47)
 |   #begin   0x80000001 -8003	  0.142  flat_send_IPI_mask+0xef (native_smp_send_reschedule+0x47)
 |   #end     0x80000001 -8003	  0.112  ipipe_trace_end+0x19 (flat_send_IPI_mask+0xaf)
     #func               -8003	  0.102  ttwu_stat+0x0 (try_to_wake_up+0x1e8)
     #func               -8003	  0.087  __ipipe_spin_unlock_debug+0x0 (try_to_wake_up+0x1f0)
     #func               -8003	  0.088  _raw_spin_unlock_irqrestore+0x0 (try_to_wake_up+0x1fb)
     #func               -8003	  0.087  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -8003	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8003	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8003	  0.128  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8002	  0.088  __ipipe_spin_unlock_debug+0x0 (__wake_up+0x41)
     #func               -8002	  0.088  _raw_spin_unlock_irqrestore+0x0 (__wake_up+0x4c)
     #func               -8002	  0.087  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -8002	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -8002	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -8002	  0.125  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -8002	  0.125  scheduler_tick+0x0 (update_process_times+0x66)
     #func               -8002	  0.115  _raw_spin_lock+0x0 (scheduler_tick+0x44)
     #func               -8002	  0.164  update_rq_clock.part.71+0x0 (scheduler_tick+0x1b8)
     #func               -8001	  0.098  task_tick_fair+0x0 (scheduler_tick+0x151)
     #func               -8001	  0.183  update_curr+0x0 (task_tick_fair+0x2ab)
     #func               -8001	  0.117  update_min_vruntime+0x0 (update_curr+0x79)
     #func               -8001	  0.120  cpuacct_charge+0x0 (update_curr+0x9f)
     #func               -8001	  0.148  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func               -8001	  0.148  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func               -8001	  0.101  update_curr+0x0 (update_cfs_shares+0xd0)
     #func               -8001	  0.102  update_min_vruntime+0x0 (update_curr+0x79)
     #func               -8000	  0.122  account_entity_dequeue+0x0 (update_cfs_shares+0x87)
     #func               -8000	  0.104  account_entity_enqueue+0x0 (update_cfs_shares+0xa4)
     #func               -8000	  0.121  update_curr+0x0 (task_tick_fair+0x2ab)
     #func               -8000	  0.105  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func               -8000	  0.210  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func               -8000	  0.115  trigger_load_balance+0x0 (scheduler_tick+0x185)
     #func               -8000	  0.132  run_posix_cpu_timers+0x0 (update_process_times+0x6e)
     #func               -8000	  0.092  profile_tick+0x0 (tick_sched_timer+0x49)
     #func               -7999	  0.110  hrtimer_forward+0x0 (tick_sched_timer+0x5b)
     #func               -7999	  0.094  _raw_spin_lock+0x0 (__run_hrtimer+0xa1)
     #func               -7999	  0.164  enqueue_hrtimer+0x0 (__run_hrtimer+0xbc)
     #func               -7999	  0.089  tick_program_event+0x0 (hrtimer_interrupt+0x136)
     #func               -7999	  0.089  clockevents_program_event+0x0 (tick_program_event+0x24)
     #func               -7999	  0.107  ktime_get+0x0 (clockevents_program_event+0x39)
     #func               -7999	  0.176  lapic_next_deadline+0x0 (clockevents_program_event+0x6b)
     #func               -7999	  0.098  irq_exit+0x0 (smp_apic_timer_interrupt+0x5a)
     #func               -7999	  0.121  do_softirq+0x0 (irq_exit+0x7d)
     #func               -7998	  0.088  __do_softirq+0x0 (call_softirq+0x1e)
     #func               -7998	  0.090  msecs_to_jiffies+0x0 (__do_softirq+0x20)
     #func               -7998	  0.088  ipipe_unstall_root+0x0 (__do_softirq+0xa8)
 |   #begin   0x80000000 -7998	  0.091  ipipe_unstall_root+0x1c (__do_softirq+0xa8)
 |   #func               -7998	  0.141  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -7998	  0.120  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -7998	  0.098  run_timer_softirq+0x0 (__do_softirq+0xf7)
     +func               -7998	  0.110  hrtimer_run_pending+0x0 (run_timer_softirq+0x24)
     +func               -7998	  0.151  _raw_spin_lock_irq+0x0 (run_timer_softirq+0x3d)
     #func               -7997	  0.090  ipipe_unstall_root+0x0 (run_timer_softirq+0x18f)
 |   #begin   0x80000000 -7997	  0.091  ipipe_unstall_root+0x1c (run_timer_softirq+0x18f)
 |   #func               -7997	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -7997	  0.089  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -7997	  0.105  rcu_bh_qs+0x0 (__do_softirq+0x11b)
     #func               -7997	  0.110  __local_bh_enable+0x0 (__do_softirq+0x17a)
     #func               -7997	  0.088  ipipe_restore_root+0x0 (do_softirq+0x8a)
     #func               -7997	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -7997	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -7996	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -7996	  0.118  rcu_irq_exit+0x0 (irq_exit+0x50)
     #func               -7996	  0.088  ipipe_restore_root+0x0 (rcu_irq_exit+0x8a)
     #func               -7996	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -7996	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -7996	  0.355  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   +end     0x000000ef -7996	  0.381  ipipe_trace_end+0x19 (apic_timer_interrupt+0x8f)
     +func               -7995	  0.137  ipipe_catch_event+0x0 (xnshadow_grab_events+0x69 [xeno_nucleus])
     +func               -7995	  0.092  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -7995	  0.088  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -7995	  0.160  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -7995	  0.100  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -7995	  3.098  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -7992	  1.004  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -7991	  0.203  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func               -7990	  0.172  ipipe_catch_event+0x0 (xnshadow_grab_events+0x81 [xeno_nucleus])
     +func               -7990	  0.111  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -7990	  0.110  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -7990	  0.164  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -7990	  0.122  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -7990	  3.148  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -7987	  1.029  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -7985	  0.189  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func               -7985	  0.243  ipipe_catch_event+0x0 (xnshadow_grab_events+0x99 [xeno_nucleus])
     +func               -7985	  0.131  ipipe_set_hooks+0x0 (ipipe_catch_event+0xbb)
     +func               -7985	  0.110  ipipe_critical_enter+0x0 (ipipe_set_hooks+0x45)
 |   +begin   0x80000001 -7985	  0.172  ipipe_critical_enter+0x229 (ipipe_set_hooks+0x45)
 |   +func               -7985	  0.120  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -7985	3934.961  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -4050	  1.023  ipipe_critical_exit+0x0 (ipipe_set_hooks+0x14d)
 |   +end     0x80000001 -4049	 26.631  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
 |   +func               -4022	  0.205  __ipipe_handle_irq+0x0 (apic_timer_interrupt+0x7c)
 |   +func               -4022	  0.115  __ipipe_dispatch_irq+0x0 (__ipipe_handle_irq+0x8d)
 |   +func               -4022	  0.128  __ipipe_ack_apic+0x0 (__ipipe_dispatch_irq+0x357)
 |   +func               -4021	  0.143  __ipipe_set_irq_pending+0x0 (__ipipe_dispatch_irq+0x216)
 |   +func               -4021	  0.152  __ipipe_do_sync_pipeline+0x0 (__ipipe_dispatch_irq+0x2a1)
 |   +func               -4021	  0.129  __ipipe_do_sync_stage+0x0 (__ipipe_do_sync_pipeline+0x115)
 |   #end     0x80000000 -4021	  0.107  ipipe_trace_end+0x19 (__ipipe_do_sync_stage+0xe8)
     #func               -4021	  0.089  __ipipe_do_IRQ+0x0 (__ipipe_do_sync_stage+0x1c8)
     #func               -4021	  0.102  __ipipe_get_ioapic_irq_vector+0x0 (__ipipe_do_IRQ+0x24)
     #func               -4021	  0.097  smp_apic_timer_interrupt+0x0 (__ipipe_do_IRQ+0x79)
     #func               -4021	  0.097  irq_enter+0x0 (smp_apic_timer_interrupt+0x2a)
     #func               -4021	  0.127  rcu_irq_enter+0x0 (irq_enter+0x17)
     #func               -4020	  0.089  ipipe_restore_root+0x0 (rcu_irq_enter+0x92)
     #func               -4020	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4020	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4020	  0.123  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4020	  0.103  exit_idle+0x0 (smp_apic_timer_interrupt+0x2f)
     #func               -4020	  0.105  hrtimer_interrupt+0x0 (smp_apic_timer_interrupt+0x55)
     #func               -4020	  0.092  _raw_spin_lock+0x0 (hrtimer_interrupt+0x4f)
     #func               -4020	  0.117  ktime_get_update_offsets+0x0 (hrtimer_interrupt+0x81)
     #func               -4020	  0.100  __run_hrtimer+0x0 (hrtimer_interrupt+0xf7)
     #func               -4019	  0.107  __remove_hrtimer+0x0 (__run_hrtimer+0x67)
     #func               -4019	  0.088  tick_sched_timer+0x0 (__run_hrtimer+0x91)
     #func               -4019	  0.111  ktime_get+0x0 (tick_sched_timer+0x1f)
     #func               -4019	  0.092  _raw_spin_lock+0x0 (tick_sched_timer+0x8f)
     #func               -4019	  0.094  do_timer+0x0 (tick_sched_timer+0xcb)
     #func               -4019	  0.148  _raw_spin_lock_irqsave+0x0 (do_timer+0x2c)
     #func               -4019	  0.096  ntp_tick_length+0x0 (do_timer+0x83)
     #func               -4019	  0.141  ntp_tick_length+0x0 (do_timer+0x2ec)
     #func               -4019	  0.101  timekeeping_update.constprop.8+0x0 (do_timer+0x1e1)
     #func               -4018	  0.094  update_vsyscall+0x0 (timekeeping_update.constprop.8+0x1d)
     #func               -4018	  0.102  set_normalized_timespec+0x0 (update_vsyscall+0xf7)
     #func               -4018	  0.089  ipipe_update_hostrt+0x0 (update_vsyscall+0x12c)
     #func               -4018	  0.089  ipipe_root_only+0x0 (ipipe_update_hostrt+0x28)
 |   #begin   0x80000001 -4018	  0.120  ipipe_root_only+0xa3 (ipipe_update_hostrt+0x28)
 |   #end     0x80000001 -4018	  0.097  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4018	  0.095  __ipipe_notify_kevent+0x0 (ipipe_update_hostrt+0x7d)
     #func               -4018	  0.088  ipipe_root_only+0x0 (__ipipe_notify_kevent+0x21)
 |   #begin   0x80000001 -4018	  0.121  ipipe_root_only+0xa3 (__ipipe_notify_kevent+0x21)
 |   #end     0x80000001 -4018	  0.095  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   #begin   0x80000001 -4017	  0.103  __ipipe_notify_kevent+0xcb (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001 -4017	  0.092  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xf2)
     #func               -4017	  0.089  ipipe_kevent_hook+0x0 (__ipipe_notify_kevent+0x81)
     #func               -4017	  0.089  hostrt_event+0x0 [xeno_nucleus] (ipipe_kevent_hook+0x1f)
     #func               -4017	  0.089  __ipipe_spin_lock_irqsave+0x0 (hostrt_event+0x19 [xeno_nucleus])
 |   #begin   0x80000001 -4017	  0.137  __ipipe_spin_lock_irqsave+0x93 (hostrt_event+0x19 [xeno_nucleus])
 |   #func               -4017	  0.092  __ipipe_spin_unlock_irqrestore+0x0 (hostrt_event+0x87 [xeno_nucleus])
 |   #end     0x80000001 -4017	  0.097  ipipe_trace_end+0x19 (__ipipe_spin_unlock_irqrestore+0x39)
 |   #begin   0x80000001 -4017	  0.092  __ipipe_notify_kevent+0xde (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001 -4017	  0.097  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xaa)
     #func               -4017	  0.089  raw_notifier_call_chain+0x0 (timekeeping_update.constprop.8+0x32)
     #func               -4016	  0.090  notifier_call_chain+0x0 (raw_notifier_call_chain+0x16)
     #func               -4016	  0.087  __ipipe_spin_unlock_debug+0x0 (do_timer+0x1f5)
     #func               -4016	  0.088  _raw_spin_unlock_irqrestore+0x0 (do_timer+0x204)
     #func               -4016	  0.087  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func               -4016	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4016	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4016	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4016	  0.091  calc_global_load+0x0 (do_timer+0x20c)
     #func               -4016	  0.090  update_root_process_times+0x0 (tick_sched_timer+0x3f)
     #func               -4016	  0.087  update_process_times+0x0 (update_root_process_times+0x57)
     #func               -4015	  0.110  account_process_tick+0x0 (update_process_times+0x2d)
     #func               -4015	  0.109  account_system_time+0x0 (account_process_tick+0x3d)
     #func               -4015	  0.098  cpuacct_account_field+0x0 (account_system_time+0xc6)
     #func               -4015	  0.087  acct_account_cputime+0x0 (account_system_time+0xce)
     #func               -4015	  0.109  __acct_update_integrals+0x0 (acct_account_cputime+0x1c)
     #func               -4015	  0.097  jiffies_to_timeval+0x0 (__acct_update_integrals+0x73)
     #func               -4015	  0.087  ipipe_restore_root+0x0 (__acct_update_integrals+0x93)
     #func               -4015	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4015	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4015	  0.107  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4014	  0.089  hrtimer_run_queues+0x0 (update_process_times+0x32)
     #func               -4014	  0.109  raise_softirq+0x0 (update_process_times+0x3c)
     #func               -4014	  0.088  ipipe_restore_root+0x0 (raise_softirq+0xda)
     #func               -4014	  0.090  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4014	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4014	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4014	  0.130  rcu_check_callbacks+0x0 (update_process_times+0x47)
     #func               -4014	  0.149  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func               -4014	  0.117  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func               -4013	  0.101  scheduler_tick+0x0 (update_process_times+0x66)
     #func               -4013	  0.116  _raw_spin_lock+0x0 (scheduler_tick+0x44)
     #func               -4013	  0.149  update_rq_clock.part.71+0x0 (scheduler_tick+0x1b8)
     #func               -4013	  0.101  task_tick_fair+0x0 (scheduler_tick+0x151)
     #func               -4013	  0.101  update_curr+0x0 (task_tick_fair+0x2ab)
     #func               -4013	  0.088  update_min_vruntime+0x0 (update_curr+0x79)
     #func               -4013	  0.123  cpuacct_charge+0x0 (update_curr+0x9f)
     #func               -4013	  0.137  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func               -4013	  0.127  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func               -4012	  0.098  update_curr+0x0 (update_cfs_shares+0xd0)
     #func               -4012	  0.091  update_min_vruntime+0x0 (update_curr+0x79)
     #func               -4012	  0.095  account_entity_dequeue+0x0 (update_cfs_shares+0x87)
     #func               -4012	  0.094  account_entity_enqueue+0x0 (update_cfs_shares+0xa4)
     #func               -4012	  0.122  update_curr+0x0 (task_tick_fair+0x2ab)
     #func               -4012	  0.118  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func               -4012	  0.176  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func               -4012	  0.087  trigger_load_balance+0x0 (scheduler_tick+0x185)
     #func               -4012	  0.104  run_posix_cpu_timers+0x0 (update_process_times+0x6e)
     #func               -4011	  0.091  profile_tick+0x0 (tick_sched_timer+0x49)
     #func               -4011	  0.090  hrtimer_forward+0x0 (tick_sched_timer+0x5b)
     #func               -4011	  0.092  _raw_spin_lock+0x0 (__run_hrtimer+0xa1)
     #func               -4011	  0.120  enqueue_hrtimer+0x0 (__run_hrtimer+0xbc)
     #func               -4011	  0.090  tick_program_event+0x0 (hrtimer_interrupt+0x136)
     #func               -4011	  0.090  clockevents_program_event+0x0 (tick_program_event+0x24)
     #func               -4011	  0.096  ktime_get+0x0 (clockevents_program_event+0x39)
     #func               -4011	  0.161  lapic_next_deadline+0x0 (clockevents_program_event+0x6b)
     #func               -4011	  0.102  irq_exit+0x0 (smp_apic_timer_interrupt+0x5a)
     #func               -4010	  0.112  do_softirq+0x0 (irq_exit+0x7d)
     #func               -4010	  0.089  __do_softirq+0x0 (call_softirq+0x1e)
     #func               -4010	  0.092  msecs_to_jiffies+0x0 (__do_softirq+0x20)
     #func               -4010	  0.088  ipipe_unstall_root+0x0 (__do_softirq+0xa8)
 |   #begin   0x80000000 -4010	  0.092  ipipe_unstall_root+0x1c (__do_softirq+0xa8)
 |   #func               -4010	  0.142  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -4010	  0.090  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -4010	  0.096  run_timer_softirq+0x0 (__do_softirq+0xf7)
     +func               -4010	  0.096  hrtimer_run_pending+0x0 (run_timer_softirq+0x24)
     +func               -4010	  0.128  _raw_spin_lock_irq+0x0 (run_timer_softirq+0x3d)
     #func               -4009	  0.110  ipipe_unstall_root+0x0 (run_timer_softirq+0x18f)
 |   #begin   0x80000000 -4009	  0.091  ipipe_unstall_root+0x1c (run_timer_softirq+0x18f)
 |   #func               -4009	  0.142  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000 -4009	  0.090  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func               -4009	  0.102  rcu_bh_qs+0x0 (__do_softirq+0x11b)
     #func               -4009	  0.107  __local_bh_enable+0x0 (__do_softirq+0x17a)
     #func               -4009	  0.090  ipipe_restore_root+0x0 (do_softirq+0x8a)
     #func               -4009	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4009	  0.121  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4009	  0.104  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -4008	  0.121  rcu_irq_exit+0x0 (irq_exit+0x50)
     #func               -4008	  0.090  ipipe_restore_root+0x0 (rcu_irq_exit+0x8a)
     #func               -4008	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -4008	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -4008	  0.300  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   +end     0x000000ef -4008	  0.482  ipipe_trace_end+0x19 (apic_timer_interrupt+0x8f)
     +func               -4007	  0.107  xnpod_enable_timesource+0x0 [xeno_nucleus] (xnpod_init+0x3d5 [xeno_nucleus])
 |   +begin   0x80000000 -4007	  0.249  xnpod_enable_timesource+0x28 [xeno_nucleus] (xnpod_init+0x3d5 [xeno_nucleus])
 |  *+func               -4007	  0.715  xnintr_init+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x11b [xeno_nucleus])
 |  *+func               -4006	  0.112  __ipipe_restore_head+0x0 (xnpod_enable_timesource+0x16c [xeno_nucleus])
 |   +end     0x80000000 -4006	  0.088  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -4006	  0.102  do_gettimeofday+0x0 (xnpod_enable_timesource+0x175 [xeno_nucleus])
     +func               -4006	  0.087  getnstimeofday+0x0 (do_gettimeofday+0x1a)
     +func               -4006	  0.232  __getnstimeofday+0x0 (getnstimeofday+0xe)
     +func               -4006	  0.197  xnarch_tsc_to_ns+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x1a1 [xeno_nucleus])
     +func               -4005	  0.177  rthal_timer_request+0x0 (xnpod_enable_timesource+0x227 [xeno_nucleus])
     +func               -4005	  0.194  ipipe_timer_start+0x0 (rthal_timer_request+0x19)
     +func               -4005	  0.092  ipipe_critical_enter+0x0 (ipipe_timer_start+0x44)
 |   +begin   0x80000001 -4005	  0.142  ipipe_critical_enter+0x229 (ipipe_timer_start+0x44)
 |   +func               -4005	  0.101  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -4005	  3.056  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func               -4002	  0.120  ipipe_request_irq+0x0 (ipipe_timer_start+0x79)
 |   +func               -4001	  0.195  __ipipe_spin_lock_irqsave+0x0 (ipipe_request_irq+0x48)
 |   #func               -4001	  0.192  __ipipe_spin_unlock_irqrestore+0x0 (ipipe_request_irq+0x6e)
 |   +func               -4001	  1.108  ipipe_critical_exit+0x0 (ipipe_timer_start+0x9a)
 |   +end     0x80000001 -4000	  0.192  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func               -4000	  0.418  irq_to_desc+0x0 (ipipe_timer_start+0xa1)
     +func               -3999	  0.130  rthal_irq_request+0x0 (rthal_timer_request+0x97)
     +func               -3999	  0.123  ipipe_virtualize_irq+0x0 (rthal_irq_request+0x30)
     +func               -3999	  0.107  ipipe_request_irq+0x0 (ipipe_virtualize_irq+0x13)
     +func               -3999	  0.128  __ipipe_spin_lock_irqsave+0x0 (ipipe_request_irq+0x48)
 |   +begin   0x80000001 -3999	  0.224  __ipipe_spin_lock_irqsave+0x93 (ipipe_request_irq+0x48)
 |   #func               -3999	  0.131  __ipipe_spin_unlock_irqrestore+0x0 (ipipe_request_irq+0x6e)
 |   +end     0x80000001 -3998	  0.137  ipipe_trace_end+0x19 (__ipipe_spin_unlock_irqrestore+0x39)
 |   +begin   0x80000000 -3998	  0.263  xnpod_enable_timesource+0x244 [xeno_nucleus] (xnpod_init+0x3d5 [xeno_nucleus])
 |  *+func               -3998	  0.209  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x395 [xeno_nucleus])
 |  *+func               -3998	  0.240  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func               -3998	  0.201  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *+event   tick@-3997 -3997	  0.194  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *+func               -3997	  0.162  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
 |  *+func               -3997	  0.124  ipipe_raise_irq+0x0 (ipipe_timer_set+0x7f)
 |  *+func               -3997	  0.164  __ipipe_handle_irq+0x0 (ipipe_raise_irq+0x3b)
 |  *+func               -3997	  0.176  __ipipe_dispatch_irq+0x0 (__ipipe_handle_irq+0x8d)
 |  *+func               -3997	  0.262  __ipipe_set_irq_pending+0x0 (__ipipe_dispatch_irq+0x564)
 |  *+func               -3996	  0.150  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x329 [xeno_nucleus])
 |  *+func               -3996	  0.130  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func               -3996	  0.234  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x190 [xeno_nucleus])
 |  *+func               -3996	  0.177  __ipipe_restore_head+0x0 (xnpod_enable_timesource+0x378 [xeno_nucleus])
 |   +func               -3996	  0.200  __ipipe_do_sync_pipeline+0x0 (__ipipe_sync_pipeline+0x38)
 |  + func               -3995	  0.298  __ipipe_do_sync_stage+0x0 (__ipipe_do_sync_pipeline+0x97)
 |  # func               -3995	  0.344  xnintr_clock_handler+0x0 [xeno_nucleus] (__ipipe_do_sync_stage+0x103)
 |  # func               -3995	  0.227  xntimer_tick_aperiodic+0x0 [xeno_nucleus] (xnintr_clock_handler+0x142 [xeno_nucleus])
 |  # func               -3995	  0.131  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # event   tick@996052-3994	  0.108  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # func               -3994	  0.116  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
 |  # func               -3994	  0.524  lapic_next_deadline+0x0 (ipipe_timer_set+0x6a)
 |  # func               -3994	  0.208  __xnpod_schedule+0x0 [xeno_nucleus] (xnintr_clock_handler+0x305 [xeno_nucleus])
 |  # [ 2027] -<?>-   -1 -3993	  0.181  __xnpod_schedule+0x168 [xeno_nucleus] (xnintr_clock_handler+0x305 [xeno_nucleus])
 |  # func               -3993	  0.098  ipipe_send_ipi+0x0 (__xnpod_schedule+0xab5 [xeno_nucleus])
 |  # func               -3993	  0.204  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |  # func               -3993	  0.316  xnsched_pick_next+0x0 [xeno_nucleus] (__xnpod_schedule+0x2e5 [xeno_nucleus])
 |  # func               -3993	  0.137  xnintr_host_tick+0x0 [xeno_nucleus] (__xnpod_schedule+0x937 [xeno_nucleus])
 |  # func               -3993	  0.307  __ipipe_set_irq_pending+0x0 (xnintr_host_tick+0x3a [xeno_nucleus])
 |   +func               -3992	  0.178  __ipipe_do_sync_stage+0x0 (__ipipe_do_sync_pipeline+0x115)
 |   #end     0x80000000 -3992	  0.121  ipipe_trace_end+0x19 (__ipipe_do_sync_stage+0xe8)
     #func               -3992	  0.091  __ipipe_do_IRQ+0x0 (__ipipe_do_sync_stage+0x1c8)
     #func               -3992	  0.098  __ipipe_get_ioapic_irq_vector+0x0 (__ipipe_do_IRQ+0x24)
     #func               -3992	  0.108  smp_apic_timer_interrupt+0x0 (__ipipe_do_IRQ+0x79)
     #func               -3992	  0.114  irq_enter+0x0 (smp_apic_timer_interrupt+0x2a)
     #func               -3992	  0.157  rcu_irq_enter+0x0 (irq_enter+0x17)
     #func               -3991	  0.097  ipipe_restore_root+0x0 (rcu_irq_enter+0x92)
     #func               -3991	  0.109  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -3991	  0.152  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -3991	  0.136  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func               -3991	  0.098  exit_idle+0x0 (smp_apic_timer_interrupt+0x2f)
     #func               -3991	  0.107  hrtimer_interrupt+0x0 (smp_apic_timer_interrupt+0x55)
     #func               -3991	  0.122  _raw_spin_lock+0x0 (hrtimer_interrupt+0x4f)
     #func               -3991	  0.169  ktime_get_update_offsets+0x0 (hrtimer_interrupt+0x81)
     #func               -3990	  0.107  tick_program_event+0x0 (hrtimer_interrupt+0x136)
     #func               -3990	  0.096  clockevents_program_event+0x0 (tick_program_event+0x24)
     #func               -3990	  0.216  ktime_get+0x0 (clockevents_program_event+0x39)
     #func               -3990	  0.123  xnarch_next_htick_shot+0x0 [xeno_nucleus] (clockevents_program_event+0x6b)
 |   #begin   0x80000000 -3990	  0.218  xnarch_next_htick_shot+0x2b [xeno_nucleus] (clockevents_program_event+0x6b)
 |  *#func               -3990	  1.936  __xnlock_spin+0x0 [xeno_nucleus] (xnarch_next_htick_shot+0x28d [xeno_nucleus])
 |  *#func               -3988	  0.120  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnarch_next_htick_shot+0xff [xeno_nucleus])
 |  *#func               -3988	  0.160  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *#func               -3987	  0.144  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *#event   tick@-56   -3987	  0.103  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *#func               -3987	  0.127  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
 |  *#func               -3987	  0.315  lapic_next_deadline+0x0 (ipipe_timer_set+0x6a)
 |  *#func               -3987	  0.131  __ipipe_restore_head+0x0 (xnarch_next_htick_shot+0x142 [xeno_nucleus])
 |   #end     0x80000000 -3987	  0.158  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     #func               -3986	  0.162  irq_exit+0x0 (smp_apic_timer_interrupt+0x5a)
     #func               -3986	  0.149  rcu_irq_exit+0x0 (irq_exit+0x50)
     #func               -3986	  0.116  ipipe_restore_root+0x0 (rcu_irq_exit+0x8a)
     #func               -3986	  0.109  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001 -3986	  0.143  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001 -3986	  0.258  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   +end     0x80000000 -3985	  0.188  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func               -3985	  0.118  rthal_timer_request+0x0 (xnpod_enable_timesource+0x227 [xeno_nucleus])
     +func               -3985	  0.242  ipipe_timer_start+0x0 (rthal_timer_request+0x19)
     +func               -3985	  0.109  ipipe_critical_enter+0x0 (ipipe_timer_start+0x44)
 |   +begin   0x80000001 -3985	  0.164  ipipe_critical_enter+0x229 (ipipe_timer_start+0x44)
 |   +func               -3985	  0.121  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func               -3985	3935.843  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func                 -49	  0.982  ipipe_critical_exit+0x0 (ipipe_timer_start+0x9a)
 |   +end     0x80000001   -48	 22.585  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
 |   +func                 -25	  0.211  __ipipe_handle_irq+0x0 (apic_timer_interrupt+0x7c)
 |   +func                 -25	  0.128  __ipipe_dispatch_irq+0x0 (__ipipe_handle_irq+0x8d)
 |   +func                 -25	  0.136  __ipipe_ack_hrtimer_irq+0x0 (__ipipe_dispatch_irq+0x357)
 |   +func                 -25	  0.209  lapic_itimer_ack+0x0 (__ipipe_ack_hrtimer_irq+0x4b)
 |  # func                 -24	  0.220  xnintr_clock_handler+0x0 [xeno_nucleus] (__ipipe_dispatch_irq+0x14c)
 |  # func                 -24	  0.120  xntimer_tick_aperiodic+0x0 [xeno_nucleus] (xnintr_clock_handler+0x142 [xeno_nucleus])
 |  # func                 -24	  0.104  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # event   tick@996052  -24	  0.088  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # func                 -24	  0.107  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
 |  # func                 -24	  0.282  lapic_next_deadline+0x0 (ipipe_timer_set+0x6a)
 |  # func                 -24	  0.183  __ipipe_set_irq_pending+0x0 (xnintr_clock_handler+0x274 [xeno_nucleus])
 |   +func                 -23	  0.154  __ipipe_do_sync_pipeline+0x0 (__ipipe_dispatch_irq+0x1f2)
 |   +func                 -23	  0.129  __ipipe_do_sync_stage+0x0 (__ipipe_do_sync_pipeline+0x115)
 |   #end     0x80000000   -23	  0.104  ipipe_trace_end+0x19 (__ipipe_do_sync_stage+0xe8)
     #func                 -23	  0.088  __ipipe_do_IRQ+0x0 (__ipipe_do_sync_stage+0x1c8)
     #func                 -23	  0.091  __ipipe_get_ioapic_irq_vector+0x0 (__ipipe_do_IRQ+0x24)
     #func                 -23	  0.085  smp_apic_timer_interrupt+0x0 (__ipipe_do_IRQ+0x79)
     #func                 -23	  0.087  irq_enter+0x0 (smp_apic_timer_interrupt+0x2a)
     #func                 -23	  0.109  rcu_irq_enter+0x0 (irq_enter+0x17)
     #func                 -22	  0.087  ipipe_restore_root+0x0 (rcu_irq_enter+0x92)
     #func                 -22	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001   -22	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001   -22	  0.129  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                 -22	  0.096  exit_idle+0x0 (smp_apic_timer_interrupt+0x2f)
     #func                 -22	  0.089  hrtimer_interrupt+0x0 (smp_apic_timer_interrupt+0x55)
     #func                 -22	  0.092  _raw_spin_lock+0x0 (hrtimer_interrupt+0x4f)
     #func                 -22	  0.144  ktime_get_update_offsets+0x0 (hrtimer_interrupt+0x81)
     #func                 -22	  0.101  __run_hrtimer+0x0 (hrtimer_interrupt+0xf7)
     #func                 -22	  0.117  __remove_hrtimer+0x0 (__run_hrtimer+0x67)
     #func                 -21	  0.091  tick_sched_timer+0x0 (__run_hrtimer+0x91)
     #func                 -21	  0.117  ktime_get+0x0 (tick_sched_timer+0x1f)
     #func                 -21	  0.094  _raw_spin_lock+0x0 (tick_sched_timer+0x8f)
     #func                 -21	  0.101  do_timer+0x0 (tick_sched_timer+0xcb)
     #func                 -21	  0.144  _raw_spin_lock_irqsave+0x0 (do_timer+0x2c)
     #func                 -21	  0.095  ntp_tick_length+0x0 (do_timer+0x83)
     #func                 -21	  0.122  ntp_tick_length+0x0 (do_timer+0x2ec)
     #func                 -21	  0.090  timekeeping_update.constprop.8+0x0 (do_timer+0x1e1)
     #func                 -21	  0.092  update_vsyscall+0x0 (timekeeping_update.constprop.8+0x1d)
     #func                 -20	  0.103  set_normalized_timespec+0x0 (update_vsyscall+0xf7)
     #func                 -20	  0.089  ipipe_update_hostrt+0x0 (update_vsyscall+0x12c)
     #func                 -20	  0.087  ipipe_root_only+0x0 (ipipe_update_hostrt+0x28)
 |   #begin   0x80000001   -20	  0.120  ipipe_root_only+0xa3 (ipipe_update_hostrt+0x28)
 |   #end     0x80000001   -20	  0.096  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                 -20	  0.094  __ipipe_notify_kevent+0x0 (ipipe_update_hostrt+0x7d)
     #func                 -20	  0.087  ipipe_root_only+0x0 (__ipipe_notify_kevent+0x21)
 |   #begin   0x80000001   -20	  0.118  ipipe_root_only+0xa3 (__ipipe_notify_kevent+0x21)
 |   #end     0x80000001   -20	  0.097  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   #begin   0x80000001   -20	  0.103  __ipipe_notify_kevent+0xcb (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001   -19	  0.092  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xf2)
     #func                 -19	  0.107  ipipe_kevent_hook+0x0 (__ipipe_notify_kevent+0x81)
     #func                 -19	  0.084  hostrt_event+0x0 [xeno_nucleus] (ipipe_kevent_hook+0x1f)
     #func                 -19	  0.088  __ipipe_spin_lock_irqsave+0x0 (hostrt_event+0x19 [xeno_nucleus])
 |   #begin   0x80000001   -19	  0.147  __ipipe_spin_lock_irqsave+0x93 (hostrt_event+0x19 [xeno_nucleus])
 |   #func                 -19	  0.094  __ipipe_spin_unlock_irqrestore+0x0 (hostrt_event+0x87 [xeno_nucleus])
 |   #end     0x80000001   -19	  0.095  ipipe_trace_end+0x19 (__ipipe_spin_unlock_irqrestore+0x39)
 |   #begin   0x80000001   -19	  0.090  __ipipe_notify_kevent+0xde (ipipe_update_hostrt+0x7d)
 |   #end     0x80000001   -19	  0.095  ipipe_trace_end+0x19 (__ipipe_notify_kevent+0xaa)
     #func                 -19	  0.088  raw_notifier_call_chain+0x0 (timekeeping_update.constprop.8+0x32)
     #func                 -18	  0.089  notifier_call_chain+0x0 (raw_notifier_call_chain+0x16)
     #func                 -18	  0.085  __ipipe_spin_unlock_debug+0x0 (do_timer+0x1f5)
     #func                 -18	  0.087  _raw_spin_unlock_irqrestore+0x0 (do_timer+0x204)
     #func                 -18	  0.085  ipipe_restore_root+0x0 (_raw_spin_unlock_irqrestore+0x2a)
     #func                 -18	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001   -18	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001   -18	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                 -18	  0.092  calc_global_load+0x0 (do_timer+0x20c)
     #func                 -18	  0.087  update_root_process_times+0x0 (tick_sched_timer+0x3f)
     #func                 -18	  0.087  update_process_times+0x0 (update_root_process_times+0x57)
     #func                 -18	  0.101  account_process_tick+0x0 (update_process_times+0x2d)
     #func                 -17	  0.102  account_system_time+0x0 (account_process_tick+0x3d)
     #func                 -17	  0.100  cpuacct_account_field+0x0 (account_system_time+0xc6)
     #func                 -17	  0.084  acct_account_cputime+0x0 (account_system_time+0xce)
     #func                 -17	  0.109  __acct_update_integrals+0x0 (acct_account_cputime+0x1c)
     #func                 -17	  0.100  jiffies_to_timeval+0x0 (__acct_update_integrals+0x73)
     #func                 -17	  0.085  ipipe_restore_root+0x0 (__acct_update_integrals+0x93)
     #func                 -17	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001   -17	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001   -17	  0.107  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                 -17	  0.094  hrtimer_run_queues+0x0 (update_process_times+0x32)
     #func                 -16	  0.109  raise_softirq+0x0 (update_process_times+0x3c)
     #func                 -16	  0.087  ipipe_restore_root+0x0 (raise_softirq+0xda)
     #func                 -16	  0.087  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001   -16	  0.120  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001   -16	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                 -16	  0.111  rcu_check_callbacks+0x0 (update_process_times+0x47)
     #func                 -16	  0.125  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func                 -16	  0.100  cpu_needs_another_gp+0x0 (rcu_check_callbacks+0x21b)
     #func                 -16	  0.088  scheduler_tick+0x0 (update_process_times+0x66)
     #func                 -16	  0.117  _raw_spin_lock+0x0 (scheduler_tick+0x44)
     #func                 -15	  0.161  update_rq_clock.part.71+0x0 (scheduler_tick+0x1b8)
     #func                 -15	  0.090  task_tick_fair+0x0 (scheduler_tick+0x151)
     #func                 -15	  0.136  update_curr+0x0 (task_tick_fair+0x2ab)
     #func                 -15	  0.089  update_min_vruntime+0x0 (update_curr+0x79)
     #func                 -15	  0.121  cpuacct_charge+0x0 (update_curr+0x9f)
     #func                 -15	  0.127  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func                 -15	  0.111  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func                 -15	  0.098  update_curr+0x0 (update_cfs_shares+0xd0)
     #func                 -14	  0.088  update_min_vruntime+0x0 (update_curr+0x79)
     #func                 -14	  0.094  account_entity_dequeue+0x0 (update_cfs_shares+0x87)
     #func                 -14	  0.097  account_entity_enqueue+0x0 (update_cfs_shares+0xa4)
     #func                 -14	  0.120  update_curr+0x0 (task_tick_fair+0x2ab)
     #func                 -14	  0.125  update_cfs_rq_blocked_load+0x0 (task_tick_fair+0x212)
     #func                 -14	  0.187  update_cfs_shares+0x0 (task_tick_fair+0x21a)
     #func                 -14	  0.087  trigger_load_balance+0x0 (scheduler_tick+0x185)
     #func                 -14	  0.110  run_posix_cpu_timers+0x0 (update_process_times+0x6e)
     #func                 -14	  0.097  profile_tick+0x0 (tick_sched_timer+0x49)
     #func                 -13	  0.089  hrtimer_forward+0x0 (tick_sched_timer+0x5b)
     #func                 -13	  0.092  _raw_spin_lock+0x0 (__run_hrtimer+0xa1)
     #func                 -13	  0.104  enqueue_hrtimer+0x0 (__run_hrtimer+0xbc)
     #func                 -13	  0.087  tick_program_event+0x0 (hrtimer_interrupt+0x136)
     #func                 -13	  0.088  clockevents_program_event+0x0 (tick_program_event+0x24)
     #func                 -13	  0.108  ktime_get+0x0 (clockevents_program_event+0x39)
     #func                 -13	  0.095  xnarch_next_htick_shot+0x0 [xeno_nucleus] (clockevents_program_event+0x6b)
 |   #begin   0x80000000   -13	  0.131  xnarch_next_htick_shot+0x2b [xeno_nucleus] (clockevents_program_event+0x6b)
 |  *#func                 -13	  0.100  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnarch_next_htick_shot+0xff [xeno_nucleus])
 |  *#func                 -13	  0.140  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *#func                 -12	  0.111  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *#event   tick@3939    -12	  0.109  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_start_aperiodic+0x225 [xeno_nucleus])
 |  *#func                 -12	  0.098  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
 |  *#func                 -12	  0.198  lapic_next_deadline+0x0 (ipipe_timer_set+0x6a)
 |  *#func                 -12	  0.101  __ipipe_restore_head+0x0 (xnarch_next_htick_shot+0x142 [xeno_nucleus])
 |   #end     0x80000000   -12	  0.102  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     #func                 -12	  0.118  irq_exit+0x0 (smp_apic_timer_interrupt+0x5a)
     #func                 -12	  0.134  do_softirq+0x0 (irq_exit+0x7d)
     #func                 -11	  0.097  __do_softirq+0x0 (call_softirq+0x1e)
     #func                 -11	  0.100  msecs_to_jiffies+0x0 (__do_softirq+0x20)
     #func                 -11	  0.095  ipipe_unstall_root+0x0 (__do_softirq+0xa8)
 |   #begin   0x80000000   -11	  0.091  ipipe_unstall_root+0x1c (__do_softirq+0xa8)
 |   #func                 -11	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000   -11	  0.097  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func                 -11	  0.089  run_timer_softirq+0x0 (__do_softirq+0xf7)
     +func                 -11	  0.096  hrtimer_run_pending+0x0 (run_timer_softirq+0x24)
     +func                 -11	  0.136  _raw_spin_lock_irq+0x0 (run_timer_softirq+0x3d)
     #func                 -11	  0.089  ipipe_unstall_root+0x0 (run_timer_softirq+0x18f)
 |   #begin   0x80000000   -10	  0.091  ipipe_unstall_root+0x1c (run_timer_softirq+0x18f)
 |   #func                 -10	  0.140  ipipe_root_only+0x0 (ipipe_unstall_root+0x21)
 |   +end     0x80000000   -10	  0.090  ipipe_trace_end+0x19 (ipipe_unstall_root+0x59)
     +func                 -10	  0.101  rcu_bh_qs+0x0 (__do_softirq+0x11b)
     #func                 -10	  0.105  __local_bh_enable+0x0 (__do_softirq+0x17a)
     #func                 -10	  0.090  ipipe_restore_root+0x0 (do_softirq+0x8a)
     #func                 -10	  0.089  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001   -10	  0.117  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001   -10	  0.105  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
     #func                  -9	  0.107  rcu_irq_exit+0x0 (irq_exit+0x50)
     #func                  -9	  0.087  ipipe_restore_root+0x0 (rcu_irq_exit+0x8a)
     #func                  -9	  0.088  ipipe_root_only+0x0 (ipipe_restore_root+0x12)
 |   #begin   0x80000001    -9	  0.118  ipipe_root_only+0xa3 (ipipe_restore_root+0x12)
 |   #end     0x80000001    -9	  0.291  ipipe_trace_end+0x19 (ipipe_root_only+0x86)
 |   +end     0x000000ef    -9	  0.210  ipipe_trace_end+0x19 (apic_timer_interrupt+0x8f)
     +func                  -9	  0.151  irq_to_desc+0x0 (ipipe_timer_start+0xa1)
 |   +begin   0x80000000    -8	  0.142  xnpod_enable_timesource+0x244 [xeno_nucleus] (xnpod_init+0x3d5 [xeno_nucleus])
 |  *+func                  -8	  0.108  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x395 [xeno_nucleus])
 |  *+func                  -8	  0.177  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func                  -8	  0.107  ipipe_send_ipi+0x0 (xntimer_start_aperiodic+0x211 [xeno_nucleus])
 |  *+func                  -8	  0.155  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |  *+func                  -8	  0.107  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x329 [xeno_nucleus])
 |  *+func                  -8	  0.105  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func                  -8	  0.181  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x190 [xeno_nucleus])
 |  *+func                  -7	  0.104  __ipipe_restore_head+0x0 (xnpod_enable_timesource+0x378 [xeno_nucleus])
 |   +end     0x80000000    -7	  0.105  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func                  -7	  0.098  rthal_timer_request+0x0 (xnpod_enable_timesource+0x227 [xeno_nucleus])
     +func                  -7	  0.196  ipipe_timer_start+0x0 (rthal_timer_request+0x19)
     +func                  -7	  0.091  ipipe_critical_enter+0x0 (ipipe_timer_start+0x44)
 |   +begin   0x80000001    -7	  0.143  ipipe_critical_enter+0x229 (ipipe_timer_start+0x44)
 |   +func                  -7	  0.097  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
 |   +func                  -7	  3.857  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |   +func                  -3	  0.998  ipipe_critical_exit+0x0 (ipipe_timer_start+0x9a)
 |   +end     0x80000001    -2	  0.116  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
     +func                  -2	  0.122  irq_to_desc+0x0 (ipipe_timer_start+0xa1)
 |   +begin   0x80000000    -1	  0.208  xnpod_enable_timesource+0x244 [xeno_nucleus] (xnpod_init+0x3d5 [xeno_nucleus])
 |  *+func                  -1	  0.122  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x395 [xeno_nucleus])
 |  *+func                  -1	  0.171  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func                  -1	  0.123  ipipe_send_ipi+0x0 (xntimer_start_aperiodic+0x211 [xeno_nucleus])
 |  *+func                  -1	  0.168  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
 |  *+func                  -1	  0.116  xntimer_start_aperiodic+0x0 [xeno_nucleus] (xnpod_enable_timesource+0x329 [xeno_nucleus])
 |  *+func                   0	  0.118  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x179 [xeno_nucleus])
 |  *+func                   0	  0.201  xnarch_ns_to_tsc+0x0 [xeno_nucleus] (xntimer_start_aperiodic+0x190 [xeno_nucleus])
 |  *+func                   0	  0.130  __ipipe_restore_head+0x0 (xnpod_enable_timesource+0x378 [xeno_nucleus])
 |   +end     0x80000000     0	  0.130  ipipe_trace_end+0x19 (__ipipe_restore_head+0x6c)
     +func                   0	  0.109  rthal_timer_request+0x0 (xnpod_enable_timesource+0x227 [xeno_nucleus])
     +func                   0	  0.191  ipipe_timer_start+0x0 (rthal_timer_request+0x19)
     +func                   0	  0.115  ipipe_critical_enter+0x0 (ipipe_timer_start+0x44)
>|   +begin   0x80000001     0	  0.160  ipipe_critical_enter+0x229 (ipipe_timer_start+0x44)
:|   +func                   0	  0.127  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
:|   +func                   0! 19678406.428  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
:|   +func               19678406	  0.824  ipipe_send_ipi+0x0 (ipipe_critical_enter+0x1da)
:|   +func               19678407+   3.476  flat_send_IPI_mask+0x0 (ipipe_send_ipi+0x56)
:|   +func               19678411+   1.008  ipipe_critical_exit+0x0 (ipipe_timer_start+0x9a)
<|   +end     0x80000001 19678412	  0.490  ipipe_trace_end+0x19 (ipipe_critical_exit+0x8a)
 |   +begin   0x000000ef 19678412	  0.000  apic_timer_interrupt+0x6d (ipipe_critical_exit+0x76)


* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 13:45   ` Jeroen Van den Keybus
@ 2014-04-23 14:07     ` Gilles Chanteperdrix
  2014-04-23 20:54       ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-23 14:07 UTC (permalink / raw)
  To: Jeroen Van den Keybus; +Cc: xenomai

On 04/23/2014 03:45 PM, Jeroen Van den Keybus wrote:
> I've attached an I-trace from what happens when 'modprobe xeno_native'
> stalls. I could use some hints as to where to start looking into this
> issue. Right now, I would say I should have a look at which code paths
> are traversed, and which are not, with CONFIG_IPIPE_TRACE unset and
> set, respectively.

Do you get the same behaviour if you only enable the tracer after 
loading the xeno_native module?


-- 
					    Gilles.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 14:07     ` Gilles Chanteperdrix
@ 2014-04-23 20:54       ` Jeroen Van den Keybus
  2014-04-23 20:56         ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-23 20:54 UTC (permalink / raw)
  To: Gilles Chanteperdrix, xenomai

> Do you get the same behaviour if you only enable the tracer after loading
> the xeno_native module?

Yes, but I just found out that the intel_idle driver is causing the
stalls. Limiting it to the C1 state (using intel_idle.max_cstate=1)
apparently isn't enough, so other nasty things must be happening
in this driver. After turning it off entirely, the random stalling
stopped. Good.
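
(For reference, and assuming the usual kernel parameters:
intel_idle.max_cstate=1 only caps the deepest C-state while leaving
the intel_idle driver in charge, whereas intel_idle.max_cstate=0
disables the driver altogether, typically falling back to acpi_idle.)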

Apart from that, I somehow managed to stop the high latencies from
occurring. I saved the .config and double-checked (by toggling the
option off and on again) that turning _on_ 'Run-time PM core
functionality' was what had solved the problem.

Surprisingly, the latencies are high again. And even with the stored
.config, I have not been able to recreate the situation where the
tracer is off yet latencies stay normal when reading xenomai/stat.


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 20:54       ` Jeroen Van den Keybus
@ 2014-04-23 20:56         ` Gilles Chanteperdrix
  2014-04-23 21:39           ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-23 20:56 UTC (permalink / raw)
  To: Jeroen Van den Keybus; +Cc: xenomai

On 04/23/2014 10:54 PM, Jeroen Van den Keybus wrote:
> Surprisingly, the latencies are high again. But even with the stored
> .config I have not been able to recreate the situation with tracer off
> and normal latencies when reading xenomai/stat.

You mean with tracer on? Then you have a trace for the high latency?


-- 
                                                                Gilles.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 20:56         ` Gilles Chanteperdrix
@ 2014-04-23 21:39           ` Jeroen Van den Keybus
  2014-04-23 22:25             ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-23 21:39 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

> You mean with tracer on? Then you have a trace for the high latency?

No, unfortunately. Every time I enable the tracer, all is well, no latencies.

Since this is so elusive, I started installing all the kernels I built
(keeping their configs alongside). I just found that one of them does
not show the high latencies, and it does not have the I-pipe debugging
enabled either. Hooray.

However, when I power down the machine and reboot this very kernel, the
problem reappears.


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 21:39           ` Jeroen Van den Keybus
@ 2014-04-23 22:25             ` Gilles Chanteperdrix
  2014-04-24  8:57               ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-04-23 22:25 UTC (permalink / raw)
  To: Jeroen Van den Keybus; +Cc: xenomai

On 04/23/2014 11:39 PM, Jeroen Van den Keybus wrote:
>> You mean with tracer on? Then you have a trace for the high latency?
> 
> No, unfortunately. Every time I enable the tracer, all is well, no latencies.
> 
> Since this is so elusive, I started to install all kernels I built
> (and keep the configs along). I just found that one of them does not
> have the high latencies and does not have the I-pipe debugging on
> either. Hooray.
> 
> However, when I powerdown the machine and reboot this very kernel, the
> problem reappears.

Could you put a printk in the function vfile_stat_rewind to see if it
gets called (more than once) when the problem happens?


-- 
                                                                Gilles.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-23 22:25             ` Gilles Chanteperdrix
@ 2014-04-24  8:57               ` Jeroen Van den Keybus
  2014-04-24 14:46                 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-24  8:57 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

> Could you put a printk in the function vfile_stat_rewind to see if it
> gets called (more than once) when the problem happens?

I patched as follows:

static int vfile_stat_rewind(struct xnvfile_snapshot_iterator *it)
{
    struct vfile_stat_priv *priv = xnvfile_iterator_priv(it);
    int irqnr;
    int ret, irq;

    /*
     * The activity numbers on each valid interrupt descriptor are
     * grouped under a pseudo-thread.
     */
    priv->curr = getheadq(&nkpod->threadq);
    irq = priv->irq;
    priv->irq = 0;
    irqnr = xnintr_query_init(&priv->intr_it) * XNARCH_NR_CPUS;

    ret = irqnr + countq(&nkpod->threadq);

    printk(KERN_DEBUG "%s: priv=%p, ->curr=%p, ->irq=(%d), irqnr=%d, ret=%d\n",
           __FUNCTION__, priv, priv->curr, irq, irqnr, ret);

    return ret;
}


The result (the first 3 accesses are without the latency test running,
the last 2 with it running - each time with a 120 µs delay):

[  173.098667] vfile_stat_rewind: priv=ffff880211863228, ->curr=ffffffffa06b1110, ->irq=(0), irqnr=256, ret=264
[  181.547424] vfile_stat_rewind: priv=ffff880211863b28, ->curr=ffffffffa06b1110, ->irq=(0), irqnr=256, ret=264
[  183.002400] vfile_stat_rewind: priv=ffff880211863228, ->curr=ffffffffa06b1110, ->irq=(0), irqnr=256, ret=264
[  201.475071] vfile_stat_rewind: priv=ffff880211863228, ->curr=ffffffffa06b1110, ->irq=(0), irqnr=256, ret=266
[  209.432070] vfile_stat_rewind: priv=ffff8802118631a8, ->curr=ffffffffa06b1110, ->irq=(0), irqnr=256, ret=266


So vfile... is called exactly once upon issuing 'cat /proc/xenomai/stat'.


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-24  8:57               ` Jeroen Van den Keybus
@ 2014-04-24 14:46                 ` Jeroen Van den Keybus
  2014-04-25  8:15                   ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-24 14:46 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

I've been hammering on this stat entry with various tracing options to
try and catch a trace.

I did eventually catch one, but the function tracing was off at the time:

:|  # [ 2738] samplin 99  -140    0.050  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
:|  # event   tick@-40    -140    0.181  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
:|  # [    0] -<?>-   -1  -140    0.275  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|  # [ 2738] samplin 99  -140    0.456  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
:|  # [ 2738] samplin 99  -139    0.307  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
:|  # [    0] -<?>-   -1  -139! 105.640  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|  # [ 2738] samplin 99   -33    0.081  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
:|  # event   tick@59      -33!  32.250  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
:|  # [    0] -<?>-   -1    -1    0.349  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|  # [ 2738] samplin 99     0    0.974  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
<   + freeze  0x00009088     0   45.490  xnshadow_sys_trace+0xed [xeno_nucleus] (hisyscall_event+0x1a8 [xeno_nucleus])
 |  # [ 2738] samplin 99    45    0.350  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
 |  # [    0] -<?>-   -1    45   52.196  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99    98    0.078  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
 |  # event   tick@159      98    1.790  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # [    0] -<?>-   -1    99    0.283  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99   100    0.676  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
    + freeze  0x00009515   100    0.321  xnshadow_sys_trace+0xed [xeno_nucleus] (hisyscall_event+0x1a8 [xeno_nucleus])
 |  # [ 2738] samplin 99   101    0.323  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
 |  # [    0] -<?>-   -1   101  166.050  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99   267    0.137  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
 |  # event   tick@359     267   84.425  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # [    0] -<?>-   -1   352    0.335  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99   352    0.672  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
    + freeze  0x0002e7eb   353  130.030  xnshadow_sys_trace+0xed [xeno_nucleus] (hisyscall_event+0x1a8 [xeno_nucleus])
 |  # event   tick@559     483    0.715  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # [ 2738] samplin 99   483    0.315  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
 |  # [    0] -<?>-   -1   484   87.028  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99   571    0.050  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
 |  # event   tick@659     571    1.335  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
 |  # [    0] -<?>-   -1   572    0.283  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99   572    3.602  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
 |  # [ 2738] samplin 99   576    0.307  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
 |  # [    0] -<?>-   -1   576  556.825  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
 |  # [ 2738] samplin 99  1133    0.196  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
 |  # event   tick@1159   1133    4.925  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])


When the kernel is configured with 'Instrument function entries', the
bug can be triggered while the trace is disabled. Once I enable the
trace, the bug can no longer be triggered, but I also noticed that
function tracing, which worked correctly before, no longer works
properly, and a warning is issued:

[  497.121124] ------------[ cut here ]------------
[  497.121131] WARNING: at kernel/trace/ftrace.c:386 register_ftrace_function+0x1e4/0x230()
[  497.121132] Modules linked in: xeno_native xeno_nucleus i915 drm_kms_helper drm coretemp ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ehci_pci ehci_hcd video lpc_ich rtc_cmos backlight mfd_core e1000e xhci_hcd igb usbcore firewire_ohci firewire_core i2c_algo_bit hwmon ptp crc_itu_t pps_core usb_common
[  497.121152] CPU: 0 PID: 2259 Comm: bash Tainted: G        W    3.10.18-ipipe-test-14 #82
[  497.121153] Hardware name: Supermicro X10SAE/X10SAE, BIOS 1.1a 01/03/2014
[  497.121154]  0000000000000009 ffff88020e9afe28 ffffffff8148ed1e ffff88020e9afe60
[  497.121156]  ffffffff8103c071 00000000fffffff0 0000000000000002 0000000001eb0008
[  497.121158]  0000000000000002 0000000000000000 ffff88020e9afe70 ffffffff8103c14a
[  497.121160] Call Trace:
[  497.121164]  [<ffffffff8148ed1e>] dump_stack+0x19/0x1b
[  497.121167]  [<ffffffff8103c071>] warn_slowpath_common+0x61/0x80
[  497.121169]  [<ffffffff8103c14a>] warn_slowpath_null+0x1a/0x20
[  497.121171]  [<ffffffff810c85b4>] register_ftrace_function+0x1e4/0x230
[  497.121173]  [<ffffffff810c2afc>] __ipipe_wr_enable+0x10c/0x120
[  497.121177]  [<ffffffff81231911>] ? security_file_permission+0x21/0xa0
[  497.121180]  [<ffffffff811ae56d>] proc_reg_write+0x3d/0x80
[  497.121182]  [<ffffffff8114d17d>] vfs_write+0xbd/0x1e0
[  497.121184]  [<ffffffff8114db49>] SyS_write+0x49/0xa0
[  497.121187]  [<ffffffff8149c866>] system_call_fastpath+0x16/0x1b
[  497.121188] ---[ end trace 4c5426aeef218be7 ]---


While function tracing is still working, the traces read more like:

:   + func                -130    0.058  __rt_task_wait_period+0x0 [xeno_native] (hisyscall_event+0x1a8 [xeno_nucleus])
:   + func                -130    0.070  rt_task_wait_period+0x0 [xeno_native] (__rt_task_wait_period+0x1a [xeno_native])
:   + func                -130    0.109  xnpod_wait_thread_period+0x0 [xeno_nucleus] (rt_task_wait_period+0x4f [xeno_native])
:|  # func                -130    0.115  xnpod_suspend_thread+0x0 [xeno_nucleus] (xnpod_wait_thread_period+0x115 [xeno_nucleus])
:|  # func                -130    0.068  __xnpod_schedule+0x0 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
:|  # [ 2271] samplin 99  -130    0.056  __xnpod_schedule+0x11d [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
:|  # func                -130    0.470  xnsched_pick_next+0x0 [xeno_nucleus] (__xnpod_schedule+0x272 [xeno_nucleus])
:|  # [30287] -<?>-   -1  -129    0.245  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|   +func                -129+   1.082  __ipipe_do_sync_pipeline+0x0 (__ipipe_dispatch_irq+0x1c0)
:|   +func                -128    0.097  __ipipe_handle_exception+0x0 (device_not_available+0x1f)
:|   #func                -128    0.105  __ipipe_notify_trap+0x0 (__ipipe_handle_exception+0x75)
:|   #func                -128    0.062  do_device_not_available+0x0 (__ipipe_handle_exception+0xf7)
:|   #func                -128    0.087  user_exit+0x0 (do_device_not_available+0x17)
:|   #func                -128    0.056  ipipe_restore_root+0x0 (user_exit+0xb2)
:|   #func                -128    0.161  math_state_restore+0x0 (do_device_not_available+0x25)
:|   #func                -128!  93.612  __ipipe_restore_root_nosync+0x0 (__ipipe_handle_exception+0x106)
:|   +func                 -34    0.142  __ipipe_handle_irq+0x0 (apic_timer_interrupt+0x60)
:|   +func                 -34    0.081  __ipipe_dispatch_irq+0x0 (__ipipe_handle_irq+0x65)
:|   +func                 -34    0.090  __ipipe_ack_hrtimer_irq+0x0 (__ipipe_dispatch_irq+0x70)
:|   +func                 -34    0.131  lapic_itimer_ack+0x0 (__ipipe_ack_hrtimer_irq+0x3b)
:|  # func                 -33    0.164  xnintr_clock_handler+0x0 [xeno_nucleus] (__ipipe_dispatch_irq+0x171)
:|  # func                 -33    0.088  xntimer_tick_aperiodic+0x0 [xeno_nucleus] (xnintr_clock_handler+0x142 [xeno_nucleus])
:|  # func                 -33    0.077  xnthread_periodic_handler+0x0 [xeno_nucleus] (xntimer_tick_aperiodic+0xd5 [xeno_nucleus])
:|  # func                 -33    0.076  xnpod_resume_thread+0x0 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
:|  # [ 2271] samplin 99   -33    0.175  xnpod_resume_thread+0xe8 [xeno_nucleus] (xnthread_periodic_handler+0x35 [xeno_nucleus])
:|  # func                 -33    0.063  xntimer_next_local_shot+0x0 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
:|  # event   tick@65      -33    0.041  xntimer_next_local_shot+0x63 [xeno_nucleus] (xntimer_tick_aperiodic+0x1b0 [xeno_nucleus])
:|  # func                 -33    0.057  ipipe_timer_set+0x0 (xntimer_next_local_shot+0x6b [xeno_nucleus])
:|  # func                 -33    0.231  lapic_next_deadline+0x0 (ipipe_timer_set+0x5c)
:|  # func                 -32    0.103  __xnpod_schedule+0x0 [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|  # [30287] -<?>-   -1   -32    0.078  __xnpod_schedule+0x11d [xeno_nucleus] (xnintr_clock_handler+0x315 [xeno_nucleus])
:|  # func                 -32!  27.879  xnsched_pick_next+0x0 [xeno_nucleus] (__xnpod_schedule+0x272 [xeno_nucleus])
:|  # func                  -4    0.580  __ipipe_notify_vm_preemption+0x0 (__xnpod_schedule+0x71d [xeno_nucleus])
:|  # [ 2271] samplin 99    -4    0.301  __xnpod_schedule+0x4e1 [xeno_nucleus] (xnpod_suspend_thread+0x44a [xeno_nucleus])
:|  # func                  -4    0.114  xntimer_get_overruns+0x0 [xeno_nucleus] (xnpod_wait_thread_period+0x13c [xeno_nucleus])
:|  # func                  -3+   1.210  __ipipe_restore_head+0x0 (xnpod_wait_thread_period+0x182 [xeno_nucleus])
:   + func                  -2    0.084  __ipipe_syscall_root+0x0 (__ipipe_syscall_root_thunk+0x35)
:   + func                  -2    0.068  __ipipe_notify_syscall+0x0 (__ipipe_syscall_root+0x35)
:   + func                  -2    0.058  ipipe_syscall_hook+0x0 (__ipipe_notify_syscall+0xbf)


Jeroen.




* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-24 14:46                 ` Jeroen Van den Keybus
@ 2014-04-25  8:15                   ` Jeroen Van den Keybus
  2014-04-25 10:44                     ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-25  8:15 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

I'm currently looking into vfile.c. I noticed that the bug gets
triggered by stat mainly because my NR_IRQS is high (2^14), so
vfile_stat_next is simply called much more often than for the other
snapshot entries. However, when I remove the call to xnintr_query_next
from vfile_stat_next (replacing ret = xnintr_query_next(...) with ret =
-ENODEV), the problem still persists.


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-25  8:15                   ` Jeroen Van den Keybus
@ 2014-04-25 10:44                     ` Jeroen Van den Keybus
  2014-09-09 21:03                       ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-04-25 10:44 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

For testing, I've removed the locks from the vfile system. Then the
high latencies reliably disappear.

To test, I built two xeno_nucleus modules: one with the xnlock_get/put_
calls in place and one with dummies. I then use a program that
simply opens and reads the stat file 1,000 times.
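
For reference, a minimal reader along those lines (a sketch of the
idea, not the actual test program used here) could be:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int i, fd;

    for (i = 0; i < 1000; i++) {
        fd = open("/proc/xenomai/stat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Drain the whole file; we only care about exercising
           the snapshot read path. */
        while (read(fd, buf, sizeof(buf)) > 0)
            ;
        close(fd);
    }
    return 0;
}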

With locks:

RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154

Without locks:

RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630

I'll now take a closer look at the vfile system, but if the locks
themselves are malfunctioning, I'm clueless.


BTW, I found that unloading and reloading xeno_nucleus didn't work, due
to a missing rthal_free_ptdkey call in xnshadow_cleanup. I used the
following patch to fix that. (The ability to swap out Xenomai modules
is a real lifesaver when debugging. Thanks!)

--- /home/vdkeybus/work/xenomai/ksrc/nucleus/shadow.c   2014-04-16
22:46:19.018851844 +0200
+++ shadow.c    2014-04-25 09:43:49.838735832 +0200
@@ -3139,6 +3139,8 @@ void xnshadow_cleanup(void)
        }

        rthal_apc_free(lostage_apc);
+
+       rthal_free_ptdkey(nkmmptd);
        rthal_free_ptdkey(nkerrptd);
        rthal_free_ptdkey(nkthrptd);


Jeroen.



* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-04-25 10:44                     ` Jeroen Van den Keybus
@ 2014-09-09 21:03                       ` Gilles Chanteperdrix
  2014-09-10 13:50                         ` Jeroen Van den Keybus
  2014-09-11  5:11                         ` Jan Kiszka
  0 siblings, 2 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-09 21:03 UTC (permalink / raw)
  To: Jeroen Van den Keybus; +Cc: xenomai

On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
> For testing, I've removed the locks from the vfile system. Then the
> high latencies reliably disappear.
> 
> To test, I made two xeno_nucleus modules: one with the xnlock_get/put_
> in place and one with dummies. Subsequently, I use a program that
> simply opens and reads the stat file 1,000 times.
> 
> With locks:
> 
> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
> 
> Without locks:
> 
> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
> 
> I'll now have a closer look into the vfile system but if the locks are
> malfunctioning, I'm clueless.

Answering with a "little" delay, could you try the following patch?

diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
index a6be0dc..cfb0c71 100644
--- a/include/asm-generic/bits/pod.h
+++ b/include/asm-generic/bits/pod.h
@@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
 			cpu_relax();
 			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
 					    XNLOCK_DBG_PASS_CONTEXT);
+			xnarch_memory_barrier();
 		} while(atomic_read(&lock->owner) != ~0);
 }
 EXPORT_SYMBOL_GPL(__xnlock_spin);
diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
index 25bd83f..7a8c4d0 100644
--- a/include/asm-generic/system.h
+++ b/include/asm-generic/system.h
@@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
 	xnarch_memory_barrier();
 
 	atomic_set(&lock->owner, ~0);
+
+	xnarch_memory_barrier();
 }
 
 static inline spl_t
diff --git a/ksrc/nucleus/vfile.c b/ksrc/nucleus/vfile.c
index c8e0363..066c12f 100644
--- a/ksrc/nucleus/vfile.c
+++ b/ksrc/nucleus/vfile.c
@@ -279,6 +279,15 @@ redo:
 			data += vfile->datasz;
 			it->nrdata++;
 		}
+#ifdef CONFIG_SMP
+		{
+			/* Leave some time for other cpus to get the lock */
+			xnticks_t wakeup = xnarch_get_cpu_tsc();
+			wakeup += xnarch_ns_to_tsc(1000);
+			while ((xnsticks_t)(xnarch_get_cpu_tsc() - wakeup) < 0)
+				cpu_relax();
+		}
+#endif
 	}
 
 	if (ret < 0) {
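
(In short: the first two hunks add memory barriers to the xnlock spin
and release paths, and the third makes the vfile snapshot code pause
for roughly 1 µs inside its read loop on SMP, so that other CPUs
spinning on the lock get a chance to take it.)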


> 
> 
> BTW I found that unloading and loading xeno_nucleus didn't work due to
> a missing rthal_free_ptdkey call in xnshadow_cleanup. I used the
> following patch to fix that. (The ability to swap out xenomai modules
> is a real lifesaver when debugging. Thanks!)
> 
> --- /home/vdkeybus/work/xenomai/ksrc/nucleus/shadow.c   2014-04-16
> 22:46:19.018851844 +0200
> +++ shadow.c    2014-04-25 09:43:49.838735832 +0200
> @@ -3139,6 +3139,8 @@ void xnshadow_cleanup(void)
>         }
> 
>         rthal_apc_free(lostage_apc);
> +
> +       rthal_free_ptdkey(nkmmptd);
>         rthal_free_ptdkey(nkerrptd);
>         rthal_free_ptdkey(nkthrptd);

Merged, thanks.


-- 
                                                                Gilles.


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-09 21:03                       ` Gilles Chanteperdrix
@ 2014-09-10 13:50                         ` Jeroen Van den Keybus
  2014-09-10 19:47                           ` Gilles Chanteperdrix
  2014-09-11  5:11                         ` Jan Kiszka
  1 sibling, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2014-09-10 13:50 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

Hi Gilles,



> Answering with a "little" delay, could you try the following patch?
>

No problem. And we understand you are busy.

The tests below consist of running ./latency at 10 kHz while continuously
opening, reading and closing /proc/xenomai/stat. We can read at about 200 Hz.
We let it cook for 10 minutes.
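
For reference, the reader side can be as simple as this sketch (a
hypothetical stand-in for the actual test program, plain POSIX C):

#include <fcntl.h>
#include <unistd.h>

/* Open, read and close /proc/xenomai/stat in a tight loop,
 * as in the test described above. */
int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	for (;;) {
		fd = open("/proc/xenomai/stat", O_RDONLY);
		if (fd < 0)
			return 1;
		do
			n = read(fd, buf, sizeof(buf));
		while (n > 0);
		close(fd);
	}
}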

To verify, we tested again without the patch (problems after one sec
already, so we stopped this):

== Sampling period: 100 us
== Test mode: periodic user-mode task
== All results in microseconds

warming up...

RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      0.725|      0.956|      1.524|       0|     0|      0.725|      1.524
RTD|      0.782|      0.936|      1.482|       0|     0|      0.725|      1.524
RTD|      0.886|      0.936|      1.750|       0|     0|      0.725|      1.750
RTD|      0.886|      2.355|    546.854|       5|     0|      0.725|    546.854
RTD|      1.253|      4.380|    629.025|      15|     0|      0.725|    629.025
RTD|      1.292|      4.348|    578.529|      19|     0|      0.725|    629.025
RTD|      1.287|      4.375|    662.344|      27|     0|      0.725|    662.344
RTD|      1.265|      4.372|    369.331|      35|     0|      0.725|    662.344


And with the patch (same conditions for 10 minutes):

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      1.152|      1.230|      1.889|       0|     0|      0.547|      2.591
RTD|      1.054|      1.231|      1.824|       0|     0|      0.547|      2.591
RTD|      1.148|      1.229|      1.775|       0|     0|      0.547|      2.591
RTD|      1.057|      1.231|      1.782|       0|     0|      0.547|      2.591
RTD|      1.150|      1.231|      1.786|       0|     0|      0.547|      2.591
RTD|      1.045|      1.230|      1.888|       0|     0|      0.547|      2.591
RTD|      1.151|      1.231|      1.761|       0|     0|      0.547|      2.591
RTD|      0.999|      1.230|      2.049|       0|     0|      0.547|      2.591
RTD|      1.148|      1.231|      1.818|       0|     0|      0.547|      2.591
RTD|      1.031|      1.231|      1.784|       0|     0|      0.547|      2.591
RTD|      1.149|      1.231|      1.818|       0|     0|      0.547|      2.591
RTD|      0.832|      1.228|      1.976|       0|     0|      0.547|      2.591
RTD|      1.149|      1.226|      1.805|       0|     0|      0.547|      2.591
RTD|      1.025|      1.225|      1.842|       0|     0|      0.547|      2.591
RTD|      1.150|      1.225|      1.795|       0|     0|      0.547|      2.591
RTD|      1.053|      1.225|      1.774|       0|     0|      0.547|      2.591
RTD|      1.150|      1.226|      1.876|       0|     0|      0.547|      2.591
RTD|      0.910|      1.225|      2.205|       0|     0|      0.547|      2.591
RTD|      1.149|      1.225|      1.819|       0|     0|      0.547|      2.591
RTD|      0.716|      1.225|      1.774|       0|     0|      0.547|      2.591
RTD|      0.873|      1.225|      1.925|       0|     0|      0.547|      2.591
RTT|  00:09:49  (periodic user-mode task, 100 us period, priority 99)



We also checked the kernel performance without reading stat:

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      0.906|      0.917|      1.377|       0|     0|      0.494|      2.600
RTD|      0.905|      0.916|      1.501|       0|     0|      0.494|      2.600
RTD|      0.905|      0.916|      1.357|       0|     0|      0.494|      2.600
RTD|      0.905|      0.960|      1.395|       0|     0|      0.494|      2.600
RTD|      0.594|      0.916|      1.349|       0|     0|      0.494|      2.600
RTD|      0.906|      0.917|      1.364|       0|     0|      0.494|      2.600
RTD|      0.905|      0.916|      1.331|       0|     0|      0.494|      2.600
RTD|      0.905|      0.917|      1.333|       0|     0|      0.494|      2.600
RTD|      0.846|      0.954|      1.363|       0|     0|      0.494|      2.600
RTD|      0.906|      0.917|      1.369|       0|     0|      0.494|      2.600
RTD|      0.906|      0.917|      1.365|       0|     0|      0.494|      2.600
RTD|      0.905|      0.917|      1.341|       0|     0|      0.494|      2.600
RTD|      0.906|      0.917|      1.354|       0|     0|      0.494|      2.600
RTD|      0.906|      0.957|      1.340|       0|     0|      0.494|      2.600
RTD|      0.905|      0.917|      1.380|       0|     0|      0.494|      2.600
RTD|      0.905|      0.918|      1.356|       0|     0|      0.494|      2.600
RTD|      0.905|      0.917|      1.339|       0|     0|      0.494|      2.600
RTD|      0.905|      0.917|      1.331|       0|     0|      0.494|      2.600
RTD|      0.906|      0.955|      1.376|       0|     0|      0.494|      2.600
RTD|      0.906|      0.917|      1.353|       0|     0|      0.494|      2.600
RTD|      0.604|      0.916|      1.379|       0|     0|      0.494|      2.600
RTT|  00:10:10  (periodic user-mode task, 100 us period, priority 99)



You can clearly observe an increase in the average latency (about 300 ns),
but the worst-case latency isn't necessarily worse.

Obviously, the patch works. Thanks!

Do I understand correctly that currently a flag (lock) set on one CPU isn't
immediately observable by another CPU (due to, e.g., caching)?


Jeroen.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-10 13:50                         ` Jeroen Van den Keybus
@ 2014-09-10 19:47                           ` Gilles Chanteperdrix
  0 siblings, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-10 19:47 UTC (permalink / raw)
  To: Jeroen Van den Keybus; +Cc: xenomai

On 09/10/2014 03:50 PM, Jeroen Van den Keybus wrote:
> Hi Gilles,
> 
> 
> 
>> Answering with a "little" delay, could you try the following patch?
>>
> 
> No problem. And we understand you are busy.
> 
> The tests below consist of running ./latency at 10 kHz and continuously
> open, read and close /proc/xenomai/stat. We can read at about 200 Hz. We
> let it cook for 10 minutes.
> 
> To verify, we tested again without the patch (problems after one sec
> already, so we stopped this):
> 
> == Sampling period: 100 us
> == Test mode: periodic user-mode task
> == All results in microseconds
> 
> warming up...
> 
> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|      0.725|      0.956|      1.524|       0|     0|      0.725|      1.524
> RTD|      0.782|      0.936|      1.482|       0|     0|      0.725|      1.524
> RTD|      0.886|      0.936|      1.750|       0|     0|      0.725|      1.750
> RTD|      0.886|      2.355|    546.854|       5|     0|      0.725|    546.854
> RTD|      1.253|      4.380|    629.025|      15|     0|      0.725|    629.025
> RTD|      1.292|      4.348|    578.529|      19|     0|      0.725|    629.025
> RTD|      1.287|      4.375|    662.344|      27|     0|      0.725|    662.344
> RTD|      1.265|      4.372|    369.331|      35|     0|      0.725|    662.344
> 
> 
> And with the patch (same conditions for 10 minutes):
> 
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|      1.152|      1.230|      1.889|       0|     0|      0.547|      2.591
> RTD|      1.054|      1.231|      1.824|       0|     0|      0.547|      2.591
> RTD|      1.148|      1.229|      1.775|       0|     0|      0.547|      2.591
> RTD|      1.057|      1.231|      1.782|       0|     0|      0.547|      2.591
> RTD|      1.150|      1.231|      1.786|       0|     0|      0.547|      2.591
> RTD|      1.045|      1.230|      1.888|       0|     0|      0.547|      2.591
> RTD|      1.151|      1.231|      1.761|       0|     0|      0.547|      2.591
> RTD|      0.999|      1.230|      2.049|       0|     0|      0.547|      2.591
> RTD|      1.148|      1.231|      1.818|       0|     0|      0.547|      2.591
> RTD|      1.031|      1.231|      1.784|       0|     0|      0.547|      2.591
> RTD|      1.149|      1.231|      1.818|       0|     0|      0.547|      2.591
> RTD|      0.832|      1.228|      1.976|       0|     0|      0.547|      2.591
> RTD|      1.149|      1.226|      1.805|       0|     0|      0.547|      2.591
> RTD|      1.025|      1.225|      1.842|       0|     0|      0.547|      2.591
> RTD|      1.150|      1.225|      1.795|       0|     0|      0.547|      2.591
> RTD|      1.053|      1.225|      1.774|       0|     0|      0.547|      2.591
> RTD|      1.150|      1.226|      1.876|       0|     0|      0.547|      2.591
> RTD|      0.910|      1.225|      2.205|       0|     0|      0.547|      2.591
> RTD|      1.149|      1.225|      1.819|       0|     0|      0.547|      2.591
> RTD|      0.716|      1.225|      1.774|       0|     0|      0.547|      2.591
> RTD|      0.873|      1.225|      1.925|       0|     0|      0.547|      2.591
> RTT|  00:09:49  (periodic user-mode task, 100 us period, priority 99)
> 
> 
> 
> We also checked the kernel performance without reading stat:
> 
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|      0.906|      0.917|      1.377|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.916|      1.501|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.916|      1.357|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.960|      1.395|       0|     0|      0.494|      2.600
> RTD|      0.594|      0.916|      1.349|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.917|      1.364|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.916|      1.331|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.917|      1.333|       0|     0|      0.494|      2.600
> RTD|      0.846|      0.954|      1.363|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.917|      1.369|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.917|      1.365|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.917|      1.341|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.917|      1.354|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.957|      1.340|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.917|      1.380|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.918|      1.356|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.917|      1.339|       0|     0|      0.494|      2.600
> RTD|      0.905|      0.917|      1.331|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.955|      1.376|       0|     0|      0.494|      2.600
> RTD|      0.906|      0.917|      1.353|       0|     0|      0.494|      2.600
> RTD|      0.604|      0.916|      1.379|       0|     0|      0.494|      2.600
> RTT|  00:10:10  (periodic user-mode task, 100 us period, priority 99)
> 
> 
> 
> You can clearly observe an increase in the average latency (about 300 ns),
> but the worst-case latency isn't necessarily worse.
> 
> Obviously, the patch works. Thanks!
> 
> Do I understand correctly that currently a flag (lock) set on one CPU isn't
> immediately observable by another CPU (due to, e.g., caching)?

Currently, freeing the lock is not perceived immediately by the CPUs
waiting for the lock; that is because the lock is freed with atomic_set,
which does not contain a barrier. The lock, however, is taken with
atomic_cmpxchg, which is a full barrier, so there is no way a locked
lock can be perceived as free.

That accounts for the barriers in xnlock_put and __xnlock_spin. However,
these two barriers do not seem to be sufficient to avoid the issue
completely: either busy-sleeping in the snapshot code, or putting a
barrier in front of the cmpxchg in the xnlock_get code, seems to be
necessary. I have chosen the latter, because it has a smaller impact
than the former, but I am not entirely satisfied (busy-sleeping is a bit
ugly).
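
To make the effect concrete, here is a condensed sketch of the
resulting lock/release pair (a sketch only, not the verbatim Xenomai
code: the real functions carry debug arguments and handle recursive
locking):

static inline void xnlock_get(xnlock_t *lock)
{
	int cpu = xnarch_current_cpu();

	/* atomic_cmpxchg is a full barrier, so a locked lock can
	 * never be perceived as free. */
	if (atomic_cmpxchg(&lock->owner, ~0, cpu) != ~0)
		__xnlock_spin(lock);	/* cpu_relax() + barrier loop */
}

static inline void xnlock_put(xnlock_t *lock)
{
	xnarch_memory_barrier();	/* commit the protected stores */
	atomic_set(&lock->owner, ~0);	/* plain store, no implied barrier */
	xnarch_memory_barrier();	/* added: publish the release */
}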

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-09 21:03                       ` Gilles Chanteperdrix
  2014-09-10 13:50                         ` Jeroen Van den Keybus
@ 2014-09-11  5:11                         ` Jan Kiszka
  2014-09-11  5:19                           ` Jan Kiszka
  2014-09-16 11:09                           ` Gilles Chanteperdrix
  1 sibling, 2 replies; 40+ messages in thread
From: Jan Kiszka @ 2014-09-11  5:11 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>> For testing, I've removed the locks from the vfile system. Then the
>> high latencies reliably disappear.
>>
>> To test, I made two xeno_nucleus modules: one with the xnlock_get/put_
>> in place and one with dummies. Subsequently, I use a program that
>> simply opens and reads the stat file 1,000 times.
>>
>> With locks:
>>
>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>
>> Without locks:
>>
>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>
>> I'll now have a closer look into the vfile system but if the locks are
>> malfunctioning, I'm clueless.
> 
> Answering with a "little" delay, could you try the following patch?
> 
> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
> index a6be0dc..cfb0c71 100644
> --- a/include/asm-generic/bits/pod.h
> +++ b/include/asm-generic/bits/pod.h
> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>  			cpu_relax();
>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>  					    XNLOCK_DBG_PASS_CONTEXT);
> +			xnarch_memory_barrier();
>  		} while(atomic_read(&lock->owner) != ~0);
>  }
>  EXPORT_SYMBOL_GPL(__xnlock_spin);
> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
> index 25bd83f..7a8c4d0 100644
> --- a/include/asm-generic/system.h
> +++ b/include/asm-generic/system.h
> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>  	xnarch_memory_barrier();
>  
>  	atomic_set(&lock->owner, ~0);
> +
> +	xnarch_memory_barrier();

That's pretty heavy-weighted now (it was already due to the first memory
barrier). Maybe it's better to look at some ticket lock mechanism like
Linux uses for fairness. At least on x86 (and other strictly ordered
archs), those require no memory barriers on release.

Jan

>  }
>  
>  static inline spl_t
> diff --git a/ksrc/nucleus/vfile.c b/ksrc/nucleus/vfile.c
> index c8e0363..066c12f 100644
> --- a/ksrc/nucleus/vfile.c
> +++ b/ksrc/nucleus/vfile.c
> @@ -279,6 +279,15 @@ redo:
>  			data += vfile->datasz;
>  			it->nrdata++;
>  		}
> +#ifdef CONFIG_SMP
> +		{
> +			/* Leave some time for other cpus to get the lock */
> +			xnticks_t wakeup = xnarch_get_cpu_tsc();
> +			wakeup += xnarch_ns_to_tsc(1000);
> +			while ((xnsticks_t)(xnarch_get_cpu_tsc() - wakeup) < 0)
> +				cpu_relax();
> +		}
> +#endif
>  	}
>  
>  	if (ret < 0) {
> 
> 
>>
>>
>> BTW I found that unloading and loading xeno_nucleus didn't work due to
>> a missing rthal_free_ptdkey call in xnshadow_cleanup. I used the
>> following patch to fix that. (The ability to swap out xenomai modules
>> is a real lifesaver when debugging. Thanks!)
>>
>> --- /home/vdkeybus/work/xenomai/ksrc/nucleus/shadow.c   2014-04-16
>> 22:46:19.018851844 +0200
>> +++ shadow.c    2014-04-25 09:43:49.838735832 +0200
>> @@ -3139,6 +3139,8 @@ void xnshadow_cleanup(void)
>>         }
>>
>>         rthal_apc_free(lostage_apc);
>> +
>> +       rthal_free_ptdkey(nkmmptd);
>>         rthal_free_ptdkey(nkerrptd);
>>         rthal_free_ptdkey(nkthrptd);
> 
> Merged, thanks.
> 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-11  5:11                         ` Jan Kiszka
@ 2014-09-11  5:19                           ` Jan Kiszka
  2014-09-18 11:46                             ` Gilles Chanteperdrix
  2014-09-16 11:09                           ` Gilles Chanteperdrix
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-11  5:19 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-11 07:11, Jan Kiszka wrote:
> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>> For testing, I've removed the locks from the vfile system. Then the
>>> high latencies reliably disappear.
>>>
>>> To test, I made two xeno_nucleus modules: one with the xnlock_get/put_
>>> in place and one with dummies. Subsequently, I use a program that
>>> simply opens and reads the stat file 1,000 times.
>>>
>>> With locks:
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>
>>> Without locks:
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>
>>> I'll now have a closer look into the vfile system but if the locks are
>>> malfunctioning, I'm clueless.
>>
>> Answering with a "little" delay, could you try the following patch?
>>
>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>> index a6be0dc..cfb0c71 100644
>> --- a/include/asm-generic/bits/pod.h
>> +++ b/include/asm-generic/bits/pod.h
>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>  			cpu_relax();
>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>  					    XNLOCK_DBG_PASS_CONTEXT);
>> +			xnarch_memory_barrier();
>>  		} while(atomic_read(&lock->owner) != ~0);
>>  }
>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>> index 25bd83f..7a8c4d0 100644
>> --- a/include/asm-generic/system.h
>> +++ b/include/asm-generic/system.h
>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>  	xnarch_memory_barrier();
>>  
>>  	atomic_set(&lock->owner, ~0);
>> +
>> +	xnarch_memory_barrier();
> 
> That's pretty heavy-weighted now (it was already due to the first memory
> barrier). Maybe it's better to look at some ticket lock mechanism like
> Linux uses for fairness. At least on x86 (and other strictly ordered
> archs), those require no memory barriers on release.

In fact, memory barriers aren't needed on strictly ordered archs already
today, independent of the spinlock granting algorithm. So there are two
optimization possibilities:

- ticket-based granting
- arch-specific (thus optimized) core
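
A ticket-based xnlock could look roughly like this (hypothetical
sketch, neither actual Linux nor Xenomai code; fairness comes from
serving tickets in FIFO order):

typedef struct {
	atomic_t next;		/* next ticket to hand out */
	atomic_t serving;	/* ticket currently allowed in */
} xnticketlock_t;

static inline void xnticket_get(xnticketlock_t *lock)
{
	/* atomic_add_return implies a full barrier */
	int ticket = atomic_add_return(1, &lock->next) - 1;

	while (atomic_read(&lock->serving) != ticket)
		cpu_relax();
}

static inline void xnticket_put(xnticketlock_t *lock)
{
	/* Only the lock holder updates 'serving', so a plain
	 * read-modify-write is fine. On strictly ordered archs such
	 * as x86 a compiler barrier is enough here; weakly ordered
	 * archs would still need a memory barrier before the store. */
	barrier();
	atomic_set(&lock->serving, atomic_read(&lock->serving) + 1);
}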

Jan



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-11  5:11                         ` Jan Kiszka
  2014-09-11  5:19                           ` Jan Kiszka
@ 2014-09-16 11:09                           ` Gilles Chanteperdrix
  1 sibling, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-16 11:09 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/11/2014 07:11 AM, Jan Kiszka wrote:
> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>> For testing, I've removed the locks from the vfile system. Then the
>>> high latencies reliably disappear.
>>>
>>> To test, I made two xeno_nucleus modules: one with the xnlock_get/put_
>>> in place and one with dummies. Subsequently, I use a program that
>>> simply opens and reads the stat file 1,000 times.
>>>
>>> With locks:
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>
>>> Without locks:
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>
>>> I'll now have a closer look into the vfile system but if the locks are
>>> malfunctioning, I'm clueless.
>>
>> Answering with a "little" delay, could you try the following patch?
>>
>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>> index a6be0dc..cfb0c71 100644
>> --- a/include/asm-generic/bits/pod.h
>> +++ b/include/asm-generic/bits/pod.h
>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>  			cpu_relax();
>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>  					    XNLOCK_DBG_PASS_CONTEXT);
>> +			xnarch_memory_barrier();
>>  		} while(atomic_read(&lock->owner) != ~0);
>>  }
>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>> index 25bd83f..7a8c4d0 100644
>> --- a/include/asm-generic/system.h
>> +++ b/include/asm-generic/system.h
>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>  	xnarch_memory_barrier();
>>  
>>  	atomic_set(&lock->owner, ~0);
>> +
>> +	xnarch_memory_barrier();
> 
> That's pretty heavy-weighted now (it was already due to the first memory
> barrier). Maybe it's better to look at some ticket lock mechanism like
> Linux uses for fairness. At least on x86 (and other strictly ordered
> archs), those require no memory barriers on release.

Maybe I can use atomic_cmpxchg(cpu, ~0); at least it will be only one
big barrier instead of two. I believe this is what the original xnlock
code did; I do not remember why it got changed to the current bogus
implementation. It even allows a cheap check for invalid unlocks. I am
not too fond of such an invasive change as replacing the lock
implementation completely in 2.6, especially since the issue we have is
a corner case (the problem is caused by /proc/xenomai/stat, which
quickly locks and unlocks the nklock).
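
The cmpxchg-based unlock would be something like the following sketch
(the debug helper name is a placeholder, not an existing function):

static inline void xnlock_put(xnlock_t *lock)
{
	int cpu = xnarch_current_cpu();

	/* One full barrier, plus a cheap sanity check: if 'owner'
	 * does not hold our cpu number, this is an invalid unlock. */
	if (atomic_cmpxchg(&lock->owner, cpu, ~0) != cpu)
		xnlock_dbg_invalid_unlock(lock);	/* placeholder */
}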


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-11  5:19                           ` Jan Kiszka
@ 2014-09-18 11:46                             ` Gilles Chanteperdrix
  2014-09-18 11:59                               ` Jan Kiszka
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 11:46 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/11/2014 07:19 AM, Jan Kiszka wrote:
> On 2014-09-11 07:11, Jan Kiszka wrote:
>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>> For testing, I've removed the locks from the vfile system.
>>>> Then the high latencies reliably disappear.
>>>> 
>>>> To test, I made two xeno_nucleus modules: one with the
>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>> I use a program that simply opens and reads the stat file
>>>> 1,000 times.
>>>> 
>>>> With locks:
>>>> 
>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>> 
>>>> Without locks:
>>>> 
>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>> 
>>>> I'll now have a closer look into the vfile system but if the
>>>> locks are malfunctioning, I'm clueless.
>>> 
>>> Answering with a "little" delay, could you try the following
>>> patch?
>>> 
>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>> index a6be0dc..cfb0c71 100644
>>> --- a/include/asm-generic/bits/pod.h
>>> +++ b/include/asm-generic/bits/pod.h
>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>  			cpu_relax();
>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>> +			xnarch_memory_barrier();
>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>  }
>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>> index 25bd83f..7a8c4d0 100644
>>> --- a/include/asm-generic/system.h
>>> +++ b/include/asm-generic/system.h
>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>  	xnarch_memory_barrier();
>>>  
>>>  	atomic_set(&lock->owner, ~0);
>>> +
>>> +	xnarch_memory_barrier();
>> 
>> That's pretty heavy-weighted now (it was already due to the first
>> memory barrier). Maybe it's better to look at some ticket lock
>> mechanism like Linux uses for fairness. At least on x86 (and
>> other strictly ordered archs), those require no memory barriers
>> on release.
> 
> In fact, memory barriers aren't needed on strictly ordered archs
> already today, independent of the spinlock granting algorithm. So
> there are two optimization possibilities:
> 
> - ticket-based granting
> - arch-specific (thus optimized) core

Ok, no answer, so I will try to be more clear.

I do not pretend to understand how memory barriers work at a low
level, this is a shame, I know, and am sorry for that. My "high level"
view, is that memory barriers on SMP systems act as synchronization
points, meaning that when a CPU issues a barrier, it will "see" the
state of the other CPUs at the time of their last barrier. This means
that for a CPU to see a store that occurred on another CPU, there must
have been two barriers: a barrier after the store on one cpu, and a
barrier after that before the read on the other cpu. This view of
things seems to be corroborated by the fact that the patch works, and
by the following sentence in Documentation/memory-barriers.txt:

 (*) There is no guarantee that a CPU will see the correct order of
effects from a second CPU's accesses, even _if_ the second CPU uses a
memory barrier, unless the first CPU _also_ uses a matching memory
barrier (see the subsection on "SMP Barrier Pairing").

So, the lack of a memory barrier after atomic_set in xnlock_put looks
like a bug to me, and your assertion that ticket-based algorithms do
not require memory barriers looks dubious.

Now, I do not really know what "strictly ordered architecture" means
(a shame, again, sorry), but I suspect it implies strict ordering on
one core, but not amongst cores, so that the two barriers thing
remains mandatory. So, in short, on a fully ordered system, the
barrier before atomic_set can be removed, but the one after atomic_set
is still necessary. If this is the case, then we would simply need to
define an xnarch_local_memory_barrier() which implies ordering on the
current cpu, and that would simply be a compiler barrier on x86, and
we do not need a complete reimplementation of the spinlocks just for
one barrier.

For the same reason, I find that the memory barrier before atomic_read
in __xnlock_spin is necessary. In fact it is necessary only on x86
which is the only architecture where cpu_relax() is not defined to be
a barrier, but anyway, I do not believe this barrier is a problem
since it happens on a slow path.
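
In terms of the code at hand, the pairing looks like this (a sketch of
the two sides after the patch, generic kernel-style C):

/* CPU A: releasing side (as in xnlock_put) */
static void lock_release(atomic_t *owner)
{
	atomic_set(owner, ~0);		/* store may sit in a write buffer */
	xnarch_memory_barrier();	/* barrier after the store ... */
}

/* CPU B: spinning side (as in __xnlock_spin) */
static void lock_wait_free(atomic_t *owner)
{
	do {
		cpu_relax();
		xnarch_memory_barrier();	/* ... pairs with a barrier
						 * before each read */
	} while (atomic_read(owner) != ~0);
}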

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 11:46                             ` Gilles Chanteperdrix
@ 2014-09-18 11:59                               ` Jan Kiszka
  2014-09-18 12:11                                 ` Gilles Chanteperdrix
                                                   ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 11:59 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>> For testing, I've removed the locks from the vfile system.
>>>>> Then the high latencies reliably disappear.
>>>>>
>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>> I use a program that simply opens and reads the stat file
>>>>> 1,000 times.
>>>>>
>>>>> With locks:
>>>>>
>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>
>>>>> Without locks:
>>>>>
>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>
>>>>> I'll now have a closer look into the vfile system but if the
>>>>> locks are malfunctioning, I'm clueless.
>>>>
>>>> Answering with a "little" delay, could you try the following
>>>> patch?
>>>>
>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>> index a6be0dc..cfb0c71 100644
>>>> --- a/include/asm-generic/bits/pod.h
>>>> +++ b/include/asm-generic/bits/pod.h
>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>  			cpu_relax();
>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>> +			xnarch_memory_barrier();
>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>> index 25bd83f..7a8c4d0 100644
>>>> --- a/include/asm-generic/system.h
>>>> +++ b/include/asm-generic/system.h
>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>  	xnarch_memory_barrier();
>>>>  
>>>>  	atomic_set(&lock->owner, ~0);
>>>> +
>>>> +	xnarch_memory_barrier();
>>>
>>> That's pretty heavy-weighted now (it was already due to the first
>>> memory barrier). Maybe it's better to look at some ticket lock
>>> mechanism like Linux uses for fairness. At least on x86 (and
>>> other strictly ordered archs), those require no memory barriers
>>> on release.
> 
>> In fact, memory barriers aren't needed on strictly ordered archs
>> already today, independent of the spinlock granting algorithm. So
>> there are two optimization possibilities:
> 
>> - ticket-based granting
>> - arch-specific (thus optimized) core
> 
> Ok, no answer, so I will try to be more clear.
> 
> I do not pretend to understand how memory barriers work at a low
> level, this is a shame, I know, and am sorry for that. My "high level"
> view, is that memory barriers on SMP systems act as synchronization
> points, meaning that when a CPU issues a barrier, it will "see" the
> state of the other CPUs at the time of their last barrier. This means
>> that for a CPU to see a store that occurred on another CPU, there must
> have been two barriers: a barrier after the store on one cpu, and a
> barrier after that before the read on the other cpu. This view of
> things seems to be corroborated by the fact that the patch works, and
> by the following sentence in Documentation/memory-barriers.txt:
> 
>  (*) There is no guarantee that a CPU will see the correct order of
> effects from a second CPU's accesses, even _if_ the second CPU uses a
> memory barrier, unless the first CPU _also_ uses a matching memory
> barrier (see the subsection on "SMP Barrier Pairing").

[quick answer]

...or the architecture refrains from reordering write requests, like x86
does. What may happen, though, is that the compiler reorders the writes.
Therefore you need at least a (much cheaper) compiler barrier on those
archs. See also linux/Documentation/memory-barriers.txt on this and more.

Jan

> 
> So, the lack of a memory barrier after atomic_set in xnlock_put looks
> like a bug to me, and your assertion that ticket-based algorithms do
> not require memory barriers looks dubious.
> 
> Now, I do not really know what "strictly ordered architecture" means
> (a shame, again, sorry), but I suspect it implies strict ordering on
> one core, but not amongst cores, so that the two barriers thing
> remains mandatory. So, in short, on a fully ordered system, the
> barrier before atomic_set can be removed, but the one after atomic_set
> is still necessary. If this is the case, then we would simply need to
> define an xnarch_local_memory_barrier() which implies ordering on the
> current cpu, and that would simply be a compiler barrier on x86, and
> we do not need a complete reimplementation of the spinlocks just for
> one barrier.
> 
> For the same reason, I find that the memory barrier before atomic_read
> in __xnlock_spin is necessary. In fact it is necessary only on x86
> which is the only architecture where cpu_relax() is not defined to be
> a barrier, but anyway, I do not believe this barrier is a problem
> since it happens on a slow path.
> 
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 11:59                               ` Jan Kiszka
@ 2014-09-18 12:11                                 ` Gilles Chanteperdrix
  2014-09-18 12:17                                 ` Gilles Chanteperdrix
  2014-09-18 20:21                                 ` Gilles Chanteperdrix
  2 siblings, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 12:11 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 01:59 PM, Jan Kiszka wrote:
> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>> Then the high latencies reliably disappear.
>>>>>>
>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>> I use a program that simply opens and reads the stat file
>>>>>> 1,000 times.
>>>>>>
>>>>>> With locks:
>>>>>>
>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>
>>>>>> Without locks:
>>>>>>
>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>
>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>> locks are malfunctioning, I'm clueless.
>>>>>
>>>>> Answering with a "little" delay, could you try the following
>>>>> patch?
>>>>>
>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>> index a6be0dc..cfb0c71 100644
>>>>> --- a/include/asm-generic/bits/pod.h
>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>  			cpu_relax();
>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>> +			xnarch_memory_barrier();
>>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>> index 25bd83f..7a8c4d0 100644
>>>>> --- a/include/asm-generic/system.h
>>>>> +++ b/include/asm-generic/system.h
>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>  	xnarch_memory_barrier();
>>>>>  
>>>>>  	atomic_set(&lock->owner, ~0);
>>>>> +
>>>>> +	xnarch_memory_barrier();
>>>>
>>>> That's pretty heavy-weighted now (it was already due to the first
>>>> memory barrier). Maybe it's better to look at some ticket lock
>>>> mechanism like Linux uses for fairness. At least on x86 (and
>>>> other strictly ordered archs), those require no memory barriers
>>>> on release.
>>
>>> In fact, memory barriers aren't needed on strictly ordered archs
>>> already today, independent of the spinlock granting algorithm. So
>>> there are two optimization possibilities:
>>
>>> - ticket-based granting
>>> - arch-specific (thus optimized) core
>>
>> Ok, no answer, so I will try to be more clear.
>>
>> I do not pretend to understand how memory barriers work at a low
>> level, this is a shame, I know, and am sorry for that. My "high level"
>> view, is that memory barriers on SMP systems act as synchronization
>> points, meaning that when a CPU issues a barrier, it will "see" the
>> state of the other CPUs at the time of their last barrier. This means
>> that for a CPU to see a store that occurred on another CPU, there must
>> have been two barriers: a barrier after the store on one cpu, and a
>> barrier after that before the read on the other cpu. This view of
>> things seems to be corroborated by the fact that the patch works, and
>> by the following sentence in Documentation/memory-barriers.txt:
>>
>>  (*) There is no guarantee that a CPU will see the correct order of
>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>> memory barrier, unless the first CPU _also_ uses a matching memory
>> barrier (see the subsection on "SMP Barrier Pairing").
> 
> [quick answer]
> 
> ...or the architecture refrains from reordering write requests, like x86
> does. What may happen, though, is that the compiler reorders the writes.
> Therefore you need at least a (much cheaper) compiler barrier on those
> archs. See also linux/Documentation/memory-barriers.txt on this and more.

I have answered that; please read the mail completely.


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 11:59                               ` Jan Kiszka
  2014-09-18 12:11                                 ` Gilles Chanteperdrix
@ 2014-09-18 12:17                                 ` Gilles Chanteperdrix
  2014-09-18 12:20                                   ` Jan Kiszka
  2014-09-18 20:21                                 ` Gilles Chanteperdrix
  2 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 12:17 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 01:59 PM, Jan Kiszka wrote:
> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>> Then the high latencies reliably disappear.
>>>>>>
>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>> I use a program that simply opens and reads the stat file
>>>>>> 1,000 times.
>>>>>>
>>>>>> With locks:
>>>>>>
>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>
>>>>>> Without locks:
>>>>>>
>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>
>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>> locks are malfunctioning, I'm clueless.
>>>>>
>>>>> Answering with a "little" delay, could you try the following
>>>>> patch?
>>>>>
>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>> index a6be0dc..cfb0c71 100644
>>>>> --- a/include/asm-generic/bits/pod.h
>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>  			cpu_relax();
>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>> +			xnarch_memory_barrier();
>>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>> index 25bd83f..7a8c4d0 100644
>>>>> --- a/include/asm-generic/system.h
>>>>> +++ b/include/asm-generic/system.h
>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>  	xnarch_memory_barrier();
>>>>>  
>>>>>  	atomic_set(&lock->owner, ~0);
>>>>> +
>>>>> +	xnarch_memory_barrier();
>>>>
>>>> That's pretty heavy-weighted now (it was already due to the first
>>>> memory barrier). Maybe it's better to look at some ticket lock
>>>> mechanism like Linux uses for fairness. At least on x86 (and
>>>> other strictly ordered archs), those require no memory barriers
>>>> on release.
>>
>>> In fact, memory barriers aren't needed on strictly ordered archs
>>> already today, independent of the spinlock granting algorithm. So
>>> there are two optimization possibilities:
>>
>>> - ticket-based granting
>>> - arch-specific (thus optimized) core
>>
>> Ok, no answer, so I will try to be more clear.
>>
>> I do not pretend to understand how memory barriers work at a low
>> level, this is a shame, I know, and am sorry for that. My "high level"
>> view, is that memory barriers on SMP systems act as synchronization
>> points, meaning that when a CPU issues a barrier, it will "see" the
>> state of the other CPUs at the time of their last barrier. This means
>> that for a CPU to see a store that occured on another CPU, there must
>> have been two barriers: a barrier after the store on one cpu, and a
>> barrier after that before the read on the other cpu. This view of
>> things seems to be corroborated by the fact that the patch works, and
>> by the following sentence in Documentation/memory-barriers.txt:
>>
>>  (*) There is no guarantee that a CPU will see the correct order of
>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>> memory barrier, unless the first CPU _also_ uses a matching memory
>> barrier (see the subsection on "SMP Barrier Pairing").
> 
> [quick answer]
> 
> ...or the architecture refrains from reordering write requests, like x86
> does. What may happen, though, is that the compiler reorders the writes.
>> Therefore you need at least a (much cheaper) compiler barrier on those
> archs. See also linux/Documentation/memory-barriers.txt on this and more.

Quick answer: I do not believe an SMP architecture can enforce store
ordering across multiple CPUs, with per-CPU caches and such. And the
fact that the patch I sent fixed the issue on x86 tends to prove me right.

The only reason not to put the barrier after the atomic_set would be
some kind of "optimistic unlocking" optimization, where we leave the
store pending until the next barrier, believing that such a barrier will
happen inevitably. This improves the spinlock overhead on one CPU, at
the expense of the spinning time of the other CPUs in the contended case.
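
In code, that optimization is simply the pre-patch release (sketch;
hypothetical name):

static inline void xnlock_put_optimistic(xnlock_t *lock)
{
	xnarch_memory_barrier();	/* order the protected stores */
	atomic_set(&lock->owner, ~0);	/* may linger in the store buffer
					 * until this cpu's next full
					 * barrier -- cheap here, but the
					 * other cpus may spin longer */
}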


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 12:17                                 ` Gilles Chanteperdrix
@ 2014-09-18 12:20                                   ` Jan Kiszka
  2014-09-18 13:05                                     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 12:20 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 14:17, Gilles Chanteperdrix wrote:
> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>>> Then the high latencies reliably disappear.
>>>>>>>
>>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>>> I use a program that simply opens and reads the stat file
>>>>>>> 1,000 times.
>>>>>>>
>>>>>>> With locks:
>>>>>>>
>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>>
>>>>>>> Without locks:
>>>>>>>
>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>>
>>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>>> locks are malfunctioning, I'm clueless.
>>>>>>
>>>>>> Answering with a "little" delay, could you try the following
>>>>>> patch?
>>>>>>
>>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>>> index a6be0dc..cfb0c71 100644
>>>>>> --- a/include/asm-generic/bits/pod.h
>>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>>  			cpu_relax();
>>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>>> +			xnarch_memory_barrier();
>>>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>>>  }
>>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>>> index 25bd83f..7a8c4d0 100644
>>>>>> --- a/include/asm-generic/system.h
>>>>>> +++ b/include/asm-generic/system.h
>>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>>  	xnarch_memory_barrier();
>>>>>>  
>>>>>>  	atomic_set(&lock->owner, ~0);
>>>>>> +
>>>>>> +	xnarch_memory_barrier();
>>>>>
>>>>> That's pretty heavy-weighted now (it was already due to the first
>>>>> memory barrier). Maybe it's better to look at some ticket lock
>>>>> mechanism like Linux uses for fairness. At least on x86 (and
>>>>> other strictly ordered archs), those require no memory barriers
>>>>> on release.
>>>
>>>> In fact, memory barriers aren't needed on strictly ordered archs
>>>> already today, independent of the spinlock granting algorithm. So
>>>> there are two optimization possibilities:
>>>
>>>> - ticket-based granting
>>>> - arch-specific (thus optimized) core
>>>
>>> Ok, no answer, so I will try to be more clear.
>>>
>>> I do not pretend to understand how memory barriers work at a low
>>> level; this is a shame, I know, and I am sorry for that. My "high level"
>>> view is that memory barriers on SMP systems act as synchronization
>>> points, meaning that when a CPU issues a barrier, it will "see" the
>>> state of the other CPUs as of the time of their last barrier. This means
>>> that for a CPU to see a store that occurred on another CPU, there must
>>> have been two barriers: a barrier after the store on one cpu, and a
>>> barrier after that and before the read on the other cpu. This view of
>>> things seems to be corroborated by the fact that the patch works, and
>>> by the following sentence in Documentation/memory-barriers.txt:
>>>
>>>  (*) There is no guarantee that a CPU will see the correct order of
>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>> barrier (see the subsection on "SMP Barrier Pairing").
>>
>> [quick answer]
>>
>> ...or the architecture refrains from reordering write requests, like x86
>> does. What may happen, though, is that the compiler reorders the writes.
>> Therefore you need at least a (much cheaper) compiler barrier on those
>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
> 
> quick answer: I do not believe an SMP architecture can enforce store
> ordering across multiple cpus, given per-cpu local caches and such. And the
> fact that the patch I sent fixed the issue on x86 tends to prove me right.

It's not wrong, it's just overkill (and costly on larger machines), as the
other cores either see the lock release and all prior changes committed,
or the lock still taken (and the prior changes do not matter then). They
will never see later changes committed before the lock becomes visible as
free. That's architecturally guaranteed, and that's why you have no memory
barriers in x86 spinlock release operations.
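In other words, the release path on such an arch can be reduced to a
compiler barrier plus a plain store. A minimal sketch (assumed names,
not the actual Xenomai code):

static inline void sketch_unlock(atomic_t *owner)
{
	barrier();		/* compiler-only: no fence needed on TSO */
	atomic_set(owner, ~0);	/* plain store; x86 keeps stores ordered */
}

The classic x86 spin_unlock in Linux followed the same pattern: a
non-serializing store with nothing but compiler ordering around it.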

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 12:20                                   ` Jan Kiszka
@ 2014-09-18 13:05                                     ` Gilles Chanteperdrix
  2014-09-18 13:26                                       ` Jan Kiszka
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 13:05 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 02:20 PM, Jan Kiszka wrote:
> On 2014-09-18 14:17, Gilles Chanteperdrix wrote:
>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>> [...]
>>>
>>> [quick answer]
>>>
>>> ...or the architecture refrains from reordering write requests, like x86
>>> does. What may happen, though, is that the compiler reorders the writes.
>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>
>> quick answer: I do not believe an SMP architecture can enforce store
>> ordering across multiple cpus, given per-cpu local caches and such. And the
>> fact that the patch I sent fixed the issue on x86 tends to prove me right.
> 
> It's not wrong, it's just overkill (and costly on larger machines), as the
> other cores either see the lock release and all prior changes committed,
> or the lock still taken (and the prior changes do not matter then). They
> will never see later changes committed before the lock becomes visible as free.

I agree. But this is true on all architectures, not just on strictly
ordered ones; it is simply a consequence of how barriers work on SMP
systems, as I explained.

> That's architecturally guaranteed, and that's why you have no memory
> barriers in x86 spinlock release operations.

I disagree; as explained in the paragraph just below the one you quote,
I believe this is an optimization which is almost valid on any
architecture. Almost valid, because if the cpu which has done the unlock
takes the lock again without any time for a barrier in between to
synchronize the cpus, we have a problem: the other cpus will never
see the spinlock as free. With ticket spinlocks, you just add a store on
the cpu which spins, and you have to add a barrier after that store if you
want the barrier before the read, on the cpu which will acquire the lock,
to see that the spinlock is contended. So I do not see how this requires
fewer barriers.
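For reference, the pairing rule I am relying on can be sketched like
this (a generic illustration in kernel-style C, assuming Linux's
smp_wmb()/smp_rmb()/cpu_relax(), and ignoring READ_ONCE()/WRITE_ONCE()
for brevity):

int data, flag;

void writer(void)		/* runs on cpu 0 */
{
	data = 42;
	smp_wmb();		/* order the data store before the flag store */
	flag = 1;
}

void reader(void)		/* runs on cpu 1 */
{
	while (!flag)
		cpu_relax();
	smp_rmb();		/* pairs with the writer's smp_wmb() */
	/* only now is data guaranteed to be seen as 42 */
}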


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 13:05                                     ` Gilles Chanteperdrix
@ 2014-09-18 13:26                                       ` Jan Kiszka
  2014-09-18 13:44                                         ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 13:26 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 15:05, Gilles Chanteperdrix wrote:
> On 09/18/2014 02:20 PM, Jan Kiszka wrote:
>>> [...]
>>
>> It's not wrong, it's just overkill (and costly on larger machines), as the
>> other cores either see the lock release and all prior changes committed,
>> or the lock still taken (and the prior changes do not matter then). They
>> will never see later changes committed before the lock becomes visible as free.
> 
> I agree. But this is true on all architectures, not just on strictly
> ordered ones; it is simply a consequence of how barriers work on SMP
> systems, as I explained.
> 
>> That's architecturally guaranteed, and that's why you have no memory
>> barriers in x86 spinlock release operations.
> 
> I disagree; as explained in the paragraph just below the one you quote,
> I believe this is an optimization which is almost valid on any
> architecture. Almost valid, because if the cpu which has done the unlock
> takes the lock again without any time for a barrier in between to
> synchronize the cpus, we have a problem: the other cpus will never
> see the spinlock as free. With ticket spinlocks, you just add a store on
> the cpu which spins, and you have to add a barrier after that store if you
> want the barrier before the read, on the cpu which will acquire the lock,
> to see that the spinlock is contended. So I do not see how this requires
> fewer barriers.

Ticket locks prevent unfair starvation without the closing barrier as
they grant the next ticket to the next waiter, not the current holder.
See the Linux implementation.
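A minimal sketch of the idea (assumed helpers, not the Linux or Xenomai
implementation; both counters start at zero):

typedef struct {
	atomic_t next;		/* next ticket to hand out */
	atomic_t serving;	/* ticket currently being served */
} ticketlock_t;

static inline void ticket_lock(ticketlock_t *l)
{
	int ticket = atomic_inc_return(&l->next) - 1;	/* draw a ticket */

	while (atomic_read(&l->serving) != ticket)
		cpu_relax();
	smp_mb();	/* acquire: fence before entering the section */
}

static inline void ticket_unlock(ticketlock_t *l)
{
	smp_mb();	/* release: commit the critical-section stores */
	atomic_inc(&l->serving);	/* hand the lock to the next waiter */
}

The unlocker only increments "serving"; it never competes for the lock
word again, so a waiter cannot be starved by an immediate re-lock.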

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 13:26                                       ` Jan Kiszka
@ 2014-09-18 13:44                                         ` Gilles Chanteperdrix
  2014-09-18 16:14                                           ` Jan Kiszka
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 13:44 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 03:26 PM, Jan Kiszka wrote:
> On 2014-09-18 15:05, Gilles Chanteperdrix wrote:
>> On 09/18/2014 02:20 PM, Jan Kiszka wrote:
>>>> [...]
>>
>>> That's architecturally guaranteed, and that's why you have no memory
>>> barriers in x86 spinlock release operations.
>>
>> I disagree; as explained in the paragraph just below the one you quote,
>> I believe this is an optimization which is almost valid on any
>> architecture. Almost valid, because if the cpu which has done the unlock
>> takes the lock again without any time for a barrier in between to
>> synchronize the cpus, we have a problem: the other cpus will never
>> see the spinlock as free. With ticket spinlocks, you just add a store on
>> the cpu which spins, and you have to add a barrier after that store if you
>> want the barrier before the read, on the cpu which will acquire the lock,
>> to see that the spinlock is contended. So I do not see how this requires
>> fewer barriers.
> 
> Ticket locks prevent unfair starvation without the closing barrier as
> they grant the next ticket to the next waiter, not the current holder.
> See the Linux implementation.

Whether to put the closing barrier after the last store is orthogonal
to whether ticket locks are implemented or not. This is all a question of
tradeoffs.

Without the barrier after the last store, you increase the spinning time,
due to the time it takes for the store to become visible on other cpus,
but you optimize the overhead of unlocking.

With ticket spinlocks you avoid the starvation situation, at the expense
of increasing the overhead of the spinlock operations.

I do not know which is worse. I suspect all this does not make much of a
difference, and that what dominates is the duration of the spinlocked
sections anyway.
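Spelled out on the xnlock_put() side, the two variants being weighed look
like this (sketch only; the _cheap/_visible suffixes are made up):

static inline void xnlock_put_cheap(xnlock_t *lock)
{
	xnarch_memory_barrier();	/* commit critical-section stores */
	atomic_set(&lock->owner, ~0);	/* release may sit in the store
					   buffer: cheap unlock, longer
					   remote spinning */
}

static inline void xnlock_put_visible(xnlock_t *lock)
{
	xnarch_memory_barrier();
	atomic_set(&lock->owner, ~0);
	xnarch_memory_barrier();	/* push the release out at once:
					   heavier unlock, shorter remote
					   spinning */
}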


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 13:44                                         ` Gilles Chanteperdrix
@ 2014-09-18 16:14                                           ` Jan Kiszka
  2014-09-18 16:28                                             ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 16:14 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 15:44, Gilles Chanteperdrix wrote:
> On 09/18/2014 03:26 PM, Jan Kiszka wrote:
>>> [...]
>>
>> Ticket locks prevent unfair starvation without the closing barrier as
>> they grant the next ticket to the next waiter, not the current holder.
>> See the Linux implementation.
> 
> Whether to put the closing barrier after the last store is orthogonal
> to whether ticket locks are implemented or not. This is all a question of
> tradeoffs.
> 
> Without the barrier after the last store, you increase the spinning time,
> due to the time it takes for the store to become visible on other cpus,
> but you optimize the overhead of unlocking.
> 
> With ticket spinlocks you avoid the starvation situation, at the expense
> of increasing the overhead of the spinlock operations.
> 
> I do not know which is worse. I suspect all this does not make much of a
> difference, and that what dominates is the duration of the spinlocked
> sections anyway.

I think the way the classic Linux spinlocks did this on x86 provides the answer.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 16:14                                           ` Jan Kiszka
@ 2014-09-18 16:28                                             ` Gilles Chanteperdrix
  2014-09-18 18:39                                               ` Gilles Chanteperdrix
  2014-09-18 19:09                                               ` Jan Kiszka
  0 siblings, 2 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 16:28 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 06:14 PM, Jan Kiszka wrote:
> On 2014-09-18 15:44, Gilles Chanteperdrix wrote:
>> On 09/18/2014 03:26 PM, Jan Kiszka wrote:
>>>> [...]
>>>
>>> Ticket locks prevent unfair starvation without the closing barrier as
>>> they grant the next ticket to the next waiter, not the current holder.
>>> See the Linux implementation.
>>
>> Whether to put the closing barrier after the last store is orthogonal
>> to whether ticket locks are implemented or not. This is all a question of
>> tradeoffs.
>>
>> Without the barrier after the last store, you increase the spinning time,
>> due to the time it takes for the store to become visible on other cpus,
>> but you optimize the overhead of unlocking.
>>
>> With ticket spinlocks you avoid the starvation situation, at the expense
>> of increasing the overhead of the spinlock operations.
>>
>> I do not know which is worse. I suspect all this does not make much of a
>> difference, and that what dominates is the duration of the spinlocked
>> sections anyway.
> 
> I think the way the classic Linux spinlocks did this on x86 provides the answer.

The situation is completely different: Linux spinlocks are finely split,
whereas Xenomai basically has only one spinlock, so chances are that it
will be more contended, and the heavy unlock path (the one which implements
the ticket stuff) will be triggered more often. Also, the Xenomai spinlock
(we can lose the "s" anyway) being more contended, the "pending store
barrier" optimization may in fact be detrimental. And finally, due to the
way its spinlocks are split, Linux has scalability issues that Xenomai
cannot even begin to imagine tackling.

Anyway, the discussion is kind of moot because, as I said, we are not
going to change the spinlock implementation in 2.6. What we are
discussing here is whether to put the barrier after the atomic_set, or
whether to put that barrier where it is really needed (in the snapshot
code), and what to do for forge. I also agree that the barrier before the
atomic_set in xnlock_put is not needed on x86, and I proposed an
architecture macro to replace it with a compiler barrier in that case.

I also proposed to replace the atomic_set with a cmpxchg; cmpxchg takes
two barriers on ARM, but I guess on x86 it is only one barrier, so this
would solve the architecture dependency nicely.
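A rough sketch of that last idea (assuming the owner field holds the
locking cpu's id, and a Linux-style atomic_cmpxchg(); untested):

static inline void xnlock_put_cmpxchg(xnlock_t *lock)
{
	/*
	 * A successful cmpxchg acts as a full barrier, so it both commits
	 * the critical-section stores and publishes the release, without
	 * an explicit xnarch_memory_barrier() on either side.
	 */
	atomic_cmpxchg(&lock->owner, xnarch_current_cpu(), ~0);
}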

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 16:28                                             ` Gilles Chanteperdrix
@ 2014-09-18 18:39                                               ` Gilles Chanteperdrix
  2014-09-18 19:23                                                 ` Jan Kiszka
  2014-09-18 19:09                                               ` Jan Kiszka
  1 sibling, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 18:39 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 06:28 PM, Gilles Chanteperdrix wrote:
> On 09/18/2014 06:14 PM, Jan Kiszka wrote:
>>> [...]
>>
>> I think the way the classic Linux spinlocks did this on x86 provides the answer.
> 
> The situation is completely different: Linux spinlocks are finely split,
> whereas Xenomai basically has only one spinlock, so chances are that it
> will be more contended, and the heavy unlock path (the one which implements
> the ticket stuff) will be triggered more often. Also, the Xenomai spinlock
> (we can lose the "s" anyway) being more contended, the "pending store
> barrier" optimization may in fact be detrimental. And finally, due to the
> way its spinlocks are split, Linux has scalability issues that Xenomai
> cannot even begin to imagine tackling.

Finally, in the eternal worst case vs. average case fight, the case worth
optimizing here is the contended one, and I believe adding the barrier
after the atomic_set in xnlock_put is what optimizes this worst case best
because, again, it reduces the time between the unlock and its visibility
on the spinning cpu. That is at least something Linux does not have to
care about, since the worst case is not what it is optimized for.

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 16:28                                             ` Gilles Chanteperdrix
  2014-09-18 18:39                                               ` Gilles Chanteperdrix
@ 2014-09-18 19:09                                               ` Jan Kiszka
  2014-09-18 19:32                                                 ` Gilles Chanteperdrix
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 19:09 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 18:28, Gilles Chanteperdrix wrote:
> On 09/18/2014 06:14 PM, Jan Kiszka wrote:
>> On 2014-09-18 15:44, Gilles Chanteperdrix wrote:
>>> On 09/18/2014 03:26 PM, Jan Kiszka wrote:
>>>> On 2014-09-18 15:05, Gilles Chanteperdrix wrote:
>>>>> On 09/18/2014 02:20 PM, Jan Kiszka wrote:
>>>>>> On 2014-09-18 14:17, Gilles Chanteperdrix wrote:
>>>>>>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>>>>>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>>>>>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>>>>>>>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>>>>>>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>>>>>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>>>>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>>>>>>>>> Then the high latencies reliably disappear.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>>>>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>>>>>>>>> I use a program that simply opens and reads the stat file
>>>>>>>>>>>>> 1,000 times.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With locks:
>>>>>>>>>>>>>
>>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without locks:
>>>>>>>>>>>>>
>>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>>>>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>>>>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>>>>>>>>> locks are malfunctioning, I'm clueless.
>>>>>>>>>>>>
>>>>>>>>>>>> Answering with a "little" delay, could you try the following
>>>>>>>>>>>> patch?
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>>>>>>>>> index a6be0dc..cfb0c71 100644
>>>>>>>>>>>> --- a/include/asm-generic/bits/pod.h
>>>>>>>>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>>>>>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>>>>>>>>  			cpu_relax();
>>>>>>>>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>>>>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>>>>>>>>> +			xnarch_memory_barrier();
>>>>>>>>>>>>  		} while(atomic_read(&lock->owner) != ~0);
>>>>>>>>>>>>  }
>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>>>>>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>>>>>>>>> index 25bd83f..7a8c4d0 100644
>>>>>>>>>>>> --- a/include/asm-generic/system.h
>>>>>>>>>>>> +++ b/include/asm-generic/system.h
>>>>>>>>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>>>>>>>>  	xnarch_memory_barrier();
>>>>>>>>>>>>
>>>>>>>>>>>>  	atomic_set(&lock->owner, ~0);
>>>>>>>>>>>> +
>>>>>>>>>>>> +	xnarch_memory_barrier();
>>>>>>>>>>>
>>>>>>>>>>> That's pretty heavyweight now (it already was, due to the first
>>>>>>>>>>> memory barrier). Maybe it's better to look at some ticket-lock
>>>>>>>>>>> mechanism like the one Linux uses for fairness. At least on x86 (and
>>>>>>>>>>> other strictly ordered archs), those require no memory barriers
>>>>>>>>>>> on release.
>>>>>>>>>
>>>>>>>>>> In fact, memory barriers aren't needed on strictly ordered archs
>>>>>>>>>> already today, independent of the spinlock granting algorithm. So
>>>>>>>>>> there are two optimization possibilities:
>>>>>>>>>
>>>>>>>>>> - ticket-based granting
>>>>>>>>>> - arch-specific (thus optimized) core
>>>>>>>>>
>>>>>>>>> Ok, no answer, so I will try to be more clear.
>>>>>>>>>
>>>>>>>>> I do not pretend to understand how memory barriers work at a low
>>>>>>>>> level, this is a shame, I know, and am sorry for that. My "high level"
>>>>>>>>> view, is that memory barriers on SMP systems act as synchronization
>>>>>>>>> points, meaning that when a CPU issues a barrier, it will "see" the
>>>>>>>>> state of the other CPUs at the time of their last barrier. This means
>>>>>>>>> that for a CPU to see a store that occurred on another CPU, there must
>>>>>>>>> have been two barriers: a barrier after the store on one cpu, and a
>>>>>>>>> barrier after that before the read on the other cpu. This view of
>>>>>>>>> things seems to be corroborated by the fact that the patch works, and
>>>>>>>>> by the following sentence in Documentation/memory-barriers.txt:
>>>>>>>>>
>>>>>>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>>>>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>>>>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>>>>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>>>>>>
>>>>>>>> [quick answer]
>>>>>>>>
>>>>>>>> ...or the architecture refrains from reordering write requests, like x86
>>>>>>>> does. What may happen, though, is that the compiler reorders the writes.
>>>>>>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>>>>>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>>>>>>
>>>>>>> quick answer: I do not believe an SMP architecture can enforce store
>>>>>>> ordering across multiple cpus, what with per-cpu local caches and such.
>>>>>>> And the fact that the patch I sent fixed the issue on x86 tends to prove
>>>>>>> me right.
>>>>>>
>>>>>> It's not wrong, it's just overkill (and costly on larger machines): the
>>>>>> other cores either see the lock released and all prior changes committed,
>>>>>> or see the lock taken (and the prior changes do not matter then). They
>>>>>> will never see later changes committed before the lock becomes visible
>>>>>> as free.
>>>>>
>>>>> I agree. But this is true on all architectures, not just on strictly
>>>>> ordered ones, this is just due to how barriers work on SMP systems, as I
>>>>> explained.
>>>>>
>>>>>> That's architecturally guaranteed, and that's why you have no memory
>>>>>> barriers in x86 spinlock release operations.
>>>>>
>>>>> I disagree, as explained in the paragraph just below the one you quote,
>>>>> I believe this is an optimization, which is almost valid on any
>>>>> architecture. Almost valid, because if the cpu which has done the unlock
>>>>> does another lock without any time for a barrier in between to
>>>>> synchronize cpus, we have a problem, because the other cpus will never
>>>>> see the spinlock as free. With ticket spinlocks, you just add a store on
>>>>> the cpu which spins, and you have to add a barrier after that, if you
>>>>> want the barrier before the read on the cpu which will acquire the lock
>>>>> to see that the spinlock is contended. So I do not see how this requires
>>>>> fewer barriers.
>>>>
>>>> Ticket locks prevent unfair starvation without the closing barrier as
>>>> they grant the next ticket to the next waiter, not the current holder.
>>>> See the Linux implementation.
>>>
>>> Whether to put the closing barrier after the last store is orthogonal
>>> to whether ticket locks are implemented or not. This is all a question
>>> of tradeoffs.
>>>
>>> Without the barrier after the last store, you increase the spinning
>>> time, due to the time taken for the store to become visible on other
>>> cpus, but you optimize the overhead of unlocking.
>>>
>>> With ticket spinlocks you avoid the starvation situation, at the
>>> expense of increasing the overhead of spinlock operations.
>>>
>>> I do not know which is worse. I suspect all this does not make much of
>>> a difference, and what dominates is the duration of the critical
>>> sections anyway.
>>
>> I think the way the classic Linux spinlocks did this on x86 provides
>> the answer.
> 
> The situation is completely different: Linux spinlocks are well split,
> whereas Xenomai basically has only one spinlock, so chances are that it
> will be more contended and the heavy unlock path (the one which
> implements the ticket stuff) will be triggered more often. Also, the
> Xenomai spinlock (we can lose the 's' anyway) being more contended, the
> "pending store barrier" optimization in fact has a good chance of being
> detrimental. And finally, due to the way its spinlocks are split, Linux
> has scalability issues that Xenomai cannot even begin to imagine
> tackling.
> 
> Anyway, the discussion is kind of moot because, as I said, we are not
> going to change the spinlock implementation in 2.6. What we are
> discussing here is whether to put the barrier after the atomic_set, or
> to put that barrier where it is really needed, in the snapshot code,
> and what to do for forge. I also agree that the barrier before the
> atomic_set in xnlock_put is not needed on x86, and I proposed an
> architecture macro to replace it with a compiler barrier in that case.

Yes, seems reasonable.
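
As a sketch of what that could look like (xnarch_before_unlock_barrier is
a made-up name for illustration, not an existing Xenomai macro):

#ifdef CONFIG_X86
/* x86 does not reorder stores, so a compiler barrier is enough
   before the releasing store */
#define xnarch_before_unlock_barrier()	barrier()
#else
/* weakly ordered architectures still need a real memory barrier */
#define xnarch_before_unlock_barrier()	xnarch_memory_barrier()
#endif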

> 
> I also proposed to replace the atomic_set with a cmpxchg; cmpxchg has
> two barriers on ARM, but I guess on x86 it is only one barrier, which
> would solve the architecture dependency nicely.

That saves an abstraction, but I have no clue whether "mfence" is as
expensive as "lock cmpxchg". If it is, that's fine, but I suspect it's
not (due to the cacheline "lock").
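
Spelled out, the two x86 release sequences being compared would look
something like this (illustrative only, with made-up function names;
neither is the actual 2.6 code):

/* variant 1: plain store, then a full fence */
static inline void xnlock_put_mfence(xnlock_t *lock)
{
	atomic_set(&lock->owner, ~0);
	__asm__ __volatile__("mfence" ::: "memory");
}

/* variant 2: lock-prefixed cmpxchg, which implies a full barrier */
static inline void xnlock_put_cmpxchg(xnlock_t *lock, int cpu)
{
	/* release only if we still own the lock */
	atomic_cmpxchg(&lock->owner, cpu, ~0);
}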

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 18:39                                               ` Gilles Chanteperdrix
@ 2014-09-18 19:23                                                 ` Jan Kiszka
  2014-09-18 19:31                                                   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 19:23 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 20:39, Gilles Chanteperdrix wrote:
> On 09/18/2014 06:28 PM, Gilles Chanteperdrix wrote:
>> [...]
> 
> Finally, in the eternal worst-case vs. average-case fight, the case worth
> optimizing here is the contended one, and I believe adding the barrier
> after atomic_set in xnlock_put is what optimizes this worst case best,
> because, again, it reduces the time between the unlock and its visibility
> on the spinning cpu. That is at least something Linux does not have to
> care about, because the worst case is not what it is optimized for.

Maybe. I'm unsure right now whether we would see prolonged spinning time
due to this on x86. I suspect not, as spinning not only increases
latencies but also burns CPU power uselessly, and that would be noticed
and disliked under Linux.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 19:23                                                 ` Jan Kiszka
@ 2014-09-18 19:31                                                   ` Gilles Chanteperdrix
  0 siblings, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 19:31 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 09:23 PM, Jan Kiszka wrote:
> On 2014-09-18 20:39, Gilles Chanteperdrix wrote:
>> [...]
>>
>> Finally, in the eternal worst-case vs. average-case fight, the case worth
>> optimizing here is the contended one, and I believe adding the barrier
>> after atomic_set in xnlock_put is what optimizes this worst case best,
>> because, again, it reduces the time between the unlock and its visibility
>> on the spinning cpu. That is at least something Linux does not have to
>> care about, because the worst case is not what it is optimized for.
> 
> Maybe. I'm unsure right now whether we would see prolonged spinning time
> due to this on x86. I suspect not, as spinning not only increases
> latencies but also burns CPU power uselessly, and that would be noticed
> and disliked under Linux.

Probably it does not have this issue, because as far as I can tell it
uses an atomic add or an atomic cmpxchg to unlock the spinlock. But these
instructions look as heavy as a barrier to me.

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 19:09                                               ` Jan Kiszka
@ 2014-09-18 19:32                                                 ` Gilles Chanteperdrix
  2014-09-18 19:56                                                   ` Jan Kiszka
  0 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 19:32 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 09:09 PM, Jan Kiszka wrote:
> On 2014-09-18 18:28, Gilles Chanteperdrix wrote:
>> [...]
>>
>> Anyway, the discussion is kind of moot because, as I said, we are not
>> going to change the spinlock implementation in 2.6. What we are
>> discussing here is whether to put the barrier after the atomic_set, or
>> to put that barrier where it is really needed, in the snapshot code,
>> and what to do for forge. I also agree that the barrier before the
>> atomic_set in xnlock_put is not needed on x86, and I proposed an
>> architecture macro to replace it with a compiler barrier in that case.
> 
> Yes, seems reasonable.
> 
>>
>> I also proposed to replace the atomic_set with a cmpxchg; cmpxchg has
>> two barriers on ARM, but I guess on x86 it is only one barrier, which
>> would solve the architecture dependency nicely.
> 
> That saves an abstraction, but I have no clue whether "mfence" is as
> expensive as "lock cmpxchg". If it is, that's fine, but I suspect it's
> not (due to the cacheline "lock").

Unless I misunderstand something in the Linux code, it also uses the
"lock" prefix for unlocking ticket spinlocks: either with a "lock; add"
or with a "lock; cmpxchg".


-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 19:32                                                 ` Gilles Chanteperdrix
@ 2014-09-18 19:56                                                   ` Jan Kiszka
  2014-09-18 20:13                                                     ` Gilles Chanteperdrix
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-18 19:56 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-18 21:32, Gilles Chanteperdrix wrote:
> On 09/18/2014 09:09 PM, Jan Kiszka wrote:
>> [...]
>>>
>>> I also proposed to replace the atomic_set with a cmpxchg; cmpxchg has
>>> two barriers on ARM, but I guess on x86 it is only one barrier, which
>>> would solve the architecture dependency nicely.
>>
>> That saves an abstraction, but I have no clue whether "mfence" is as
>> expensive as "lock cmpxchg". If it is, that's fine, but I suspect it's
>> not (due to the cacheline "lock").
> 
> Unless I misunderstand something in the Linux code, it also uses the
> "lock" prefix for unlocking ticket spinlocks: either with a "lock; add"
> or with a "lock; cmpxchg".

I'm looking at arch/x86/include/asm/spinlock.h, and there is only a
non-atomic __add there. Same in the disassembly.
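
For reference, a bare-bones ticket lock in that spirit (a sketch with
made-up names, not the kernel's implementation); the unlock is a plain
increment plus an ordinary store, with neither a lock prefix nor an
mfence:

struct ticket_lock {
	unsigned short head;	/* ticket currently being served */
	unsigned short tail;	/* next ticket to hand out */
};

static inline void ticket_lock(struct ticket_lock *l)
{
	/* take a ticket: the atomic fetch-and-add is a lock xadd on x86 */
	unsigned short me = __atomic_fetch_add(&l->tail, 1, __ATOMIC_ACQUIRE);

	while (__atomic_load_n(&l->head, __ATOMIC_ACQUIRE) != me)
		__builtin_ia32_pause();		/* cpu_relax() */
}

static inline void ticket_unlock(struct ticket_lock *l)
{
	/* hand over to the next waiter: only the owner writes head, so a
	   non-atomic increment with release ordering is sufficient */
	__atomic_store_n(&l->head, l->head + 1, __ATOMIC_RELEASE);
}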

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 19:56                                                   ` Jan Kiszka
@ 2014-09-18 20:13                                                     ` Gilles Chanteperdrix
  0 siblings, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 20:13 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 09:56 PM, Jan Kiszka wrote:
> On 2014-09-18 21:32, Gilles Chanteperdrix wrote:
>> On 09/18/2014 09:09 PM, Jan Kiszka wrote:
>>> On 2014-09-18 18:28, Gilles Chanteperdrix wrote:
>>>> On 09/18/2014 06:14 PM, Jan Kiszka wrote:
>>>>> On 2014-09-18 15:44, Gilles Chanteperdrix wrote:
>>>>>> On 09/18/2014 03:26 PM, Jan Kiszka wrote:
>>>>>>> On 2014-09-18 15:05, Gilles Chanteperdrix wrote:
>>>>>>>> On 09/18/2014 02:20 PM, Jan Kiszka wrote:
>>>>>>>>> On 2014-09-18 14:17, Gilles Chanteperdrix wrote:
>>>>>>>>>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>>>>>>>>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>>>>>>>>> On 09/11/2014 07:19 AM, Jan Kiszka wrote:
>>>>>>>>>>>>> On 2014-09-11 07:11, Jan Kiszka wrote:
>>>>>>>>>>>>>> On 2014-09-09 23:03, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>>>> On 04/25/2014 12:44 PM, Jeroen Van den Keybus wrote:
>>>>>>>>>>>>>>>> For testing, I've removed the locks from the vfile system.
>>>>>>>>>>>>>>>> Then the high latencies reliably disappear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To test, I made two xeno_nucleus modules: one with the
>>>>>>>>>>>>>>>> xnlock_get/put_ in place and one with dummies. Subsequently,
>>>>>>>>>>>>>>>> I use a program that simply opens and reads the stat file
>>>>>>>>>>>>>>>> 1,000 times.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With locks:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>>>>>> RTD|     -2.575|     -2.309|      9.286|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>>>>> RTD|     -2.364|     -2.276|      1.600|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>>>>> RTD|     -2.482|     -2.274|      2.165|       0|     0|     -2.575|      9.286
>>>>>>>>>>>>>>>> RTD|     -2.368|    135.261|   1478.154|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>>>>> RTD|     -2.368|     -2.272|      2.602|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>>>>> RTD|     -2.499|     -2.272|      6.933|   13008|     0|     -2.575|   1478.154
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Without locks:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RTT|  00:00:01  (periodic user-mode task, 100 us period, priority 99)
>>>>>>>>>>>>>>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>>>>>>>>>>>>>>> RTD|     -2.503|     -2.270|      3.310|       0|     0|     -2.503|      3.310
>>>>>>>>>>>>>>>> RTD|     -2.418|     -2.284|     -1.646|       0|     0|     -2.503|      3.310
>>>>>>>>>>>>>>>> RTD|     -2.496|     -2.275|      4.630|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>>>> RTD|     -2.374|     -2.285|     -1.458|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>>>> RTD|     -2.452|     -2.273|      3.559|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>>>> RTD|     -2.370|     -2.285|     -1.518|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>>>> RTD|     -2.458|     -2.274|      4.203|       0|     0|     -2.503|      4.630
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll now have a closer look into the vfile system but if the
>>>>>>>>>>>>>>>> locks are malfunctioning, I'm clueless.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Answering with a "little" delay, could you try the following
>>>>>>>>>>>>>>> patch?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/asm-generic/bits/pod.h b/include/asm-generic/bits/pod.h
>>>>>>>>>>>>>>> index a6be0dc..cfb0c71 100644
>>>>>>>>>>>>>>> --- a/include/asm-generic/bits/pod.h
>>>>>>>>>>>>>>> +++ b/include/asm-generic/bits/pod.h
>>>>>>>>>>>>>>> @@ -248,6 +248,7 @@ void __xnlock_spin(xnlock_t *lock /*, */ XNLOCK_DBG_CONTEXT_ARGS)
>>>>>>>>>>>>>>>  			cpu_relax();
>>>>>>>>>>>>>>>  			xnlock_dbg_spinning(lock, cpu, &spin_limit /*, */
>>>>>>>>>>>>>>>  					    XNLOCK_DBG_PASS_CONTEXT);
>>>>>>>>>>>>>>> +			xnarch_memory_barrier();
>>>>>>>>>>>>>>>  	} while(atomic_read(&lock->owner) != ~0);
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  EXPORT_SYMBOL_GPL(__xnlock_spin);
>>>>>>>>>>>>>>> diff --git a/include/asm-generic/system.h b/include/asm-generic/system.h
>>>>>>>>>>>>>>> index 25bd83f..7a8c4d0 100644
>>>>>>>>>>>>>>> --- a/include/asm-generic/system.h
>>>>>>>>>>>>>>> +++ b/include/asm-generic/system.h
>>>>>>>>>>>>>>> @@ -378,6 +378,8 @@ static inline void xnlock_put(xnlock_t *lock)
>>>>>>>>>>>>>>>  	xnarch_memory_barrier();
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>  	atomic_set(&lock->owner, ~0);
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	xnarch_memory_barrier();
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's pretty heavyweight now (it already was, due to the first
>>>>>>>>>>>>>> memory barrier). Maybe it's better to look at some ticket lock
>>>>>>>>>>>>>> mechanism like Linux uses for fairness. At least on x86 (and
>>>>>>>>>>>>>> other strictly ordered archs), those require no memory barriers
>>>>>>>>>>>>>> on release.
>>>>>>>>>>>>
>>>>>>>>>>>>> In fact, memory barriers aren't needed on strictly ordered archs
>>>>>>>>>>>>> already today, independent of the spinlock granting algorithm. So
>>>>>>>>>>>>> there are two optimization possibilities:
>>>>>>>>>>>>
>>>>>>>>>>>>> - ticket-based granting
>>>>>>>>>>>>> - arch-specific (thus optimized) core
>>>>>>>>>>>>
>>>>>>>>>>>> Ok, no answer, so I will try to be clearer.
>>>>>>>>>>>>
>>>>>>>>>>>> I do not pretend to understand how memory barriers work at a low
>>>>>>>>>>>> level, this is a shame, I know, and am sorry for that. My "high level"
>>>>>>>>>>>> view, is that memory barriers on SMP systems act as synchronization
>>>>>>>>>>>> points, meaning that when a CPU issues a barrier, it will "see" the
>>>>>>>>>>>> state of the other CPUs at the time of their last barrier. This means
>>>>>>>>>>>> that for a CPU to see a store that occurred on another CPU, there must
>>>>>>>>>>>> have been two barriers: a barrier after the store on one cpu, and a
>>>>>>>>>>>> barrier after that before the read on the other cpu. This view of
>>>>>>>>>>>> things seems to be corroborated by the fact that the patch works, and
>>>>>>>>>>>> by the following sentence in Documentation/memory-barriers.txt:
>>>>>>>>>>>>
>>>>>>>>>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>>>>>>>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>>>>>>>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>>>>>>>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>>>>>>>>>
>>>>>>>>>>> [quick answer]
>>>>>>>>>>>
>>>>>>>>>>> ...or the architecture refrains from reordering write requests, like x86
>>>>>>>>>>> does. What may happen, though, is that the compiler reorders the writes.
>>>>>>>>>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>>>>>>>>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>>>>>>>>>
>>>>>>>>>> quick answer: I do not believe an SMP architecture can enforce store
>>>>>>>>>> ordering across multiple cpus, with per-cpu local caches and such. And
>>>>>>>>>> the fact that the patch I sent fixed the issue on x86 tends to prove me
>>>>>>>>>> right.
>>>>>>>>>
>>>>>>>>> It's not wrong, it's just (costly, on larger machines) overkill, as the
>>>>>>>>> other cores either see the lock release and all prior changes committed,
>>>>>>>>> or the lock taken (and the prior changes do not matter then). They will
>>>>>>>>> never see later changes committed before the lock becomes visible as free.
>>>>>>>>
>>>>>>>> I agree. But this is true on all architectures, not just on strictly
>>>>>>>> ordered ones; this is just due to how barriers work on SMP systems, as I
>>>>>>>> explained.
>>>>>>>>
>>>>>>>>> That's architecturally guaranteed, and that's why you have no memory
>>>>>>>>> barriers in x86 spinlock release operations.
>>>>>>>>
>>>>>>>> I disagree. As explained in the paragraph just below the one you quote,
>>>>>>>> I believe this is an optimization which is almost valid on any
>>>>>>>> architecture. Almost valid, because if the cpu which has done the unlock
>>>>>>>> takes the lock again without enough time for a barrier in between to
>>>>>>>> synchronize the cpus, we have a problem: the other cpus will never
>>>>>>>> see the spinlock as free. With ticket spinlocks, you just add a store on
>>>>>>>> the cpu which spins, and you have to add a barrier after that if you
>>>>>>>> want the barrier before the read, on the cpu which will acquire the lock,
>>>>>>>> to see that the spinlock is contended. So I do not see how this requires
>>>>>>>> fewer barriers.
>>>>>>>
>>>>>>> Ticket locks prevent unfair starvation without the closing barrier as
>>>>>>> they grant the next ticket to the next waiter, not the current holder.
>>>>>>> See the Linux implementation.
>>>>>>
>>>>>> Whether to put the closing barrier after the last store is orthogonal
>>>>>> to whether ticket locks are implemented or not. This is all a question of
>>>>>> tradeoffs.
>>>>>>
>>>>>> Without the barrier after the last store, you increase the spinning time
>>>>>> due to time taken for the store to be visible on other cpus, but you
>>>>>> optimize the overhead of unlocking.
>>>>>>
>>>>>> With ticket spinlocks you avoid the starvation situation, at the expense
>>>>>> of increasing the overhead of spinlock operations.
>>>>>>
>>>>>> I do not know which is worse. I suspect all this does not make much of a
>>>>>> difference, and what dominates is the duration of spinlock sections anyway.
>>>>>
>>>>> I think the way the classic Linux spinlocks did this on x86 provides the answer.
>>>>
>>>> The situation is completely different: Linux spinlocks are well split,
>>>> Xenomai basically has only one spinlock, so chances are that it will be
>>>> more contended, and so the heavy unlock path (the one which implements
>>>> the ticket stuff) will be triggered more often. Also, the Xenomai
>>>> spinlock (we can lose the "s" anyway) being more contended, the
>>>> "pending store barrier" optimization may in fact be detrimental. And
>>>> finally, due to the way spinlocks are split, Linux has scalability
>>>> issues that Xenomai cannot even begin to imagine tackling.
>>>>
>>>> Anyway, the discussion is kind of moot, because as I said, we are not
>>>> going to change the spinlock implementation in 2.6. What we are
>>>> discussing here is whether to put the barrier after the atomic_set, or
>>>> whether to put that barrier where it is really needed: in the snapshot
>>>> code, and what to do for forge. I also agree that the barrier before the
>>>> atomic_set in xnlock_put is not needed on x86 and proposed an
>>>> architecture macro to replace it with a compiler barrier in that case.
>>>
>>> Yes, seems reasonable.
>>>
>>>>
>>>> I also proposed to replace the atomic_set with a cmpxchg; cmpxchg has
>>>> two barriers on ARM, but I guess on x86 it is only one barrier. This
>>>> would solve the architecture dependency nicely.
>>>
>>> That saves an abstraction, but I have no clue if "mfence" is as
>>> expensive as "lock cmpxchg". If it is, that's fine, but I suspect it's
>>> not (due to the cacheline "lock").
>>
>> Unless I misunderstand something in the Linux code, it also uses the
>> "lock" prefix for unlocking ticket spinlocks, either with a "lock; add"
>> or with a "lock; cmpxchg".
> 
> I'm looking at arch/x86/include/asm/spinlock.h, and there is only a
> non-atomic __add there. Same in the disassembly.

Indeed, the lock prefix is only used for some erratum on x86_32.
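
Roughly, from a 3.x tree (so take the exact spelling with a grain of
salt), arch/x86/include/asm/spinlock.h has:

#if defined(CONFIG_X86_32) && \
	(defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
/*
 * On PPro SMP or if we are using OOSTORE, we use a locked operation to
 * unlock (PPro errata 66, 92)
 */
# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
# define UNLOCK_LOCK_PREFIX
#endif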

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 11:59                               ` Jan Kiszka
  2014-09-18 12:11                                 ` Gilles Chanteperdrix
  2014-09-18 12:17                                 ` Gilles Chanteperdrix
@ 2014-09-18 20:21                                 ` Gilles Chanteperdrix
  2014-09-19  2:06                                   ` Gilles Chanteperdrix
  2 siblings, 1 reply; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-18 20:21 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 01:59 PM, Jan Kiszka wrote:
> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>  (*) There is no guarantee that a CPU will see the correct order of
>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>> memory barrier, unless the first CPU _also_ uses a matching memory
>> barrier (see the subsection on "SMP Barrier Pairing").
> 
> [quick answer]
> 
> ...or the architecture refrains from reordering write requests, like x86
> does. What may happen, though, is that the compiler reorders the writes.
> Therefore you need at least a (much cheaper) compiler barrier on those
> archs. See also linux/Documentation/memory-barriers.txt on this and more.

The passage quoted above comes from memory-barriers.txt, and I find it
makes it pretty clear that the two barriers are needed for cache
synchronization in the general case. Now, I have read more of
memory-barriers.txt, and I cannot easily find details about what the
fact that x86 is "strictly ordered" means, and which of these ordering
rules it relaxes. Maybe you would care to point us to the exact passage
where this is mentioned? Also, I would welcome any detail about how SMP
cache synchronization actually works on x86.
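
For instance, does "strictly ordered" exclude the effects of the store
buffer in the classic litmus test below (hypothetical example, x and y
both initially 0, no barriers anywhere)?

	CPU 0			CPU 1
	x = 1;			y = 1;
	r0 = y;			r1 = x;

Can r0 == 0 and r1 == 0 be observed, and if so, is this store->load
case the only reordering left on x86?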

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-18 20:21                                 ` Gilles Chanteperdrix
@ 2014-09-19  2:06                                   ` Gilles Chanteperdrix
  2014-09-19  5:41                                     ` Jan Kiszka
  2014-09-19 10:51                                     ` Gilles Chanteperdrix
  0 siblings, 2 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-19  2:06 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/18/2014 10:21 PM, Gilles Chanteperdrix wrote:
> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>  (*) There is no guarantee that a CPU will see the correct order of
>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>> barrier (see the subsection on "SMP Barrier Pairing").
>>
>> [quick answer]
>>
>> ...or the architecture refrains from reordering write requests, like x86
>> does. What may happen, though, is that the compiler reorders the writes.
>> Therefore you need at least a (much cheaper) compiler barrier on those
>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
> 
> The passage quoted above comes from memory-barriers.txt, and I find it
> makes it pretty clear that the two barriers are needed for cache
> synchronization in the general case. Now, I have read more of
> memory-barriers.txt, and I cannot easily find details about what the
> fact that x86 is "strictly ordered" means, and which of these ordering
> rules it relaxes. Maybe you would care to point us to the exact passage
> where this is mentioned? Also, I would welcome any detail about how SMP
> cache synchronization actually works on x86.

Ok, I have read a few things; it would seem recent x86 architectures
(Nehalem, Sandy Bridge and probably Haswell) use the MESIF cache
coherence protocol, with a twist for Haswell since it introduced
transactional memory. A cache coherence protocol in theory ensures,
transparently, the same view of the cache on all cpus. MESIF itself is
derived from the MESI cache coherence protocol, which is said (by the
Wikipedia article) to have some performance issues that are generally
compensated by adding a store buffer, which in turn requires memory
barriers for a store on one cpu to become visible in the cache (and so
on other cpus). I did not find any indication that memory barriers are
still needed for this case (which is exactly the case we are interested
in) with MESIF, but no indication that they are not needed either.
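
To make the "two barriers" pairing rule concrete, here is how I picture
it in portable terms (a hypothetical C11 sketch, not kernel code):

#include <stdatomic.h>

int data;
atomic_int ready;

/* producer, on one cpu */
void produce(void)
{
	data = 42;
	/* write-side barrier: commit 'data' before the 'ready' flag */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/* consumer, on another cpu */
int consume(void)
{
	/* read-side barrier: order the flag load before the data load */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;
	return data;	/* guaranteed to read 42 */
}

Remove either of the two ordering constraints and the consumer may read
stale data; that is the pairing the quoted passage describes.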

Then, I had a look at the ticket spinlock implementations. The
operations they do are roughly the same as in the xnlock
implementation, except that they are optimized for each architecture,
and so remove the useless barriers. The ARM implementation has the
barrier after unlock, and additionally uses the special "sev"
instruction, allowing the spinning cpu to wait for this signal with the
"wfe" (wait for event) instruction, so as not to burn cpu power when
spinning. In fact it does not spin at all.
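
Stripped of the per-architecture details, the granting scheme boils
down to something like this (a simplified, hypothetical sketch in C11,
ignoring wrap-around and the sev/wfe optimization):

#include <stdatomic.h>

typedef struct {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
} ticket_lock_t;

static inline void ticket_lock(ticket_lock_t *l)
{
	/* grab a ticket; the atomic RMW makes the waiter visible */
	unsigned int t = atomic_fetch_add_explicit(&l->next, 1,
						   memory_order_relaxed);
	/* spin until our ticket is served; acquire ordering keeps the
	   critical section after the lock acquisition */
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != t)
		;
}

static inline void ticket_unlock(ticket_lock_t *l)
{
	/* grant the lock to the next waiter; release ordering commits
	   the critical section stores before the grant is visible */
	atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}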

Of course, the problem is that they are not recursive, so implementing
recursive ticket spinlocks without adding overhead seems tricky. Just
to test whether ticket spinlocks solve the issue which started this
thread, I made the following implementation:

typedef struct {
	unsigned owner;
	arch_spinlock_t alock;
} xnlock_t;

static inline int __xnlock_get(xnlock_t *lock /*, */
XNLOCK_DBG_CONTEXT_ARGS)
{
	unsigned long long start;
	int cpu = xnarch_current_cpu();

	if (lock->owner == cpu)
		return 1;

	xnlock_dbg_prepare_acquire(&start);

	arch_spin_lock(&lock->alock);
	lock->owner = cpu;

	xnlock_dbg_acquired(lock, cpu, &start /*, */ XNLOCK_DBG_PASS_CONTEXT);

	return 0;
}

static inline void xnlock_put(xnlock_t *lock)
{
	if (xnlock_dbg_release(lock))
		return;

	lock->owner = ~0U;
	arch_spin_unlock(&lock->alock);
}
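
A call site would use it the same way as today, releasing only at the
outermost nesting level (hypothetical usage):

static void example(void)
{
	/* returns 1 if we already own the lock, i.e. we are nested */
	int nested = __xnlock_get(&nklock /*, */ XNLOCK_DBG_CONTEXT);

	/* ... critical section, possibly relocking nklock ... */

	if (!nested)
		xnlock_put(&nklock);
}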

And the good news is: yes, this avoids the issue with
/proc/xenomai/stat. The bad news is that it does not answer the
question about the visibility on one cpu of stores made on another cpu
without a barrier, because the ticket spinlocks work either way on x86:
the atomic add at the beginning of arch_spin_lock makes both the
presence of a waiter visible to the cpu attempting to relock, and the
unlocked state visible to the waiting cpu. So, in the particular case
of the concurrent cat /proc/xenomai/stat, the "two barriers needed for
visibility" rule is respected.

I have also measured latencies with a cat /proc/xenomai/stat loop
running, with and without a memory barrier after arch_spin_unlock, and
could not find any difference: minimum, average and maximum latencies
after a few minutes of runtime are the same, or differ by less than
100 ns.

I am also wondering if this xnlock implementation could be used on
forge. It has the advantage of benefiting from each architecture's
optimized spinlock, without the need to maintain architecture-dependent
code.

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-19  2:06                                   ` Gilles Chanteperdrix
@ 2014-09-19  5:41                                     ` Jan Kiszka
  2014-09-19  7:04                                       ` Philippe Gerum
  2014-09-19 10:51                                     ` Gilles Chanteperdrix
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Kiszka @ 2014-09-19  5:41 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Jeroen Van den Keybus; +Cc: xenomai

On 2014-09-19 04:06, Gilles Chanteperdrix wrote:
> On 09/18/2014 10:21 PM, Gilles Chanteperdrix wrote:
>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>
>>> [quick answer]
>>>
>>> ...or the architecture refrains from reordering write requests, like x86
>>> does. What may happen, though, is that the compiler reorders the writes.
>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>
>> The passage quoted above comes from memory-barriers.txt, and I find it
>> makes it pretty clear that the two barriers are needed for cache
>> synchronization in the general case. Now, I have read more of
>> memory-barriers.txt, and I cannot easily find details about what the
>> fact that x86 is "strictly ordered" means, and which of these ordering
>> rules it relaxes. Maybe you would care to point us to the exact passage
>> where this is mentioned? Also, I would welcome any detail about how SMP
>> cache synchronization actually works on x86.
> 
> Ok, I have read a few things; it would seem recent x86 architectures
> (Nehalem, Sandy Bridge and probably Haswell) use the MESIF cache
> coherence protocol, with a twist for Haswell since it introduced
> transactional memory. A cache coherence protocol in theory ensures,
> transparently, the same view of the cache on all cpus. MESIF itself is
> derived from the MESI cache coherence protocol, which is said (by the
> Wikipedia article) to have some performance issues that are generally
> compensated by adding a store buffer, which in turn requires memory
> barriers for a store on one cpu to become visible in the cache (and so
> on other cpus). I did not find any indication that memory barriers are
> still needed for this case (which is exactly the case we are interested
> in) with MESIF, but no indication that they are not needed either.
> 
> Then, I had a look at the ticket spinlock implementations. The
> operations they do are roughly the same as in the xnlock
> implementation, except that they are optimized for each architecture,
> and so remove the useless barriers. The ARM implementation has the
> barrier after unlock, and additionally uses the special "sev"
> instruction, allowing the spinning cpu to wait for this signal with the
> "wfe" (wait for event) instruction, so as not to burn cpu power when
> spinning. In fact it does not spin at all.
> 
> Of course, the problem is that they are not recursive, so implementing
> recursive ticket spinlocks without adding overhead seems tricky. Just
> to test whether ticket spinlocks solve the issue which started this
> thread, I made the following implementation:
> 
> typedef struct {
> 	unsigned owner;
> 	arch_spinlock_t alock;
> } xnlock_t;
> 
> static inline int __xnlock_get(xnlock_t *lock /*, */
> XNLOCK_DBG_CONTEXT_ARGS)
> {
> 	unsigned long long start;
> 	int cpu = xnarch_current_cpu();
> 
> 	if (lock->owner == cpu)
> 		return 1;
> 
> 	xnlock_dbg_prepare_acquire(&start);
> 
> 	arch_spin_lock(&lock->alock);
> 	lock->owner = cpu;
> 
> 	xnlock_dbg_acquired(lock, cpu, &start /*, */ XNLOCK_DBG_PASS_CONTEXT);
> 
> 	return 0;
> }
> 
> static inline void xnlock_put(xnlock_t *lock)
> {
> 	if (xnlock_dbg_release(lock))
> 		return;
> 
> 	lock->owner = ~0U;
> 	arch_spin_unlock(&lock->alock);
> }
> 
> And the good news is: yes, this avoids the issue with
> /proc/xenomai/stat. The bad news is that it does not answer the
> question about the visibility on one cpu of stores made on another cpu
> without a barrier, because the ticket spinlocks work either way on x86:
> the atomic add at the beginning of arch_spin_lock makes both the
> presence of a waiter visible to the cpu attempting to relock, and the
> unlocked state visible to the waiting cpu. So, in the particular case
> of the concurrent cat /proc/xenomai/stat, the "two barriers needed for
> visibility" rule is respected.
> 
> I have also measured latencies with a cat /proc/xenomai/stat loop
> running, with and without a memory barrier after arch_spin_unlock, and
> could not find any difference: minimum, average and maximum latencies
> after a few minutes of runtime are the same, or differ by less than
> 100 ns.
> 
> I am also wondering if this xnlock implementation could be used on
> forge. It has the advantage of benefiting from each architecture's
> optimized spinlock, without the need to maintain architecture-dependent
> code.
>

Indeed, that would be very elegant!

Jan


-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-19  5:41                                     ` Jan Kiszka
@ 2014-09-19  7:04                                       ` Philippe Gerum
  0 siblings, 0 replies; 40+ messages in thread
From: Philippe Gerum @ 2014-09-19  7:04 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: Jan Kiszka, xenomai

On 09/19/2014 07:41 AM, Jan Kiszka wrote:
> On 2014-09-19 04:06, Gilles Chanteperdrix wrote:
>> On 09/18/2014 10:21 PM, Gilles Chanteperdrix wrote:
>>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>>
>>>> [quick answer]
>>>>
>>>> ...or the architecture refrains from reordering write requests, like x86
>>>> does. What may happen, though, is that the compiler reorders the writes.
>>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>>
>>> The passage quoted above comes from memory-barriers.txt, and I find it
>>> makes it pretty clear that the two barriers are needed for cache
>>> synchronization in the general case. Now, I have read more of
>>> memory-barriers.txt, and I cannot easily find details about what the
>>> fact that x86 is "strictly ordered" means, and which of these ordering
>>> rules it relaxes. Maybe you would care to point us to the exact passage
>>> where this is mentioned? Also, I would welcome any detail about how SMP
>>> cache synchronization actually works on x86.
>>
>> Ok, I have read a few things; it would seem recent x86 architectures
>> (Nehalem, Sandy Bridge and probably Haswell) use the MESIF cache
>> coherence protocol, with a twist for Haswell since it introduced
>> transactional memory. A cache coherence protocol in theory ensures,
>> transparently, the same view of the cache on all cpus. MESIF itself is
>> derived from the MESI cache coherence protocol, which is said (by the
>> Wikipedia article) to have some performance issues that are generally
>> compensated by adding a store buffer, which in turn requires memory
>> barriers for a store on one cpu to become visible in the cache (and so
>> on other cpus). I did not find any indication that memory barriers are
>> still needed for this case (which is exactly the case we are interested
>> in) with MESIF, but no indication that they are not needed either.
>>
>> Then, I had a look at the ticket spinlock implementations. The
>> operations they do are roughly the same as in the xnlock
>> implementation, except that they are optimized for each architecture,
>> and so remove the useless barriers. The ARM implementation has the
>> barrier after unlock, and additionally uses the special "sev"
>> instruction, allowing the spinning cpu to wait for this signal with the
>> "wfe" (wait for event) instruction, so as not to burn cpu power when
>> spinning. In fact it does not spin at all.
>>
>> Of course, the problem is that they are not recursive, so implementing
>> recursive ticket spinlocks without adding overhead seems tricky. Just
>> to test whether ticket spinlocks solve the issue which started this
>> thread, I made the following implementation:
>>
>> typedef struct {
>> 	unsigned owner;
>> 	arch_spinlock_t alock;
>> } xnlock_t;
>>
>> static inline int __xnlock_get(xnlock_t *lock /*, */
>> XNLOCK_DBG_CONTEXT_ARGS)
>> {
>> 	unsigned long long start;
>> 	int cpu = xnarch_current_cpu();
>>
>> 	if (lock->owner == cpu)
>> 		return 1;
>>
>> 	xnlock_dbg_prepare_acquire(&start);
>>
>> 	arch_spin_lock(&lock->alock);
>> 	lock->owner = cpu;
>>
>> 	xnlock_dbg_acquired(lock, cpu, &start /*, */ XNLOCK_DBG_PASS_CONTEXT);
>>
>> 	return 0;
>> }
>>
>> static inline void xnlock_put(xnlock_t *lock)
>> {
>> 	if (xnlock_dbg_release(lock))
>> 		return;
>>
>> 	lock->owner = ~0U;
>> 	arch_spin_unlock(&lock->alock);
>> }
>>
>> And the good news is: yes, this avoids the issue with
>> /proc/xenomai/stat. The bad news is that it does not answer the
>> question about the visibility on one cpu of stores made on another cpu
>> without a barrier, because the ticket spinlocks work either way on x86:
>> the atomic add at the beginning of arch_spin_lock makes both the
>> presence of a waiter visible to the cpu attempting to relock, and the
>> unlocked state visible to the waiting cpu. So, in the particular case
>> of the concurrent cat /proc/xenomai/stat, the "two barriers needed for
>> visibility" rule is respected.
>>
>> I have also measured latencies with a cat /proc/xenomai/stat loop
>> running, with and without a memory barrier after arch_spin_unlock, and
>> could not find any difference: minimum, average and maximum latencies
>> after a few minutes of runtime are the same, or differ by less than
>> 100 ns.
>>
>> I am also wondering if this xnlock implementation could be used on
>> forge. It has the advantage of benefiting from each architecture's
>> optimized spinlock, without the need to maintain architecture-dependent
>> code.
>>
> 
> Indeed, that would be very elegant!
> 

Ack. While you are at it, could you please think of some debug
instrumentation for tracking lock nesting? As we discussed earlier, at
some point after 3.0 is out, we may want to get rid of the recursion
support in xnlocks, to eventually map 1:1 over native spinlocks. That
would help with this process.
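
Even something as simple as flagging the recursion path would help spot
the nested call sites (hypothetical sketch, config switch and message
made up):

	if (lock->owner == cpu) {
#ifdef CONFIG_XENO_OPT_DEBUG_NUCLEUS
		/* nested acquisition: report the caller, so the call
		   site can be reworked before recursion support goes */
		printk_once(KERN_WARNING "xnlock: nested lock at %pS\n",
			    __builtin_return_address(0));
#endif
		return 1;
	}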

-- 
Philippe.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Xenomai] Reading /proc/xenomai/stat causes high latencies
  2014-09-19  2:06                                   ` Gilles Chanteperdrix
  2014-09-19  5:41                                     ` Jan Kiszka
@ 2014-09-19 10:51                                     ` Gilles Chanteperdrix
  1 sibling, 0 replies; 40+ messages in thread
From: Gilles Chanteperdrix @ 2014-09-19 10:51 UTC (permalink / raw)
  To: Jan Kiszka, Jeroen Van den Keybus; +Cc: xenomai

On 09/19/2014 04:06 AM, Gilles Chanteperdrix wrote:
> On 09/18/2014 10:21 PM, Gilles Chanteperdrix wrote:
>> On 09/18/2014 01:59 PM, Jan Kiszka wrote:
>>> On 2014-09-18 13:46, Gilles Chanteperdrix wrote:
>>>>  (*) There is no guarantee that a CPU will see the correct order of
>>>> effects from a second CPU's accesses, even _if_ the second CPU uses a
>>>> memory barrier, unless the first CPU _also_ uses a matching memory
>>>> barrier (see the subsection on "SMP Barrier Pairing").
>>>
>>> [quick answer]
>>>
>>> ...or the architecture refrains from reordering write requests, like x86
>>> does. What may happen, though, is that the compiler reorders the writes.
>>> Therefore you need at least a (much cheaper) compiler barrier on those
>>> archs. See also linux/Documentation/memory-barriers.txt on this and more.
>>
>> The passage quoted above comes from memory-barriers.txt, and I find it
>> makes it pretty clear that the two barriers are needed for cache
>> synchronization in the general case. Now, I have read more of
>> memory-barriers.txt, and I cannot easily find details about what the
>> fact that x86 is "strictly ordered" means, and which of these ordering
>> rules it relaxes. Maybe you would care to point us to the exact passage
>> where this is mentioned? Also, I would welcome any detail about how SMP
>> cache synchronization actually works on x86.
> 
> Ok, I have read a few things; it would seem recent x86 architectures
> (Nehalem, Sandy Bridge and probably Haswell) use the MESIF cache
> coherence protocol, with a twist for Haswell since it introduced
> transactional memory. A cache coherence protocol in theory ensures,
> transparently, the same view of the cache on all cpus. MESIF itself is
> derived from the MESI cache coherence protocol, which is said (by the
> Wikipedia article) to have some performance issues that are generally
> compensated by adding a store buffer, which in turn requires memory
> barriers for a store on one cpu to become visible in the cache (and so
> on other cpus). I did not find any indication that memory barriers are
> still needed for this case (which is exactly the case we are interested
> in) with MESIF, but no indication that they are not needed either.

Thinking more about this, the store buffer is there for timing reasons
(because getting the cache line from another cpu takes time), so I
suspect the barrier does not in fact really flush the buffer, but waits
for it to drain. This means that issuing the barrier will not, in fact,
change the timing of the visibility of the last store on a distant cpu;
it will simply stall the current cpu.

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2014-09-19 10:51 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-22 16:02 [Xenomai] Reading /proc/xenomai/stat causes high latencies Jeroen Van den Keybus
2014-04-23  9:14 ` Jeroen Van den Keybus
2014-04-23 13:45   ` Jeroen Van den Keybus
2014-04-23 14:07     ` Gilles Chanteperdrix
2014-04-23 20:54       ` Jeroen Van den Keybus
2014-04-23 20:56         ` Gilles Chanteperdrix
2014-04-23 21:39           ` Jeroen Van den Keybus
2014-04-23 22:25             ` Gilles Chanteperdrix
2014-04-24  8:57               ` Jeroen Van den Keybus
2014-04-24 14:46                 ` Jeroen Van den Keybus
2014-04-25  8:15                   ` Jeroen Van den Keybus
2014-04-25 10:44                     ` Jeroen Van den Keybus
2014-09-09 21:03                       ` Gilles Chanteperdrix
2014-09-10 13:50                         ` Jeroen Van den Keybus
2014-09-10 19:47                           ` Gilles Chanteperdrix
2014-09-11  5:11                         ` Jan Kiszka
2014-09-11  5:19                           ` Jan Kiszka
2014-09-18 11:46                             ` Gilles Chanteperdrix
2014-09-18 11:59                               ` Jan Kiszka
2014-09-18 12:11                                 ` Gilles Chanteperdrix
2014-09-18 12:17                                 ` Gilles Chanteperdrix
2014-09-18 12:20                                   ` Jan Kiszka
2014-09-18 13:05                                     ` Gilles Chanteperdrix
2014-09-18 13:26                                       ` Jan Kiszka
2014-09-18 13:44                                         ` Gilles Chanteperdrix
2014-09-18 16:14                                           ` Jan Kiszka
2014-09-18 16:28                                             ` Gilles Chanteperdrix
2014-09-18 18:39                                               ` Gilles Chanteperdrix
2014-09-18 19:23                                                 ` Jan Kiszka
2014-09-18 19:31                                                   ` Gilles Chanteperdrix
2014-09-18 19:09                                               ` Jan Kiszka
2014-09-18 19:32                                                 ` Gilles Chanteperdrix
2014-09-18 19:56                                                   ` Jan Kiszka
2014-09-18 20:13                                                     ` Gilles Chanteperdrix
2014-09-18 20:21                                 ` Gilles Chanteperdrix
2014-09-19  2:06                                   ` Gilles Chanteperdrix
2014-09-19  5:41                                     ` Jan Kiszka
2014-09-19  7:04                                       ` Philippe Gerum
2014-09-19 10:51                                     ` Gilles Chanteperdrix
2014-09-16 11:09                           ` Gilles Chanteperdrix

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.