* [Xenomai]  i.MX6q memory write causes high latency
@ 2018-07-04 17:06 Federico Sbalchiero
  2018-07-04 23:14 ` Tom Evans
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-04 17:06 UTC (permalink / raw)
  To: xenomai

Hi,
first of all, thanks to everyone involved in Xenomai for their work.

I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on a Freescale/NXP i.MX6q
SABRE SD board using Yocto. The system boots fine and is stable, but latency
under load (xeno-test) is higher than on my reference system (Xenomai
2.6.5 on the Freescale 3.10.17 kernel + ipipe 3.10.18).
This is after disabling power management, frequency scaling, CMA,
graphics, tracing and debugging options.

I have found that a simple non-realtime user-space process writing to a
memory buffer (memwrite) is enough to trigger such high latencies.
Latency worsens a lot when running a copy of the process on each core.
There is a correlation between buffer size and cache size, suggesting
an L2 cache issue like the L2 write-allocate problem discussed on the mailing
list, but I can confirm L2 WA is disabled (see the log below).

I'm looking for comments or suggestions.

Thanks,
Federico


"memwrite" test case:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned char *buffer;

int main(int argc, char **argv)
{
    int i;
    int count = 0;                  /* number of passes over the buffer */
    int n;
    int size = 10 * 1024 * 1024;    /* 10 MiB, much larger than the 1 MiB L2 */
    volatile unsigned *pt;

    printf("load system by writing in memory\n");
    buffer = malloc(size);
    if (buffer == NULL) {
        printf("buffer allocation failed\n");
        exit(1);
    }
    n = size / sizeof(unsigned);
    while (1) {
        /* write some data to the memory buffer */
        pt = (unsigned *) buffer;
        for (i = 0; i < n; i++)
            *pt++ = i;
        count++;
    }
    return 0;                       /* never reached */
}
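For reference, here is a minimal sketch of how this load scenario can be
reproduced alongside the latency test. How the four instances were started is
not stated above, so the taskset pinning and the /usr/xenomai/bin install path
are assumptions:

# build the load generator (plain gcc, no Xenomai dependency)
gcc -O2 -o memwrite memwrite.c

# start one non-realtime writer per core (the i.MX6q has four cores)
for cpu in 0 1 2 3; do
    taskset -c $cpu ./memwrite &
done

# in another shell, run the Cobalt latency test
# (defaults: 1000 us periodic user-mode task, priority 99)
/usr/xenomai/bin/latency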

xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664

xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767

latency (no xeno-test!) with four memwrite instances on Xenomai 3.0.7 
and ipipe-arm/4.14:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     41.999|     60.200|    101.333|       0|     0| 41.999|    101.333
RTD|     42.666|     60.471|    103.666|       0|     0| 41.999|    103.666
RTD|     36.999|     60.568|     96.999|       0|     0| 36.999|    103.666
RTD|     38.998|     60.535|     95.332|       0|     0| 36.999|    103.666
RTD|     35.332|     60.359|    105.998|       0|     0| 35.332|    105.998
RTD|     41.998|     60.762|    116.665|       0|     0| 35.332|    116.665
RTD|     40.664|     60.419|     96.331|       0|     0| 35.332|    116.665
RTD|     41.331|     60.678|     91.331|       0|     0| 35.332|    116.665
RTD|     38.997|     60.923|     94.664|       0|     0| 35.332|    116.665
RTD|     42.664|     60.505|     95.664|       0|     0| 35.332|    116.665
RTD|     40.330|     60.175|     95.664|       0|     0| 35.332|    116.665
RTD|     40.663|     60.368|     96.663|       0|     0| 35.332|    116.665
RTD|     42.330|     60.286|     95.330|       0|     0| 35.332|    116.665
RTD|     36.996|     60.304|    100.663|       0|     0| 35.332|    116.665
RTD|     41.996|     61.100|    103.663|       0|     0| 35.332|    116.665
RTD|     40.996|     60.792|    102.996|       0|     0| 35.332|    116.665
RTD|     38.662|     60.205|    104.329|       0|     0| 35.332|    116.665
RTD|     39.995|     60.645|     98.995|       0|     0| 35.332|    116.665
RTD|     37.328|     60.300|     99.662|       0|     0| 35.332|    116.665
RTD|     40.995|     60.206|     98.328|       0|     0| 35.332|    116.665
RTD|     36.994|     60.444|     95.661|       0|     0| 35.332|    116.665

Linux 4.14 menuconfig:
   CPU Power Management\CPU Frequency scaling -> no
   CPU Power Management\CPU Idle PM support -> no
   Power management options\Suspend to RAM and standby -> no
   Power management options\Device power management core functionality -> no
   Device Drivers\Graphics support\DRM Support for Freescale i.MX -> no
   Device Drivers\Graphics support\ETNAVIV (DRM support for Vivante GPU IP cores) -> no
   Kernel Features\Timer frequency -> 200 Hz
   Kernel Features\Allow for memory compaction -> no
   Kernel Features\Contiguous Memory Allocator -> no
   Kernel hacking\Tracers -> no
   Kernel hacking\Lock Debugging (spinlocks, mutexes, etc...) -> all no
   Kernel hacking\Stack backtrace support -> no
   Bus support\PCI support -> no
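A quick way to double-check on the running system that these options really
are out of the picture (a sketch; the sysfs nodes below only exist when the
corresponding subsystems are compiled in, and /proc/config.gz needs
CONFIG_IKCONFIG_PROC):

# cpufreq / cpuidle directories should be absent with scaling and idle PM disabled
ls /sys/devices/system/cpu/cpu0/cpufreq 2>/dev/null || echo "cpufreq: not built"
ls /sys/devices/system/cpu/cpu0/cpuidle 2>/dev/null || echo "cpuidle: not built"
# verify timer frequency and CMA from the running configuration
zcat /proc/config.gz | grep -E 'CONFIG_HZ=|CONFIG_CMA'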

Boot log:
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.36-fslc+gc664b053e5ce 
(oe-user@oe-host) (gcc version 7.3.0 (GCC)) #1 SMP Tue Jun 26 08:49:22 
UTC 2018
[    0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), 
cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[    0.000000] OF: fdt: Machine model: Freescale i.MX6 Quad SABRE Smart 
Device Board
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] On node 0 totalpages: 262144
[    0.000000] free_area_init_node: node 0, pgdat c0d62900, node_mem_map 
ef7f8000
[    0.000000]   Normal zone: 1536 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 196608 pages, LIFO batch:31
[    0.000000]   HighMem zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 19 pages/cpu @ef78a000 s46924 r8192 
d22708 u77824
[    0.000000] pcpu-alloc: s46924 r8192 d22708 u77824 alloc=19*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 260608
[    0.000000] Kernel command line: console=ttymxc0,115200 root=PARTUUID=1bfc1072-02 rootwait rw
[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 
bytes)
[    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 
bytes)
[    0.000000] Memory: 1025492K/1048576K available (8192K kernel code, 
397K rwdata, 2220K rodata, 1024K init, 885K bss, 23084K reserved, 0K 
cma-reserved, 262144K highmem)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xf0800000 - 0xff800000   ( 240 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xf0000000   ( 768 MB)
[    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
[    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
[    0.000000]       .text : 0xc0008000 - 0xc0900000   (9184 kB)
[    0.000000]       .init : 0xc0c00000 - 0xc0d00000   (1024 kB)
[    0.000000]       .data : 0xc0d00000 - 0xc0d63700   ( 398 kB)
[    0.000000]        .bss : 0xc0d65000 - 0xc0e4255c   ( 886 kB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] L2C: I-pipe: l2x0_write_allocate= not specified, defaults 
to 0 (disabled).
[    0.000000] L2C: DT/platform modifies aux control register: 
0x32470000 -> 0x32c70000
[    0.000000] L2C-310 errata 752271 769419 enabled
[    0.000000] L2C-310 enabling early BRESP for Cortex-A9
[    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
[    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
[    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
[    0.000000] I-pipe, 396.000 MHz timer
[    0.000000] Switching to timer-based delay loop, resolution 333ns
[    0.000007] sched_clock: 32 bits at 3000kHz, resolution 333ns, wraps 
every 715827882841ns
[    0.000027] clocksource: mxc_timer1: mask: 0xffffffff max_cycles: 
0xffffffff, max_idle_ns: 637086815595 ns
[    0.000042] I-pipe, 3.000 MHz clocksource, wrap in 1431655 ms
[    0.000059] clocksource: ipipe_tsc: mask: 0xffffffffffffffff 
max_cycles: 0x1623fa770, max_idle_ns: 881590404476 ns
[    0.001630] Interrupt pipeline (release #1)
[    0.001784] Console: colour dummy device 80x30
[    0.001816] Calibrating delay loop (skipped), value calculated using 
timer frequency.. 6.00 BogoMIPS (lpj=15000)
[    0.001833] pid_max: default: 32768 minimum: 301
[    0.001978] Mount-cache hash table entries: 2048 (order: 1, 8192 bytes)
[    0.002002] Mountpoint-cache hash table entries: 2048 (order: 1, 8192 
bytes)
[    0.002545] CPU: Testing write buffer coherency: ok
[    0.002902] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.003310] Setting up static identity map for 0x10100000 - 0x10100078
[    0.003453] Hierarchical SRCU implementation.
[    0.004120] smp: Bringing up secondary CPUs ...
[    0.004748] I-pipe, 396.000 MHz timer
[    0.004770] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.005491] I-pipe, 396.000 MHz timer
[    0.005512] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    0.006237] I-pipe, 396.000 MHz timer
[    0.006257] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    0.006373] smp: Brought up 1 node, 4 CPUs
[    0.006405] SMP: Total of 4 processors activated (24.00 BogoMIPS).
[    0.006415] CPU: All CPU(s) started in SVC mode.
[    0.007527] devtmpfs: initialized
[    0.016773] random: get_random_u32 called from 
bucket_table_alloc+0x100/0x230 with crng_init=0
[    0.016890] VFP support v0.3: implementor 41 architecture 3 part 30 
variant 9 rev 4
[    0.017029] clocksource: jiffies: mask: 0xffffffff max_cycles: 
0xffffffff, max_idle_ns: 9556302231375000 ns
[    0.017054] futex hash table entries: 1024 (order: 4, 65536 bytes)
[    0.017282] pinctrl core: initialized pinctrl subsystem
[    0.018032] NET: Registered protocol family 16
[    0.019872] DMA: preallocated 256 KiB pool for atomic coherent 
allocations
[    0.020697] CPU identified as i.MX6Q, silicon rev 1.1
[    0.026849] vdd1p1: supplied by regulator-dummy
[    0.027291] vdd3p0: supplied by regulator-dummy
[    0.027687] vdd2p5: supplied by regulator-dummy
[    0.037718] hw-breakpoint: found 5 (+1 reserved) breakpoint and 1 
watchpoint registers.
[    0.037731] hw-breakpoint: maximum watchpoint size is 4 bytes.
[    0.039110] imx6q-pinctrl 20e0000.iomuxc: initialized IMX pinctrl driver
[    0.060477] mxs-dma 110000.dma-apbh: initialized
[    0.063177] SCSI subsystem initialized
[    0.063409] libata version 3.00 loaded.
[    0.063601] usbcore: registered new interface driver usbfs
[    0.063658] usbcore: registered new interface driver hub
[    0.063755] usbcore: registered new device driver usb
[    0.064922] i2c i2c-0: IMX I2C adapter registered
[    0.064941] i2c i2c-0: can't use DMA, using PIO instead.
[    0.065483] i2c i2c-1: IMX I2C adapter registered
[    0.065501] i2c i2c-1: can't use DMA, using PIO instead.
[    0.065939] i2c i2c-2: IMX I2C adapter registered
[    0.065955] i2c i2c-2: can't use DMA, using PIO instead.
[    0.066077] media: Linux media interface: v0.10
[    0.066129] Linux video capture interface: v2.00
[    0.066188] pps_core: LinuxPPS API ver. 1 registered
[    0.066197] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 
Rodolfo Giometti <giometti@linux.it>
[    0.066219] PTP clock support registered
[    0.066476] Advanced Linux Sound Architecture Driver Initialized.
[    0.067494] Bluetooth: Core ver 2.22
[    0.067538] NET: Registered protocol family 31
[    0.067548] Bluetooth: HCI device and connection manager initialized
[    0.067565] Bluetooth: HCI socket layer initialized
[    0.067579] Bluetooth: L2CAP socket layer initialized
[    0.067615] Bluetooth: SCO socket layer initialized
[    0.068490] clocksource: Switched to clocksource ipipe_tsc
[    0.068592] VFS: Disk quotas dquot_6.6.0
[    0.068662] VFS: Dquot-cache hash table entries: 1024 (order 0, 4096 
bytes)
[    0.076791] NET: Registered protocol family 2
[    0.077372] TCP established hash table entries: 8192 (order: 3, 32768 
bytes)
[    0.077461] TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
[    0.077605] TCP: Hash tables configured (established 8192 bind 8192)
[    0.077815] UDP hash table entries: 512 (order: 2, 16384 bytes)
[    0.077862] UDP-Lite hash table entries: 512 (order: 2, 16384 bytes)
[    0.078054] NET: Registered protocol family 1
[    0.078552] RPC: Registered named UNIX socket transport module.
[    0.078564] RPC: Registered udp transport module.
[    0.078574] RPC: Registered tcp transport module.
[    0.078583] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.079359] hw perfevents: no interrupt-affinity property for 
/soc/pmu, guessing.
[    0.079613] hw perfevents: enabled with armv7_cortex_a9 PMU driver, 7 
counters available
[    0.080925] [Xenomai] scheduling class idle registered.
[    0.080936] [Xenomai] scheduling class rt registered.
[    0.081181] I-pipe: head domain Xenomai registered.
[    0.083885] [Xenomai] Cobalt v3.0.7 (Lingering Dawn)
[    0.083987] workingset: timestamp_bits=30 max_order=18 bucket_order=0
[    0.088810] NFS: Registering the id_resolver key type
[    0.088850] Key type id_resolver registered
[    0.088860] Key type id_legacy registered
[    0.088905] jffs2: version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
[    0.089328] fuse init (API version 7.26)
[    0.094536] bounce: pool size: 64 pages
[    0.094558] io scheduler noop registered
[    0.094568] io scheduler deadline registered
[    0.094785] io scheduler cfq registered (default)
[    0.094798] io scheduler mq-deadline registered
[    0.094807] io scheduler kyber registered
[    0.097088] pwm-backlight backlight-lvds: backlight-lvds supply power 
not found, using dummy regulator
[    0.098906] imx-sdma 20ec000.sdma: Direct firmware load for 
imx/sdma/sdma-imx6q.bin failed with error -2
[    0.098924] imx-sdma 20ec000.sdma: external firmware not found, using 
ROM firmware
...






* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
@ 2018-07-04 23:14 ` Tom Evans
  2018-07-05  9:14 ` Philippe Gerum
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Tom Evans @ 2018-07-04 23:14 UTC (permalink / raw)
  To: xenomai, f.sbalchiero

On 05/07/18 03:06, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).

Could I suggest compiling and running this? It measures the multiple levels of
cache in a computer and their latencies:

https://homepages.cwi.nl/~manegold/Calibrator/

It might also be worth running after changing the CPU setup, to see whether it
finds any differences.
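A rough sketch of building and running it; the argument order (CPU clock in
MHz, largest memory area to scan, output file prefix) is my reading of the
Calibrator documentation, and the 996 MHz clock is an assumption for this
board, so please check the page above:

# calibrator.c is the single source file linked from the page above
gcc -O2 -o calibrator calibrator.c -lm
# assumed usage: calibrator <CPU MHz> <max area> <output prefix>
./calibrator 996 16M imx6q-caches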




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
  2018-07-04 23:14 ` Tom Evans
@ 2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
                     ` (2 more replies)
       [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
  2018-07-06  9:04 ` Philippe Gerum
  3 siblings, 3 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-05  9:14 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).
> 
> I'm looking for comments or suggestions.
> 

A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
your hw and let us know whether you see the same regression?

TIA,

-- 
Philippe.



* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
@ 2018-07-05  9:50   ` Federico Sbalchiero
  2018-07-06  8:41   ` Federico Sbalchiero
  2018-07-06  9:09   ` Federico Sbalchiero
  2 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-05  9:50 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsens a lot when running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>
Yes, I'll try it this evening (out of office today).



* Re: [Xenomai] i.MX6q memory write causes high latency
       [not found]   ` <CALLqZ8Q3uZm8jTmtJLHR=Yz3Cw5g+CGTrXaj-oUFPT=1Rh0JbA@mail.gmail.com>
@ 2018-07-06  8:07     ` Federico Sbalchiero
  2018-07-06  8:32       ` Philippe Gerum
  2018-07-08  9:41       ` Philippe Gerum
  0 siblings, 2 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  8:07 UTC (permalink / raw)
  To: xenomai

Adding a break at line 837 in arch/arm/mm/cache-l2x0.c enables L2
write allocate:

[    0.000000] L2C-310 errata 752271 769419 enabled
[    0.000000] L2C-310 enabling early BRESP for Cortex-A9
[    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
[    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
[    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
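For what it's worth, the original boot log shows that the I-pipe patch also
exposes this as a boot parameter ("l2x0_write_allocate= not specified,
defaults to 0"), so the same effect should be reachable without patching the
source. A sketch, assuming U-Boot and that the parameter takes 0/1:

# append the I-pipe switch to the kernel command line from the boot log
setenv bootargs 'console=ttymxc0,115200 root=PARTUUID=1bfc1072-02 rootwait rw l2x0_write_allocate=1'
saveenv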


Latency under load (four memwrite instances) is better but still high.

RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665

On 04/07/2018 21:31, Greg Gallagher wrote:
> I did a quick test here, and I do think we are mis-configuring the
> cache. If you have time, try my suggestion in the first email. I'll
> ping Philippe as well.
>
> -Greg
>
> On Wed, Jul 4, 2018 at 3:06 PM, Greg Gallagher <greg@embeddedgreg.com> wrote:
>> Disregard my last email, I read the statement wrong. I'll keep looking.
>>
>> On Wed, Jul 4, 2018 at 1:06 PM, Federico Sbalchiero
>> <f.sbalchiero@sferapro.com> wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsens a lot when running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>>> Thanks,
>>> Federico
>>>
>>>
>>> [memwrite test case, latency figures, kernel config and boot log snipped;
>>>  identical to the original message at the top of the thread]
>




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:07     ` Federico Sbalchiero
@ 2018-07-06  8:32       ` Philippe Gerum
  2018-07-08 13:45         ` Philippe Gerum
  2018-07-08  9:41       ` Philippe Gerum
  1 sibling, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  8:32 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
> write allocate:
> 
> [    0.000000] L2C-310 errata 752271 769419 enabled
> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
> 
> 
> latency under load (four memwrite instances) is better but still high.
> 
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
> 

According to my latest tests, waiting for operations to complete in the
cache unit induces most of the delay. I'm under the impression that the
way we deal with the outer L2 cache is obsolete, based on past
assumptions which may not be valid anymore. Typically, some of them
would involve events that might occur with VIVT caches, which we don't
support in 4.14.

The whole logic requires a fresh review. I'll follow up on this.

-- 
Philippe.



* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
@ 2018-07-06  8:41   ` Federico Sbalchiero
  2018-07-06  9:10     ` Philippe Gerum
  2018-07-06  9:09   ` Federico Sbalchiero
  2 siblings, 1 reply; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  8:41 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsens a lot when running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>

Kernel 3.18.20-ipipe + Xenomai 3.0.7:

Latency under load (four memwrite instances):
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     24.985|     41.374|     76.351|       0| 0|     24.985|     76.351
RTD|     26.889|     41.203|     68.070|       0| 0|     24.985|     76.351
RTD|     22.828|     41.376|     67.681|       0| 0|     22.828|     76.351
RTD|     20.969|     41.043|     74.143|       0| 0|     20.969|     76.351
RTD|     27.027|     41.441|     68.037|       0| 0|     20.969|     76.351
RTD|     24.413|     41.585|     81.062|       0| 0|     20.969|     81.062
RTD|     27.234|     41.168|     76.516|       0| 0|     20.969|     81.062
RTD|     23.779|     41.141|     70.466|       0| 0|     20.969|     81.062
RTD|     24.824|     41.273|     75.322|       0| 0|     20.969|     81.062
RTD|     25.627|     41.195|     71.157|       0| 0|     20.969|     81.062
RTD|     28.874|     41.089|     66.579|       0| 0|     20.969|     81.062
RTD|     26.672|     41.638|     75.995|       0| 0|     20.969|     81.062
RTD|     25.139|     41.040|     69.543|       0| 0|     20.969|     81.062
RTD|     26.215|     41.099|     66.336|       0| 0|     20.969|     81.062
RTD|     24.192|     41.117|     76.828|       0| 0|     20.969|     81.062
RTD|     27.310|     41.942|     79.888|       0| 0|     20.969|     81.062
RTD|     24.348|     40.955|     66.484|       0| 0|     20.969|     81.062
RTD|     26.679|     41.260|     80.242|       0| 0|     20.969|     81.062
RTD|     26.820|     41.251|     74.986|       0| 0|     20.969|     81.062
RTD|     27.635|     41.301|     73.961|       0| 0|     20.969|     81.062
RTD|     26.877|     41.305|     72.789|       0| 0|     20.969|     81.062




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
                   ` (2 preceding siblings ...)
       [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
@ 2018-07-06  9:04 ` Philippe Gerum
  2018-07-06 10:35   ` Federico Sbalchiero
  3 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:04 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).
> 
> I'm looking for comments or suggestions.
> 
> Thanks,
> Federico
> 
> 
> "memwrite" test case:
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> unsigned char *buffer;
> int main(int argc, char **argv)
> {
>     int i;
>     int count = 0;
>     int n;
>     int size = 10 * 1024 * 1024;
>     volatile unsigned *pt;
>     printf("load system by writing in memory\n");
>     buffer = malloc(size);
>     if (buffer == NULL) {
>         printf("buffer allocation failed\n");
>         exit(1);
>     }
>     n = size / sizeof(unsigned);
>     while (1) {
>         // write some data to memory buffer
>         pt = (unsigned *) buffer;
>         for (i = 0; i < n; i++)
>             *pt++ = i;
>         count++;
>     }
>     return 0;
> }
> 
> xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
> RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
> RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
> RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
> RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
> RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
> RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
> RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
> RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
> RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
> RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
> RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
> RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
> RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
> RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
> RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
> RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
> RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664
> 
> xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
> RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
> RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
> RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
> RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
> RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
> RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
> RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
> RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
> RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
> RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
> RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
> RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
> RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
> RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
> RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
> RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
> RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
> RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
> RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
> RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767
> 

Worst-case on 2.6.5 + 3.18.20 is 67 us here, after 10 hrs runtime on
imx6q - definitely not 30 us - stressing the latency test with:

- dd loop (zero -> null, 16M bs)
- switchtest -s 200

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
  2018-07-06  8:41   ` Federico Sbalchiero
@ 2018-07-06  9:09   ` Federico Sbalchiero
  2018-07-06  9:12     ` Philippe Gerum
  2 siblings, 1 reply; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  9:09 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsen a lot running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>

In the same configuration (kernel 3.18.20-ipipe + Xenomai 3.0.7),
dd if=/dev/zero of=/dev/null has almost no effect on latency.
I think all the data is written to a few small buffers, so the L2 cache
is not stressed.

Latency under load (four dd if=/dev/zero of=/dev/null instances):

RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      1.093|      2.078|      7.757|       0| 0|      1.093|      7.757
RTD|      1.161|      2.090|      5.391|       0| 0|      1.093|      7.757
RTD|      1.128|      2.079|      6.537|       0| 0|      1.093|      7.757
RTD|      1.155|      2.140|      9.926|       0| 0|      1.093|      9.926
RTD|      1.117|      2.086|      9.483|       0| 0|      1.093|      9.926
RTD|      1.130|      2.079|      6.362|       0| 0|      1.093|      9.926
RTD|      1.165|      2.069|      5.506|       0| 0|      1.093|      9.926
RTD|      1.119|      2.066|      5.304|       0| 0|      1.093|      9.926
RTD|      1.202|      2.092|     10.538|       0| 0|      1.093|     10.538
RTD|      1.139|      2.071|      5.649|       0| 0|      1.093|     10.538
RTD|      1.169|      2.080|      5.604|       0| 0|      1.093|     10.538
RTD|      1.134|      2.065|      5.586|       0| 0|      1.093|     10.538
RTD|      1.027|      2.073|      5.452|       0| 0|      1.027|     10.538
RTD|      1.103|      2.188|      9.456|       0| 0|      1.027|     10.538
RTD|      1.070|      2.075|      5.694|       0| 0|      1.027|     10.538
RTD|      1.140|      2.074|      7.459|       0| 0|      1.027|     10.538
RTD|      1.135|      2.225|      9.373|       0| 0|      1.027|     10.538
RTD|      1.168|      2.081|      6.486|       0| 0|      1.027|     10.538
RTD|      1.062|      2.106|      9.617|       0| 0|      1.027|     10.538
RTD|      1.127|      2.161|      9.284|       0| 0|      1.027|     10.538
RTD|      1.079|      2.092|      5.480|       0| 0|      1.027|     10.538


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:41   ` Federico Sbalchiero
@ 2018-07-06  9:10     ` Philippe Gerum
  0 siblings, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:10 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:41 AM, Federico Sbalchiero wrote:
> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their
>>> job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsen a lot running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>> your hw and let us know whether you see the same regression?
>>
>> TIA,
>>
> 
> kernel 3.18.20-ipipe + xenomai 3.0.7
> 
> latency under load (four memwrite instances)
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     24.985|     41.374|     76.351|       0| 0|     24.985|     76.351
> RTD|     26.889|     41.203|     68.070|       0| 0|     24.985|     76.351
> RTD|     22.828|     41.376|     67.681|       0| 0|     22.828|     76.351
> RTD|     20.969|     41.043|     74.143|       0| 0|     20.969|     76.351
> RTD|     27.027|     41.441|     68.037|       0| 0|     20.969|     76.351
> RTD|     24.413|     41.585|     81.062|       0| 0|     20.969|     81.062
> RTD|     27.234|     41.168|     76.516|       0| 0|     20.969|     81.062
> RTD|     23.779|     41.141|     70.466|       0| 0|     20.969|     81.062
> RTD|     24.824|     41.273|     75.322|       0| 0|     20.969|     81.062
> RTD|     25.627|     41.195|     71.157|       0| 0|     20.969|     81.062
> RTD|     28.874|     41.089|     66.579|       0| 0|     20.969|     81.062
> RTD|     26.672|     41.638|     75.995|       0| 0|     20.969|     81.062
> RTD|     25.139|     41.040|     69.543|       0| 0|     20.969|     81.062
> RTD|     26.215|     41.099|     66.336|       0| 0|     20.969|     81.062
> RTD|     24.192|     41.117|     76.828|       0| 0|     20.969|     81.062
> RTD|     27.310|     41.942|     79.888|       0| 0|     20.969|     81.062
> RTD|     24.348|     40.955|     66.484|       0| 0|     20.969|     81.062
> RTD|     26.679|     41.260|     80.242|       0| 0|     20.969|     81.062
> RTD|     26.820|     41.251|     74.986|       0| 0|     20.969|     81.062
> RTD|     27.635|     41.301|     73.961|       0| 0|     20.969|     81.062
> RTD|     26.877|     41.305|     72.789|       0| 0|     20.969|     81.062
> 
> 

Ok, if all goes well, we should soon be able to see the worst-case
latency drop to ~65 us under high mm stress on i.MX6q over 4.14; the
fewer the cores, the better the results with the i.MX6 series.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:09   ` Federico Sbalchiero
@ 2018-07-06  9:12     ` Philippe Gerum
  2018-07-06  9:14       ` Philippe Gerum
  0 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:12 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 11:09 AM, Federico Sbalchiero wrote:
> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their
>>> job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsen a lot running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>> your hw and let us know whether you see the same regression?
>>
>> TIA,
>>
> 
> in the same configuration (kernel 3.18.20-ipipe + xenomai 3.0.7)
> dd if=/dev/zero of=/dev/null has almost no effect on latency.
> I think all data write to a few small buffers, not stressing L2
> cache.
> 

You need to set a large block size for dd.


-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:12     ` Philippe Gerum
@ 2018-07-06  9:14       ` Philippe Gerum
  0 siblings, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:14 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 11:12 AM, Philippe Gerum wrote:
> On 07/06/2018 11:09 AM, Federico Sbalchiero wrote:
>> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>>> Hi,
>>>> first I want to say thanks to everyone involved in Xenomai for their
>>>> job.
>>>>
>>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>>> This is after disabling power management, frequency scaling, CMA,
>>>> graphics, tracing, debug.
>>>>
>>>> I have found that a simple non-realtime user space process writing a
>>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>>> Latency worsen a lot running a copy of the process on each core.
>>>> There is a correlation between buffer size and cache size suggesting
>>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>>> list, but I can confirm L2 WA is disabled (see log).
>>>>
>>>> I'm looking for comments or suggestions.
>>>>
>>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>>> your hw and let us know whether you see the same regression?
>>>
>>> TIA,
>>>
>>
>> in the same configuration (kernel 3.18.20-ipipe + xenomai 3.0.7)
>> dd if=/dev/zero of=/dev/null has almost no effect on latency.
>> I think all data write to a few small buffers, not stressing L2
>> cache.
>>
> 
> You need to set a large block size for dd.
> 
> 

bs=16M suffices to raise the worst-case quite significantly on i.MX6QP
(sabresd) here.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:04 ` Philippe Gerum
@ 2018-07-06 10:35   ` Federico Sbalchiero
  0 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06 10:35 UTC (permalink / raw)
  To: xenomai

On 06/07/2018 11:04, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsen a lot running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
>> Thanks,
>> Federico
>>
>>
>> "memwrite" test case:
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <string.h>
>> unsigned char *buffer;
>> int main(int argc, char **argv)
>> {
>>      int i;
>>      int count = 0;
>>      int n;
>>      int size = 10 * 1024 * 1024;
>>      volatile unsigned *pt;
>>      printf("load system by writing in memory\n");
>>      buffer = malloc(size);
>>      if (buffer == NULL) {
>>          printf("buffer allocation failed\n");
>>          exit(1);
>>      }
>>      n = size / sizeof(unsigned);
>>      while (1) {
>>          // write some data to memory buffer
>>          pt = (unsigned *) buffer;
>>          for (i = 0; i < n; i++)
>>              *pt++ = i;
>>          count++;
>>      }
>>      return 0;
>> }
>>
>> xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
>> RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
>> RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
>> RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
>> RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
>> RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
>> RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
>> RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
>> RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
>> RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
>> RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
>> RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
>> RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
>> RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
>> RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
>> RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
>> RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
>> RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664
>>
>> xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
>> RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
>> RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
>> RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
>> RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
>> RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
>> RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
>> RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
>> RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
>> RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
>> RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
>> RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
>> RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
>> RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
>> RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
>> RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
>> RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
>> RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
>> RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
>> RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
>> RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767
>>
> Worst-case on 2.6.5 + 3.18.20 is 67 us here, after 10 hrs runtime on
> imx6q - definitely not 30 us - stressing the latency test with:
>
> - dd loop (zero -> null, 16M bs)
> - switchtest -s 200
>
The 30 us worst case is over a very short run (1-2 minutes) with just
one instance of memwrite in the background.
Using the dd loop and switchtest gives 50 us in the short term. I
suppose this compares reasonably with the 67 us seen after 10 hours.

I also confirm that
dd if=/dev/zero of=/dev/null bs=16M
has the same effect on latency as memwrite. Thanks.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:07     ` Federico Sbalchiero
  2018-07-06  8:32       ` Philippe Gerum
@ 2018-07-08  9:41       ` Philippe Gerum
  1 sibling, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-08  9:41 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
> write allocate:
> 
> [    0.000000] L2C-310 errata 752271 769419 enabled
> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
> 

Adding a break at this location defeats the purpose of the code: you
want it to fall through in order to update the control bits, as seen
from the value of the auxiliary control register above. With the break,
0x76470001 shows bit 23 (0x800000) cleared, whereas it should be set in
order to disable fetch-on-write upon write misses. FWIW, I don't see
any issue in the original code, although the logic should be made more
obvious.
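
To make that concrete, decoding bit 23 in the two AUX_CTRL values seen
in this thread gives the following; a minimal user-space sketch, where
the macro name is made up here and only the 0x800000 bit and the
register values come from the logs:

#include <stdio.h>

/* PL310 AUX_CTRL bit 23: "force no write allocate" (0x800000). */
#define L310_FORCE_NO_WA	0x800000u

static void decode(unsigned int aux)
{
	printf("AUX_CTRL 0x%08x: write-allocate %s\n", aux,
	       (aux & L310_FORCE_NO_WA) ? "forced off (bit 23 set)"
					 : "left enabled (bit 23 clear)");
}

int main(void)
{
	decode(0x76470001);	/* value logged after adding the break */
	decode(0x76c70001);	/* value from the WA=0 boot logs in this thread */
	return 0;
}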

> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
<snip>
> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665

So these figures above were actually obtained with write_allocate=1;
they are consistent with the results obtained with cache units older
than r3p2. Your cache unit advertises itself as r3p1-50rel0, where
enabling write-allocate leads to poor, even ugly, performance.

If you leave the test running for several hours under a proper load, I
believe you should see the latency figures skyrocket way above 100 us.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Xenomai]  i.MX6q memory write causes high latency
  2018-07-06  8:32       ` Philippe Gerum
@ 2018-07-08 13:45         ` Philippe Gerum
  2018-07-16  9:05           ` Federico Sbalchiero
  0 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-08 13:45 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:32 AM, Philippe Gerum wrote:
> On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
>> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
>> write allocate:
>>
>> [    0.000000] L2C-310 errata 752271 769419 enabled
>> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
>> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
>> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
>> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>>
>>
>> latency under load (four memwrite instances) is better but still high.
>>
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat
>> best|--lat worst
>> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
>> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
>> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
>> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
>> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
>> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
>> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
>> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
>> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
>> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
>> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
>> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
>> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
>> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
>> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
>> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
>> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
>> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
>> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
>> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
>> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
>>
> 
> According to my latest tests, waiting for operations to complete in the
> cache unit induces most of the delay. I'm under the impression that the
> way we deal with the outer L2 cache is obsolete, based on past
> assumptions which may not be valid anymore. Typically, some of them
> would involve events that might occur with VIVT caches, which we don't
> support in 4.14.
> 
> The whole logic requires a fresh review. I'll follow up on this.
> 

I ran extensive tests on two SoCs equipped with PL310 L2 caches: an
i.MX6QP (sabresd) and a VIA pico-ITX, which is also a quad-core i.MX6Q
(although much older).

Background stress load while sampling latencies:

- while :; do dd if=/dev/zero of=/dev/null bs=16M; done&
- switchtest -s 200
- ethernet bandwidth testing with iperf to and from the SoC, only for
the purpose of hammering the system with lots of DMA transfers via the
FEC driver, which in turn causes a continuous flow of l2 cache
maintenance operations for cleaning / invalidating ranges of DMA-ed
cachelines.

With the very same Xenomai 3.0.7 over kernel 4.14.36 configuration
(I-pipe/4.14 commit [1]), SMP, all debug switches and tracers disabled,
only toggling l2x0_write_allocate on/off, the results were:

==========================================================
VIA (996 Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P1_50REL0)
==========================================================

* kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1

L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTD|      4.212|      9.852|     57.243|       0|     0|
07:07:43/07:07:43

-----------------------------------------------------------------------------

* kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1

L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.996|     16.472|     93.579|       0|     0|
03:12:12/03:12:12

=====================================================
IMX6QP (996Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P2)
=====================================================

* kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1

L2C: I-pipe: revision >= L310-r3p2 detected, forcing WA.
L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
L2C-310 erratum 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76470001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.516|     14.070|     71.581|       0|     0|
03:28:03/03:28:03

-----------------------------------------------------------------------------

* kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1

L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
L2C-310 erratum 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76c70001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.332|     14.991|     77.969|       0|     0|
09:55:24/09:55:24

Some (partial) conclusions drawn from what I have been seeing here:

1. The latency penalty of enabling write-allocate on PL310 caches seems
to have decreased since R3P2. Conversely, R3P1_50REL0 and earlier have
better latency figures when write-allocate is disabled (I seem to
remember that early sabrelite boards would even show pathological
figures, in the 300 us range, with WA=1). However, disabling WA for
PL310 cache revs >= R3P2 actually seems counter-productive, since it
slows down memory accesses uselessly with no upside. It might even
cause cache coherency issues on SMP, as observed with some SoCs which
died while booting the kernel in this configuration.

2. The spinlock defined in the PL2xx/3xx L2 cache driver serializes
non-atomic maintenance operation requests on the cache unit. Converting
it to a hard lock is not required, as we cannot run any outer cache
maintenance operation from primary mode; all callers belong to the root
stage (including handle_pte_fault()). There is no issue in being
preempted by out-of-band code while performing such an operation, so
virtually disabling interrupts has to be enough. Conversely, hard
locking increases latency, since it hard-disables IRQs. The contexts
that would require hard locking are infrequently seen these days,
though:

- all operations on PL220 caches
- PL310 with errata 588369 && id < R2P0
- PL310 with errata 727915 && id >= R2P0 && id < R3P1

>> [    0.000000] L2C-310 errata 752271 769419 enabled

According to the errata advertised by your hardware, you should not be
affected by such a locking issue.

All other combinations of cache types, revisions and errata do run
locklessly already, so they can't be affected by such hard locking. The
change introducing the hard lock in arch/arm/mm/cache-l2x0.c should be
reverted. As a consequence of this, limiting the bulk operations to 512
cache lines at a time would not be needed anymore either, since those
potentially lengthy operations could still be preempted by real-time
activities.

Tip #1: mind the CPU frequency when comparing tests on similar hardware.
If CPU_FREQ is off, some processors may be left running by the bootloader
at a lower speed than they are capable of. CONFIG_ARM_IMX6Q_CPUFREQ can
be enabled with the "performance" CPU_FREQ governor to make sure the
highest speed is picked and remains stable over time, so as not to
confuse Xenomai timings. Disabling CPU_FREQ entirely has been an ancient
mantra for configuring Xenomai in the past; maybe the message should
evolve, because things may not be that clear-cut depending on the SoC.
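
A quick way to double-check is to read the current frequency from the
cpufreq sysfs interface before and during a run; a minimal sketch,
assuming the usual scaling_cur_freq path is exposed by your kernel:

#include <stdio.h>

/* Print the current CPU0 clock as reported by cpufreq (kHz in sysfs). */
int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
	unsigned long khz;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu", &khz) != 1) {
		fprintf(stderr, "unexpected content in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("cpu0 running at %lu MHz\n", khz / 1000);
	return 0;
}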

Tip #2: any test which does not run for hours under significant load is
unlikely to deliver any meaningful figure, at least not on my SoCs. On a
couple of occasions, the worst-case latency was reached only after 2h+
of runtime.

So, I would assume that a ~70 us worst case should be achievable under
high load on a typical i.MX6Q with WA=1 on L2 cache revision >= R3P2,
running at 1 GHz. For earlier revisions, WA=0 may be required to reach
this target, possibly with even lower/better figures, but there is no
guarantee that the results seen on the VIA SoC with such settings can
be generalized.

If anyone has different or similar results/conclusions about the impact
of l2 with the i.MX6Q series, please let us know.

Thanks,

[1]
https://git.xenomai.org/ipipe-arm/commit/a1bd0cc70391df28ad6678a86c77ce0bf275b5cd

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-08 13:45         ` Philippe Gerum
@ 2018-07-16  9:05           ` Federico Sbalchiero
  0 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-16  9:05 UTC (permalink / raw)
  To: xenomai

On 08/07/2018 15:45, Philippe Gerum wrote:
> On 07/06/2018 10:32 AM, Philippe Gerum wrote:
>> On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
>>> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
>>> write allocate:
>>>
>>> [    0.000000] L2C-310 errata 752271 769419 enabled
>>> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
>>> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>>> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
>>> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>>> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
>>> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>>>
>>>
>>> latency under load (four memwrite instances) is better but still high.
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
>>> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
>>> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
>>> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
>>> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
>>> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
>>> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
>>> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
>>> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
>>> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
>>> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
>>> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
>>> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
>>> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
>>> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
>>> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
>>> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
>>> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
>>> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
>>> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
>>> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
>>>
>> According to my latest tests, waiting for operations to complete in the
>> cache unit induces most of the delay. I'm under the impression that the
>> way we deal with the outer L2 cache is obsolete, based on past
>> assumptions which may not be valid anymore. Typically, some of them
>> would involve events that might occur with VIVT caches, which we don't
>> support in 4.14.
>>
>> The whole logic requires a fresh review. I'll follow up on this.
>>
> I ran extensive tests on two SoCs equipped with PL310 l2 caches,
> i.MX6QP (sabresd) and a VIA pico-ITX which is also a quad-core i.MX6Q
> (much older though).
>
> Background stress load while sampling latencies:
>
> - while :; do dd if=/dev/zero of=/dev/null bs=16M; done&
> - switchtest -s 200
> - ethernet bandwidth testing with iperf to and from the SoC, only for
> the purpose of hammering the system with lots of DMA transfers via the
> FEC driver, which in turn causes a continuous flow of l2 cache
> maintenance operations for cleaning / invalidating ranges of DMA-ed
> cachelines.
>
> With the very same Xenomai 3.0.7 over kernel 4.14.36 configuration
> (I-pipe/4.14 commit [1]), SMP, all debug switches and tracers disabled,
> only toggling l2x0_write_allocate on/off, the results were:
>
> ==========================================================
> VIA (996 Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P1_50REL0)
> ==========================================================
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTD|      4.212|      9.852|     57.243|       0|     0|
> 07:07:43/07:07:43
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      0.996|     16.472|     93.579|       0|     0|
> 03:12:12/03:12:12
>
> =====================================================
> IMX6QP (996Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P2)
> =====================================================
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: revision >= L310-r3p2 detected, forcing WA.
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      2.516|     14.070|     71.581|       0|     0|
> 03:28:03/03:28:03
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      2.332|     14.991|     77.969|       0|     0|
> 09:55:24/09:55:24
>
> Some (partial) conclusions drawn from what I have been seeing here:
>
> 1. The penalty with enabling write-allocate on PL310 caches with respect
> to latency seems to have decreased since R3P2. Conversely, R3P1_50REL0
> and earlier have better latency figures when write-allocate is disabled
> (I seem to remember than early sabrelite boards would even have
> pathological figures, in the 300 us range with WA=1). However, disabling
> WA for PL310 cache revs >= R3P2 seems actually counter-productive since
> it slows down memory accesses uselessly with no upside. This might even
> cause cache coherency issues on SMP as observed with some SoCs which
> died at booting the kernel in this configuration.
>
> 2. The spinlock defined in the PL2xx/3xx L2 cache driver serializes
> non-atomic maintenance operation requests on the cache unit. Converting
> it to a hard lock is not required, as we cannot run any outer cache
> maintenance operation from primary mode, all callers belong to the root
> stage (including handle_pte_fault()). There is no issue in being
> preempted by out-of-band code while performing such operation, virtually
> disabling interrupt has to be enough. Conversely, hard locking increases
> latency since it hard disables IRQs. Contexts that would be involved by
> hard locking are infrequently seen these days though:
>
> - all operations on PL220 caches
> - PL310 with errata 588369 && id < R2P0
> - PL310 with errata 727915 && id >= R2P0 && id < R3P1
>
>>> [    0.000000] L2C-310 errata 752271 769419 enabled
> According to the errata advertised by your hardware, you should not be
> affected by such locking issue.
>
> All other combinations of cache types, revisions and errata do run
> locklessly already, so they can't be affected by such hard locking. The
> change introducing the hard lock in arch/arm/mm/cache-l2x0.c should be
> reverted. As a consequence of this, limiting the bulk operations to 512
> cache lines at a time would not be needed anymore either, since those
> potentially lengthy operations could still be preempted by real-time
> activities.
>
> Tip #1: care for CPU frequency when comparing tests on similar hardware.
> If CPU_FREQ is off, some processors may be left running at a lower speed
> by the bootloader than they are capable of. CONFIG_ARM_IMX6Q_CPUFREQ can
> be enabled with the "performance" CPU_FREQ governor to make sure the
> highest speed is picked, and remain stable over time not to confuse
> Xenomai timings. Disabling CPU_FREQ entirely has been an ancient mantra
> for configuring Xenomai in the past, maybe the message should evolve
> because things may not be that clear-cut depending on the SoC.
>
> Tip #2: any test which does not run for hours under significant load is
> unlikely to deliver any meaningful figure, at least not on my SoCs. In a
> couple of occasions, the worst-case latency was reached after 2h+ of
> runtime.
>
> So, I would assume that ~70 us worst-case should be achievable under
> high load on a typical i.MX6Q with WA=1 on l2 cache revision >= R3P2,
> running at 1Ghz. For earlier revisions, WA=0 may be required for
> reaching this target, possibly even less/better, but there is no
> guarantee that the results seen on the VIA SoC with such settings could
> be generalized though.
>
> If anyone has different or similar results/conclusions about the impact
> of l2 with the i.MX6Q series, please let us know.
>
> Thanks,
>
> [1]
> https://git.xenomai.org/ipipe-arm/commit/a1bd0cc70391df28ad6678a86c77ce0bf275b5cd
>
I did some more testing on the sabresd, comparing WA=0 to WA=1 and the
hard vs. soft lock.

I'm not very familiar with kernel locking, but I made the following
changes in arch/arm/mm/cache-l2x0.c to convert the hard lock into a
soft lock (roughly as in the sketch below):
- defined l2x0_lock using IPIPE_DEFINE_SPINLOCK
- replaced all raw_spin_lock functions with their spin_lock equivalents
- removed the CACHE_RANGE_ATOMIC_MAX definition and the related
unlock/lock sequences in the cache bulk operations
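
The resulting pattern looks roughly like this; a sketch only, not the
actual driver code: the helper name and line-operation callback are
placeholders, and only l2x0_lock, IPIPE_DEFINE_SPINLOCK and
CACHE_RANGE_ATOMIC_MAX are names taken from the discussion above.

/*
 * Sketch of the change described above: the lock is declared with
 * IPIPE_DEFINE_SPINLOCK and taken with the plain spin_lock_irqsave()
 * variants instead of raw_spin_lock_irqsave(), and the periodic
 * unlock/relock every CACHE_RANGE_ATOMIC_MAX lines is dropped, so a
 * bulk range operation runs as a single critical section that
 * real-time activity may still preempt.
 */
#define CACHE_LINE_SIZE		32	/* PL310 cache line size */

static IPIPE_DEFINE_SPINLOCK(l2x0_lock);

static void l2c_range_op_sketch(unsigned long start, unsigned long end,
				void (*line_op)(unsigned long pa))
{
	unsigned long flags;

	spin_lock_irqsave(&l2x0_lock, flags);
	while (start < end) {
		line_op(start);			/* clean/invalidate one line */
		start += CACHE_LINE_SIZE;
	}
	spin_unlock_irqrestore(&l2x0_lock, flags);
}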

Base configuration used for tests:
- Freescale sabresd board, i.MX6Q at 966 MHz
- kernel 4.14.36, ipipe-core-4.14.36-arm-1 [1]
- CPU_FREQ governor "performance", debug and trace disabled

dmesg with WA=0:
L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
L2C: DT/platform modifies aux control register: 0x32470000 -> 0x32c70000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001

dmesg with WA=1:
L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001

--------------------------------------------------------
     TEST 1
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.811|     13.025|     72.505|       0|     0|
03:38:14/03:38:14


--------------------------------------------------------
     TEST 2
--------------------------------------------------------
WA=1, hard l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.292|     13.216|     66.989|       0|     0|
04:08:19/04:08:19


--------------------------------------------------------
     TEST 3
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      4.612|     43.343|    101.798|       0|     0|
02:20:11/00:20:11


--------------------------------------------------------
     TEST 4
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|     -0.146|      9.265|     60.964|       0|     0|
08:11:29/08:11:29

--------------------------------------------------------
     TEST 5
--------------------------------------------------------
WA=1, soft l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.001|      9.508|     64.094|       0|     0|
05:02:38/05:02:38

--------------------------------------------------------
     TEST 6
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      5.093|     28.399|     74.389|       0|     0|
06:41:53/06:41:53


My conclusions:

1) Overall I see little correlation between WA settings and latency
on my hardware, possibly just noise.

2) Making cache bulk operations preemptible (as Philippe suggested)
improves latency and is a big win in the "4 dd stress" test.

3) This also confirms that this test case is very good at exposing L2
cache issues.

What's next?
Any comments/suggestions are welcome. If there are other ideas to
improve Xenomai/ipipe on i.MX6, I will try to explore them.

If someone is interested, I can post a "make l2x0 cache lock
preemptible" patch for further testing.



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
2018-07-04 23:14 ` Tom Evans
2018-07-05  9:14 ` Philippe Gerum
2018-07-05  9:50   ` Federico Sbalchiero
2018-07-06  8:41   ` Federico Sbalchiero
2018-07-06  9:10     ` Philippe Gerum
2018-07-06  9:09   ` Federico Sbalchiero
2018-07-06  9:12     ` Philippe Gerum
2018-07-06  9:14       ` Philippe Gerum
     [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
     [not found]   ` <CALLqZ8Q3uZm8jTmtJLHR=Yz3Cw5g+CGTrXaj-oUFPT=1Rh0JbA@mail.gmail.com>
2018-07-06  8:07     ` Federico Sbalchiero
2018-07-06  8:32       ` Philippe Gerum
2018-07-08 13:45         ` Philippe Gerum
2018-07-16  9:05           ` Federico Sbalchiero
2018-07-08  9:41       ` Philippe Gerum
2018-07-06  9:04 ` Philippe Gerum
2018-07-06 10:35   ` Federico Sbalchiero
