* [Xenomai]  i.MX6q memory write causes high latency
@ 2018-07-04 17:06 Federico Sbalchiero
  2018-07-04 23:14 ` Tom Evans
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-04 17:06 UTC (permalink / raw)
  To: xenomai

Hi,
first of all, thanks to everyone involved in Xenomai for their work.

I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on a Freescale/NXP i.MX6q
SABRE SD board using Yocto. The system boots fine and is stable, but latency
under load (xeno-test) is higher than on my reference system (Xenomai
2.6.5 on the Freescale 3.10.17 kernel + ipipe 3.10.18).
This is after disabling power management, frequency scaling, CMA,
graphics, tracing and debugging options.

I have found that a simple non-realtime user-space process writing to a
memory buffer (memwrite) is enough to trigger such high latencies.
Latency worsens a lot when running a copy of the process on each core.
There is a correlation between buffer size and cache size, suggesting
an L2 cache issue like the L2 write-allocate problem discussed on the mailing
list, but I can confirm L2 WA is disabled (see the log below).

I'm looking for comments or suggestions.

Thanks,
Federico


"memwrite" test case:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

unsigned char *buffer;

int main(int argc, char **argv)
{
    int i;
    int count = 0;                  /* number of passes over the buffer */
    int n;
    int size = 10 * 1024 * 1024;    /* 10 MiB, much larger than the 1 MiB L2 */
    volatile unsigned *pt;

    printf("load system by writing in memory\n");
    buffer = malloc(size);
    if (buffer == NULL) {
        printf("buffer allocation failed\n");
        exit(1);
    }
    n = size / sizeof(unsigned);
    while (1) {
        /* write some data to the memory buffer */
        pt = (unsigned *) buffer;
        for (i = 0; i < n; i++)
            *pt++ = i;
        count++;
    }
    return 0;                       /* never reached */
}
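For reference, here is a minimal sketch of how this load scenario can be
reproduced alongside the latency test. How the four instances were started is
not stated above, so the taskset pinning and the /usr/xenomai/bin install path
are assumptions:

# build the load generator (plain gcc, no Xenomai dependency)
gcc -O2 -o memwrite memwrite.c

# start one non-realtime writer per core (the i.MX6q has four cores)
for cpu in 0 1 2 3; do
    taskset -c $cpu ./memwrite &
done

# in another shell, run the Cobalt latency test
# (defaults: 1000 us periodic user-mode task, priority 99)
/usr/xenomai/bin/latency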

xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664

xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767

latency (no xeno-test!) with four memwrite instances on Xenomai 3.0.7 
and ipipe-arm/4.14:
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     41.999|     60.200|    101.333|       0|     0| 41.999|    101.333
RTD|     42.666|     60.471|    103.666|       0|     0| 41.999|    103.666
RTD|     36.999|     60.568|     96.999|       0|     0| 36.999|    103.666
RTD|     38.998|     60.535|     95.332|       0|     0| 36.999|    103.666
RTD|     35.332|     60.359|    105.998|       0|     0| 35.332|    105.998
RTD|     41.998|     60.762|    116.665|       0|     0| 35.332|    116.665
RTD|     40.664|     60.419|     96.331|       0|     0| 35.332|    116.665
RTD|     41.331|     60.678|     91.331|       0|     0| 35.332|    116.665
RTD|     38.997|     60.923|     94.664|       0|     0| 35.332|    116.665
RTD|     42.664|     60.505|     95.664|       0|     0| 35.332|    116.665
RTD|     40.330|     60.175|     95.664|       0|     0| 35.332|    116.665
RTD|     40.663|     60.368|     96.663|       0|     0| 35.332|    116.665
RTD|     42.330|     60.286|     95.330|       0|     0| 35.332|    116.665
RTD|     36.996|     60.304|    100.663|       0|     0| 35.332|    116.665
RTD|     41.996|     61.100|    103.663|       0|     0| 35.332|    116.665
RTD|     40.996|     60.792|    102.996|       0|     0| 35.332|    116.665
RTD|     38.662|     60.205|    104.329|       0|     0| 35.332|    116.665
RTD|     39.995|     60.645|     98.995|       0|     0| 35.332|    116.665
RTD|     37.328|     60.300|     99.662|       0|     0| 35.332|    116.665
RTD|     40.995|     60.206|     98.328|       0|     0| 35.332|    116.665
RTD|     36.994|     60.444|     95.661|       0|     0| 35.332|    116.665

Linux 4.14 menuconfig:
   CPU Power Management\CPU Frequency scaling -> no
   CPU Power Management\CPU Idle PM support -> no
   Power management options\Suspend to RAM and standby -> no
   Power management options\Device power management core functionality -> no
   Device Drivers\Graphics support\DRM Support for Freescale i.MX -> no
   Device Drivers\Graphics support\ETNAVIV (DRM support for Vivante GPU IP cores) -> no
   Kernel Features\Timer frequency -> 200 Hz
   Kernel Features\Allow for memory compaction -> no
   Kernel Features\Contiguous Memory Allocator -> no
   Kernel hacking\Tracers -> no
   Kernel hacking\Lock Debugging (spinlocks, mutexes, etc...) -> all no
   Kernel hacking\Stack backtrace support -> no
   Bus support\PCI support -> no
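A quick way to double-check on the running system that these options really
are out of the picture (a sketch; the sysfs nodes below only exist when the
corresponding subsystems are compiled in, and /proc/config.gz needs
CONFIG_IKCONFIG_PROC):

# cpufreq / cpuidle directories should be absent with scaling and idle PM disabled
ls /sys/devices/system/cpu/cpu0/cpufreq 2>/dev/null || echo "cpufreq: not built"
ls /sys/devices/system/cpu/cpu0/cpuidle 2>/dev/null || echo "cpuidle: not built"
# verify timer frequency and CMA from the running configuration
zcat /proc/config.gz | grep -E 'CONFIG_HZ=|CONFIG_CMA'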

Boot log:
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.14.36-fslc+gc664b053e5ce 
(oe-user@oe-host) (gcc version 7.3.0 (GCC)) #1 SMP Tue Jun 26 08:49:22 
UTC 2018
[    0.000000] CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), 
cr=10c5387d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing 
instruction cache
[    0.000000] OF: fdt: Machine model: Freescale i.MX6 Quad SABRE Smart 
Device Board
[    0.000000] Memory policy: Data cache writealloc
[    0.000000] On node 0 totalpages: 262144
[    0.000000] free_area_init_node: node 0, pgdat c0d62900, node_mem_map 
ef7f8000
[    0.000000]   Normal zone: 1536 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 196608 pages, LIFO batch:31
[    0.000000]   HighMem zone: 65536 pages, LIFO batch:15
[    0.000000] percpu: Embedded 19 pages/cpu @ef78a000 s46924 r8192 
d22708 u77824
[    0.000000] pcpu-alloc: s46924 r8192 d22708 u77824 alloc=19*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 260608
[    0.000000] Kernel command line: console=ttymxc0,115200 root=PARTUUID=1bfc1072-02 rootwait rw
[    0.000000] PID hash table entries: 4096 (order: 2, 16384 bytes)
[    0.000000] Dentry cache hash table entries: 131072 (order: 7, 524288 
bytes)
[    0.000000] Inode-cache hash table entries: 65536 (order: 6, 262144 
bytes)
[    0.000000] Memory: 1025492K/1048576K available (8192K kernel code, 
397K rwdata, 2220K rodata, 1024K init, 885K bss, 23084K reserved, 0K 
cma-reserved, 262144K highmem)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xffc00000 - 0xfff00000   (3072 kB)
[    0.000000]     vmalloc : 0xf0800000 - 0xff800000   ( 240 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xf0000000   ( 768 MB)
[    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
[    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
[    0.000000]       .text : 0xc0008000 - 0xc0900000   (9184 kB)
[    0.000000]       .init : 0xc0c00000 - 0xc0d00000   (1024 kB)
[    0.000000]       .data : 0xc0d00000 - 0xc0d63700   ( 398 kB)
[    0.000000]        .bss : 0xc0d65000 - 0xc0e4255c   ( 886 kB)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.000000] Hierarchical RCU implementation.
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] L2C: I-pipe: l2x0_write_allocate= not specified, defaults 
to 0 (disabled).
[    0.000000] L2C: DT/platform modifies aux control register: 
0x32470000 -> 0x32c70000
[    0.000000] L2C-310 errata 752271 769419 enabled
[    0.000000] L2C-310 enabling early BRESP for Cortex-A9
[    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
[    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
[    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
[    0.000000] I-pipe, 396.000 MHz timer
[    0.000000] Switching to timer-based delay loop, resolution 333ns
[    0.000007] sched_clock: 32 bits at 3000kHz, resolution 333ns, wraps 
every 715827882841ns
[    0.000027] clocksource: mxc_timer1: mask: 0xffffffff max_cycles: 
0xffffffff, max_idle_ns: 637086815595 ns
[    0.000042] I-pipe, 3.000 MHz clocksource, wrap in 1431655 ms
[    0.000059] clocksource: ipipe_tsc: mask: 0xffffffffffffffff 
max_cycles: 0x1623fa770, max_idle_ns: 881590404476 ns
[    0.001630] Interrupt pipeline (release #1)
[    0.001784] Console: colour dummy device 80x30
[    0.001816] Calibrating delay loop (skipped), value calculated using 
timer frequency.. 6.00 BogoMIPS (lpj=15000)
[    0.001833] pid_max: default: 32768 minimum: 301
[    0.001978] Mount-cache hash table entries: 2048 (order: 1, 8192 bytes)
[    0.002002] Mountpoint-cache hash table entries: 2048 (order: 1, 8192 
bytes)
[    0.002545] CPU: Testing write buffer coherency: ok
[    0.002902] CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
[    0.003310] Setting up static identity map for 0x10100000 - 0x10100078
[    0.003453] Hierarchical SRCU implementation.
[    0.004120] smp: Bringing up secondary CPUs ...
[    0.004748] I-pipe, 396.000 MHz timer
[    0.004770] CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
[    0.005491] I-pipe, 396.000 MHz timer
[    0.005512] CPU2: thread -1, cpu 2, socket 0, mpidr 80000002
[    0.006237] I-pipe, 396.000 MHz timer
[    0.006257] CPU3: thread -1, cpu 3, socket 0, mpidr 80000003
[    0.006373] smp: Brought up 1 node, 4 CPUs
[    0.006405] SMP: Total of 4 processors activated (24.00 BogoMIPS).
[    0.006415] CPU: All CPU(s) started in SVC mode.
[    0.007527] devtmpfs: initialized
[    0.016773] random: get_random_u32 called from 
bucket_table_alloc+0x100/0x230 with crng_init=0
[    0.016890] VFP support v0.3: implementor 41 architecture 3 part 30 
variant 9 rev 4
[    0.017029] clocksource: jiffies: mask: 0xffffffff max_cycles: 
0xffffffff, max_idle_ns: 9556302231375000 ns
[    0.017054] futex hash table entries: 1024 (order: 4, 65536 bytes)
[    0.017282] pinctrl core: initialized pinctrl subsystem
[    0.018032] NET: Registered protocol family 16
[    0.019872] DMA: preallocated 256 KiB pool for atomic coherent 
allocations
[    0.020697] CPU identified as i.MX6Q, silicon rev 1.1
[    0.026849] vdd1p1: supplied by regulator-dummy
[    0.027291] vdd3p0: supplied by regulator-dummy
[    0.027687] vdd2p5: supplied by regulator-dummy
[    0.037718] hw-breakpoint: found 5 (+1 reserved) breakpoint and 1 
watchpoint registers.
[    0.037731] hw-breakpoint: maximum watchpoint size is 4 bytes.
[    0.039110] imx6q-pinctrl 20e0000.iomuxc: initialized IMX pinctrl driver
[    0.060477] mxs-dma 110000.dma-apbh: initialized
[    0.063177] SCSI subsystem initialized
[    0.063409] libata version 3.00 loaded.
[    0.063601] usbcore: registered new interface driver usbfs
[    0.063658] usbcore: registered new interface driver hub
[    0.063755] usbcore: registered new device driver usb
[    0.064922] i2c i2c-0: IMX I2C adapter registered
[    0.064941] i2c i2c-0: can't use DMA, using PIO instead.
[    0.065483] i2c i2c-1: IMX I2C adapter registered
[    0.065501] i2c i2c-1: can't use DMA, using PIO instead.
[    0.065939] i2c i2c-2: IMX I2C adapter registered
[    0.065955] i2c i2c-2: can't use DMA, using PIO instead.
[    0.066077] media: Linux media interface: v0.10
[    0.066129] Linux video capture interface: v2.00
[    0.066188] pps_core: LinuxPPS API ver. 1 registered
[    0.066197] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 
Rodolfo Giometti <giometti@linux.it>
[    0.066219] PTP clock support registered
[    0.066476] Advanced Linux Sound Architecture Driver Initialized.
[    0.067494] Bluetooth: Core ver 2.22
[    0.067538] NET: Registered protocol family 31
[    0.067548] Bluetooth: HCI device and connection manager initialized
[    0.067565] Bluetooth: HCI socket layer initialized
[    0.067579] Bluetooth: L2CAP socket layer initialized
[    0.067615] Bluetooth: SCO socket layer initialized
[    0.068490] clocksource: Switched to clocksource ipipe_tsc
[    0.068592] VFS: Disk quotas dquot_6.6.0
[    0.068662] VFS: Dquot-cache hash table entries: 1024 (order 0, 4096 
bytes)
[    0.076791] NET: Registered protocol family 2
[    0.077372] TCP established hash table entries: 8192 (order: 3, 32768 
bytes)
[    0.077461] TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
[    0.077605] TCP: Hash tables configured (established 8192 bind 8192)
[    0.077815] UDP hash table entries: 512 (order: 2, 16384 bytes)
[    0.077862] UDP-Lite hash table entries: 512 (order: 2, 16384 bytes)
[    0.078054] NET: Registered protocol family 1
[    0.078552] RPC: Registered named UNIX socket transport module.
[    0.078564] RPC: Registered udp transport module.
[    0.078574] RPC: Registered tcp transport module.
[    0.078583] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.079359] hw perfevents: no interrupt-affinity property for 
/soc/pmu, guessing.
[    0.079613] hw perfevents: enabled with armv7_cortex_a9 PMU driver, 7 
counters available
[    0.080925] [Xenomai] scheduling class idle registered.
[    0.080936] [Xenomai] scheduling class rt registered.
[    0.081181] I-pipe: head domain Xenomai registered.
[    0.083885] [Xenomai] Cobalt v3.0.7 (Lingering Dawn)
[    0.083987] workingset: timestamp_bits=30 max_order=18 bucket_order=0
[    0.088810] NFS: Registering the id_resolver key type
[    0.088850] Key type id_resolver registered
[    0.088860] Key type id_legacy registered
[    0.088905] jffs2: version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
[    0.089328] fuse init (API version 7.26)
[    0.094536] bounce: pool size: 64 pages
[    0.094558] io scheduler noop registered
[    0.094568] io scheduler deadline registered
[    0.094785] io scheduler cfq registered (default)
[    0.094798] io scheduler mq-deadline registered
[    0.094807] io scheduler kyber registered
[    0.097088] pwm-backlight backlight-lvds: backlight-lvds supply power 
not found, using dummy regulator
[    0.098906] imx-sdma 20ec000.sdma: Direct firmware load for 
imx/sdma/sdma-imx6q.bin failed with error -2
[    0.098924] imx-sdma 20ec000.sdma: external firmware not found, using 
ROM firmware
...






* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
@ 2018-07-04 23:14 ` Tom Evans
  2018-07-05  9:14 ` Philippe Gerum
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Tom Evans @ 2018-07-04 23:14 UTC (permalink / raw)
  To: xenomai, f.sbalchiero

On 05/07/18 03:06, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).

Could I suggest compiling and running this? It measures the multiple levels of
cache in a computer and their latencies:

https://homepages.cwi.nl/~manegold/Calibrator/

It might also be worth running after changing the CPU setup, to see whether it
finds any differences.
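A rough sketch of building and running it; the argument order (CPU clock in
MHz, largest memory area to scan, output file prefix) is my reading of the
Calibrator documentation, and the 996 MHz clock is an assumption for this
board, so please check the page above:

# calibrator.c is the single source file linked from the page above
gcc -O2 -o calibrator calibrator.c -lm
# assumed usage: calibrator <CPU MHz> <max area> <output prefix>
./calibrator 996 16M imx6q-caches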




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
  2018-07-04 23:14 ` Tom Evans
@ 2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
                     ` (2 more replies)
       [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
  2018-07-06  9:04 ` Philippe Gerum
  3 siblings, 3 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-05  9:14 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).
> 
> I'm looking for comments or suggestions.
> 

A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
your hw and let us know whether you see the same regression?

TIA,

-- 
Philippe.



* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
@ 2018-07-05  9:50   ` Federico Sbalchiero
  2018-07-06  8:41   ` Federico Sbalchiero
  2018-07-06  9:09   ` Federico Sbalchiero
  2 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-05  9:50 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsens a lot when running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>
Yes, I'll try it this evening (out of office today).



* Re: [Xenomai] i.MX6q memory write causes high latency
       [not found]   ` <CALLqZ8Q3uZm8jTmtJLHR=Yz3Cw5g+CGTrXaj-oUFPT=1Rh0JbA@mail.gmail.com>
@ 2018-07-06  8:07     ` Federico Sbalchiero
  2018-07-06  8:32       ` Philippe Gerum
  2018-07-08  9:41       ` Philippe Gerum
  0 siblings, 2 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  8:07 UTC (permalink / raw)
  To: xenomai

Adding a break at line 837 in arch/arm/mm/cache-l2x0.c enables L2
write allocate:

[    0.000000] L2C-310 errata 752271 769419 enabled
[    0.000000] L2C-310 enabling early BRESP for Cortex-A9
[    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
[    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
[    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
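For what it's worth, the original boot log shows that the I-pipe patch also
exposes this as a boot parameter ("l2x0_write_allocate= not specified,
defaults to 0"), so the same effect should be reachable without patching the
source. A sketch, assuming U-Boot and that the parameter takes 0/1:

# append the I-pipe switch to the kernel command line from the boot log
setenv bootargs 'console=ttymxc0,115200 root=PARTUUID=1bfc1072-02 rootwait rw l2x0_write_allocate=1'
saveenv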


Latency under load (four memwrite instances) is better but still high.

RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665

On 04/07/2018 21:31, Greg Gallagher wrote:
> I did a quick test here, and I do think we are mis-configuring the
> cache. If you have time, try my suggestion in the first email. I'll
> ping Philippe as well.
>
> -Greg
>
> On Wed, Jul 4, 2018 at 3:06 PM, Greg Gallagher <greg@embeddedgreg.com> wrote:
>> Disregard my last email, I read the statement wrong. I'll keep looking.
>>
>> On Wed, Jul 4, 2018 at 1:06 PM, Federico Sbalchiero
>> <f.sbalchiero@sferapro.com> wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsens a lot when running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>>> Thanks,
>>> Federico
>>>
>>>
>>> [memwrite test case, latency figures, kernel config and boot log snipped;
>>>  identical to the original message at the top of the thread]
>




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:07     ` Federico Sbalchiero
@ 2018-07-06  8:32       ` Philippe Gerum
  2018-07-08 13:45         ` Philippe Gerum
  2018-07-08  9:41       ` Philippe Gerum
  1 sibling, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  8:32 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
> write allocate:
> 
> [    0.000000] L2C-310 errata 752271 769419 enabled
> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
> 
> 
> latency under load (four memwrite instances) is better but still high.
> 
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
> 

According to my latest tests, waiting for operations to complete in the
cache unit induces most of the delay. I'm under the impression that the
way we deal with the outer L2 cache is obsolete, based on past
assumptions which may not be valid anymore. Typically, some of them
would involve events that might occur with VIVT caches, which we don't
support in 4.14.

The whole logic requires a fresh review. I'll follow up on this.

-- 
Philippe.



* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
@ 2018-07-06  8:41   ` Federico Sbalchiero
  2018-07-06  9:10     ` Philippe Gerum
  2018-07-06  9:09   ` Federico Sbalchiero
  2 siblings, 1 reply; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  8:41 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsens a lot when running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>

Kernel 3.18.20-ipipe + Xenomai 3.0.7:

Latency under load (four memwrite instances):
RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|     24.985|     41.374|     76.351|       0| 0|     24.985|     76.351
RTD|     26.889|     41.203|     68.070|       0| 0|     24.985|     76.351
RTD|     22.828|     41.376|     67.681|       0| 0|     22.828|     76.351
RTD|     20.969|     41.043|     74.143|       0| 0|     20.969|     76.351
RTD|     27.027|     41.441|     68.037|       0| 0|     20.969|     76.351
RTD|     24.413|     41.585|     81.062|       0| 0|     20.969|     81.062
RTD|     27.234|     41.168|     76.516|       0| 0|     20.969|     81.062
RTD|     23.779|     41.141|     70.466|       0| 0|     20.969|     81.062
RTD|     24.824|     41.273|     75.322|       0| 0|     20.969|     81.062
RTD|     25.627|     41.195|     71.157|       0| 0|     20.969|     81.062
RTD|     28.874|     41.089|     66.579|       0| 0|     20.969|     81.062
RTD|     26.672|     41.638|     75.995|       0| 0|     20.969|     81.062
RTD|     25.139|     41.040|     69.543|       0| 0|     20.969|     81.062
RTD|     26.215|     41.099|     66.336|       0| 0|     20.969|     81.062
RTD|     24.192|     41.117|     76.828|       0| 0|     20.969|     81.062
RTD|     27.310|     41.942|     79.888|       0| 0|     20.969|     81.062
RTD|     24.348|     40.955|     66.484|       0| 0|     20.969|     81.062
RTD|     26.679|     41.260|     80.242|       0| 0|     20.969|     81.062
RTD|     26.820|     41.251|     74.986|       0| 0|     20.969|     81.062
RTD|     27.635|     41.301|     73.961|       0| 0|     20.969|     81.062
RTD|     26.877|     41.305|     72.789|       0| 0|     20.969|     81.062




* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
                   ` (2 preceding siblings ...)
       [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
@ 2018-07-06  9:04 ` Philippe Gerum
  2018-07-06 10:35   ` Federico Sbalchiero
  3 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:04 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
> Hi,
> first I want to say thanks to everyone involved in Xenomai for their job.
> 
> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
> sabresd board using Yocto. System boots fine and is stable, but latency
> under load (xeno-test) is higher than in my reference system (Xenomai
> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
> This is after disabling power management, frequency scaling, CMA,
> graphics, tracing, debug.
> 
> I have found that a simple non-realtime user space process writing a
> buffer in memory (memwrite) is able to trigger such high latencies.
> Latency worsens a lot when running a copy of the process on each core.
> There is a correlation between buffer size and cache size suggesting
> an L2 cache issue, like the L2 write allocate discussed in the mailing
> list, but I can confirm L2 WA is disabled (see log).
> 
> I'm looking for comments or suggestions.
> 
> Thanks,
> Federico
> 
> 
> "memwrite" test case:
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> unsigned char *buffer;
> int main(int argc, char **argv)
> {
>     int i;
>     int count = 0;
>     int n;
>     int size = 10 * 1024 * 1024;
>     volatile unsigned *pt;
>     printf("load system by writing in memory\n");
>     buffer = malloc(size);
>     if (buffer == NULL) {
>         printf("buffer allocation failed\n");
>         exit(1);
>     }
>     n = size / sizeof(unsigned);
>     while (1) {
>         // write some data to memory buffer
>         pt = (unsigned *) buffer;
>         for (i = 0; i < n; i++)
>             *pt++ = i;
>         count++;
>     }
>     return 0;
> }
> 
> xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
> RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
> RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
> RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
> RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
> RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
> RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
> RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
> RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
> RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
> RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
> RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
> RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
> RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
> RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
> RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
> RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
> RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
> RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664
> 
> xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
> RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
> RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
> RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
> RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
> RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
> RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
> RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
> RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
> RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
> RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
> RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
> RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
> RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
> RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
> RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
> RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
> RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
> RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
> RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
> RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767
> 

Worst-case on 2.6.5 + 3.18.20 is 67 us here, after 10 hrs runtime on
imx6q - definitely not 30 us - stressing the latency test with:

- dd loop (zero -> null, 16M bs)
- switchtest -s 200

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-05  9:14 ` Philippe Gerum
  2018-07-05  9:50   ` Federico Sbalchiero
  2018-07-06  8:41   ` Federico Sbalchiero
@ 2018-07-06  9:09   ` Federico Sbalchiero
  2018-07-06  9:12     ` Philippe Gerum
  2 siblings, 1 reply; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06  9:09 UTC (permalink / raw)
  To: xenomai

On 05/07/2018 11:14, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsen a lot running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
> your hw and let us know whether you see the same regression?
>
> TIA,
>

In the same configuration (kernel 3.18.20-ipipe + Xenomai 3.0.7),
dd if=/dev/zero of=/dev/null has almost no effect on latency.
I think all the data is written to a few small buffers, so the L2 cache
is not stressed.

Latency under load (four dd if=/dev/zero of=/dev/null instances):

RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
RTD|      1.093|      2.078|      7.757|       0| 0|      1.093|      7.757
RTD|      1.161|      2.090|      5.391|       0| 0|      1.093|      7.757
RTD|      1.128|      2.079|      6.537|       0| 0|      1.093|      7.757
RTD|      1.155|      2.140|      9.926|       0| 0|      1.093|      9.926
RTD|      1.117|      2.086|      9.483|       0| 0|      1.093|      9.926
RTD|      1.130|      2.079|      6.362|       0| 0|      1.093|      9.926
RTD|      1.165|      2.069|      5.506|       0| 0|      1.093|      9.926
RTD|      1.119|      2.066|      5.304|       0| 0|      1.093|      9.926
RTD|      1.202|      2.092|     10.538|       0| 0|      1.093|     10.538
RTD|      1.139|      2.071|      5.649|       0| 0|      1.093|     10.538
RTD|      1.169|      2.080|      5.604|       0| 0|      1.093|     10.538
RTD|      1.134|      2.065|      5.586|       0| 0|      1.093|     10.538
RTD|      1.027|      2.073|      5.452|       0| 0|      1.027|     10.538
RTD|      1.103|      2.188|      9.456|       0| 0|      1.027|     10.538
RTD|      1.070|      2.075|      5.694|       0| 0|      1.027|     10.538
RTD|      1.140|      2.074|      7.459|       0| 0|      1.027|     10.538
RTD|      1.135|      2.225|      9.373|       0| 0|      1.027|     10.538
RTD|      1.168|      2.081|      6.486|       0| 0|      1.027|     10.538
RTD|      1.062|      2.106|      9.617|       0| 0|      1.027|     10.538
RTD|      1.127|      2.161|      9.284|       0| 0|      1.027|     10.538
RTD|      1.079|      2.092|      5.480|       0| 0|      1.027|     10.538


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:41   ` Federico Sbalchiero
@ 2018-07-06  9:10     ` Philippe Gerum
  0 siblings, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:10 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:41 AM, Federico Sbalchiero wrote:
> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their
>>> job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsen a lot running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>> your hw and let us know whether you see the same regression?
>>
>> TIA,
>>
> 
> kernel 3.18.20-ipipe + xenomai 3.0.7
> 
> latency under load (four memwrite instances)
> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     24.985|     41.374|     76.351|       0| 0|     24.985|     76.351
> RTD|     26.889|     41.203|     68.070|       0| 0|     24.985|     76.351
> RTD|     22.828|     41.376|     67.681|       0| 0|     22.828|     76.351
> RTD|     20.969|     41.043|     74.143|       0| 0|     20.969|     76.351
> RTD|     27.027|     41.441|     68.037|       0| 0|     20.969|     76.351
> RTD|     24.413|     41.585|     81.062|       0| 0|     20.969|     81.062
> RTD|     27.234|     41.168|     76.516|       0| 0|     20.969|     81.062
> RTD|     23.779|     41.141|     70.466|       0| 0|     20.969|     81.062
> RTD|     24.824|     41.273|     75.322|       0| 0|     20.969|     81.062
> RTD|     25.627|     41.195|     71.157|       0| 0|     20.969|     81.062
> RTD|     28.874|     41.089|     66.579|       0| 0|     20.969|     81.062
> RTD|     26.672|     41.638|     75.995|       0| 0|     20.969|     81.062
> RTD|     25.139|     41.040|     69.543|       0| 0|     20.969|     81.062
> RTD|     26.215|     41.099|     66.336|       0| 0|     20.969|     81.062
> RTD|     24.192|     41.117|     76.828|       0| 0|     20.969|     81.062
> RTD|     27.310|     41.942|     79.888|       0| 0|     20.969|     81.062
> RTD|     24.348|     40.955|     66.484|       0| 0|     20.969|     81.062
> RTD|     26.679|     41.260|     80.242|       0| 0|     20.969|     81.062
> RTD|     26.820|     41.251|     74.986|       0| 0|     20.969|     81.062
> RTD|     27.635|     41.301|     73.961|       0| 0|     20.969|     81.062
> RTD|     26.877|     41.305|     72.789|       0| 0|     20.969|     81.062
> 
> 

Ok, if all goes well, we should soon be able to see the worst-case
latency drop to ~65 us under high mm stress on i.MX6q over 4.14; the
fewer the cores, the better the results with the i.MX6 series.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:09   ` Federico Sbalchiero
@ 2018-07-06  9:12     ` Philippe Gerum
  2018-07-06  9:14       ` Philippe Gerum
  0 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:12 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 11:09 AM, Federico Sbalchiero wrote:
> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>> Hi,
>>> first I want to say thanks to everyone involved in Xenomai for their
>>> job.
>>>
>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>> This is after disabling power management, frequency scaling, CMA,
>>> graphics, tracing, debug.
>>>
>>> I have found that a simple non-realtime user space process writing a
>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>> Latency worsen a lot running a copy of the process on each core.
>>> There is a correlation between buffer size and cache size suggesting
>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>> list, but I can confirm L2 WA is disabled (see log).
>>>
>>> I'm looking for comments or suggestions.
>>>
>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>> your hw and let us know whether you see the same regression?
>>
>> TIA,
>>
> 
> in the same configuration (kernel 3.18.20-ipipe + xenomai 3.0.7)
> dd if=/dev/zero of=/dev/null has almost no effect on latency.
> I think all data write to a few small buffers, not stressing L2
> cache.
> 

You need to set a large block size for dd.


-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:12     ` Philippe Gerum
@ 2018-07-06  9:14       ` Philippe Gerum
  0 siblings, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-06  9:14 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 11:12 AM, Philippe Gerum wrote:
> On 07/06/2018 11:09 AM, Federico Sbalchiero wrote:
>> Il 05/07/2018 11:14, Philippe Gerum ha scritto:
>>> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>>>> Hi,
>>>> first I want to say thanks to everyone involved in Xenomai for their
>>>> job.
>>>>
>>>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>>>> sabresd board using Yocto. System boots fine and is stable, but latency
>>>> under load (xeno-test) is higher than in my reference system (Xenomai
>>>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>>>> This is after disabling power management, frequency scaling, CMA,
>>>> graphics, tracing, debug.
>>>>
>>>> I have found that a simple non-realtime user space process writing a
>>>> buffer in memory (memwrite) is able to trigger such high latencies.
>>>> Latency worsen a lot running a copy of the process on each core.
>>>> There is a correlation between buffer size and cache size suggesting
>>>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>>>> list, but I can confirm L2 WA is disabled (see log).
>>>>
>>>> I'm looking for comments or suggestions.
>>>>
>>> A basic dd if=/dev/zero of=/dev/null loop in the background is enough to
>>> raise the latency actually. Could you try the Xenomai 3 + 3.18 combo on
>>> your hw and let us know whether you see the same regression?
>>>
>>> TIA,
>>>
>>
>> in the same configuration (kernel 3.18.20-ipipe + xenomai 3.0.7)
>> dd if=/dev/zero of=/dev/null has almost no effect on latency.
>> I think all data write to a few small buffers, not stressing L2
>> cache.
>>
> 
> You need to set a large block size for dd.
> 
> 

bs=16M suffices to raise the worst-case quite significantly on i.MX6QP
(sabresd) here.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  9:04 ` Philippe Gerum
@ 2018-07-06 10:35   ` Federico Sbalchiero
  0 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-06 10:35 UTC (permalink / raw)
  To: xenomai

On 06/07/2018 11:04, Philippe Gerum wrote:
> On 07/04/2018 07:06 PM, Federico Sbalchiero wrote:
>> Hi,
>> first I want to say thanks to everyone involved in Xenomai for their job.
>>
>> I'm testing Xenomai 3.0.7 and ipipe-arm/4.14 on Freescale/NXP i.MX6q
>> sabresd board using Yocto. System boots fine and is stable, but latency
>> under load (xeno-test) is higher than in my reference system (Xenomai
>> 2.6.5 on Freescale kernel 3.10.17 + ipipe 3.10.18).
>> This is after disabling power management, frequency scaling, CMA,
>> graphics, tracing, debug.
>>
>> I have found that a simple non-realtime user space process writing a
>> buffer in memory (memwrite) is able to trigger such high latencies.
>> Latency worsen a lot running a copy of the process on each core.
>> There is a correlation between buffer size and cache size suggesting
>> an L2 cache issue, like the L2 write allocate discussed in the mailing
>> list, but I can confirm L2 WA is disabled (see log).
>>
>> I'm looking for comments or suggestions.
>>
>> Thanks,
>> Federico
>>
>>
>> "memwrite" test case:
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <string.h>
>> unsigned char *buffer;
>> int main(int argc, char **argv)
>> {
>>      int i;
>>      int count = 0;
>>      int n;
>>      int size = 10 * 1024 * 1024;
>>      volatile unsigned *pt;
>>      printf("load system by writing in memory\n");
>>      buffer = malloc(size);
>>      if (buffer == NULL) {
>>          printf("buffer allocation failed\n");
>>          exit(1);
>>      }
>>      n = size / sizeof(unsigned);
>>      while (1) {
>>          // write some data to memory buffer
>>          pt = (unsigned *) buffer;
>>          for (i = 0; i < n; i++)
>>              *pt++ = i;
>>          count++;
>>      }
>>      return 0;
>> }
>>
>> xeno-test on Xenomai 3.0.7 and ipipe-arm/4.14:
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|     18.000|     26.504|     42.667|       0|     0| 18.000|     42.667
>> RTD|     19.000|     25.198|     41.000|       0|     0| 18.000|     42.667
>> RTD|     18.999|     25.494|     40.999|       0|     0| 18.000|     42.667
>> RTD|     18.666|     25.060|     38.999|       0|     0| 18.000|     42.667
>> RTD|     18.999|     24.464|     38.332|       0|     0| 18.000|     42.667
>> RTD|     18.332|     24.546|     41.999|       0|     0| 18.000|     42.667
>> RTD|     13.332|     22.445|     45.665|       0|     0| 13.332|     45.665
>> RTD|     13.331|     21.164|     43.665|       0|     0| 13.331|     45.665
>> RTD|     13.331|     21.930|     43.665|       0|     0| 13.331|     45.665
>> RTD|     13.331|     22.254|     48.664|       0|     0| 13.331|     48.664
>> RTD|     13.331|     22.037|     46.664|       0|     0| 13.331|     48.664
>> RTD|     13.330|     21.053|     42.664|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.610|     37.330|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.520|     34.997|       0|     0| 13.330|     48.664
>> RTD|     13.330|     20.398|     39.330|       0|     0| 13.330|     48.664
>> RTD|     13.663|     21.249|     37.996|       0|     0| 13.330|     48.664
>> RTD|     13.329|     20.983|     35.663|       0|     0| 13.329|     48.664
>> RTD|     12.996|     20.039|     34.329|       0|     0| 12.996|     48.664
>> RTD|     13.329|     20.580|     42.662|       0|     0| 12.996|     48.664
>> RTD|     12.995|     20.518|     39.329|       0|     0| 12.995|     48.664
>> RTD|     13.328|     20.168|     35.662|       0|     0| 12.995|     48.664
>>
>> xeno-test on Xenomai 2.6.5 and Freescale Linux 3.10.17 + ipipe 3.10.18:
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>> RTD|      4.957|     17.575|     28.088|       0|     0| 4.957|     28.088
>> RTD|      4.904|     17.560|     26.828|       0|     0| 4.904|     28.088
>> RTD|      4.479|     13.472|     29.767|       0|     0| 4.479|     29.767
>> RTD|      4.522|     12.724|     23.275|       0|     0| 4.479|     29.767
>> RTD|      4.512|     12.904|     25.641|       0|     0| 4.479|     29.767
>> RTD|      4.542|     12.818|     27.878|       0|     0| 4.479|     29.767
>> RTD|      4.520|     13.068|     27.926|       0|     0| 4.479|     29.767
>> RTD|      4.409|     12.770|     26.689|       0|     0| 4.409|     29.767
>> RTD|      4.568|     12.265|     27.065|       0|     0| 4.409|     29.767
>> RTD|      4.492|     12.017|     25.898|       0|     0| 4.409|     29.767
>> RTD|      4.469|     12.303|     24.540|       0|     0| 4.409|     29.767
>> RTD|      4.489|     12.030|     27.924|       0|     0| 4.409|     29.767
>> RTD|      4.590|     11.851|     23.651|       0|     0| 4.409|     29.767
>> RTD|      4.479|     13.371|     24.838|       0|     0| 4.409|     29.767
>> RTD|      4.396|     13.204|     28.797|       0|     0| 4.396|     29.767
>> RTD|      4.411|     12.454|     26.002|       0|     0| 4.396|     29.767
>> RTD|      4.560|     12.234|     27.146|       0|     0| 4.396|     29.767
>> RTD|      4.593|     12.441|     24.686|       0|     0| 4.396|     29.767
>> RTD|      4.520|     12.510|     24.275|       0|     0| 4.396|     29.767
>> RTD|      4.568|     11.797|     24.982|       0|     0| 4.396|     29.767
>> RTD|      4.482|     12.631|     24.972|       0|     0| 4.396|     29.767
>>
> Worst-case on 2.6.5 + 3.18.20 is 67 us here, after 10 hrs runtime on
> imx6q - definitely not 30 us - stressing the latency test with:
>
> - dd loop (zero -> null, 16M bs)
> - switchtest -s 200
>
The 30 us worst case is over a very short run (1-2 minutes) with just
one instance of memwrite in the background.
Using the dd loop and switchtest gives 50 us in the short term. I
suppose this compares reasonably with the 67 us seen after 10 hours.

I also confirm that
dd if=/dev/zero of=/dev/null bs=16M
has the same effect on latency as memwrite. Thanks.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-06  8:07     ` Federico Sbalchiero
  2018-07-06  8:32       ` Philippe Gerum
@ 2018-07-08  9:41       ` Philippe Gerum
  1 sibling, 0 replies; 16+ messages in thread
From: Philippe Gerum @ 2018-07-08  9:41 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
> write allocate:
> 
> [    0.000000] L2C-310 errata 752271 769419 enabled
> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
> 

Adding a break at this location defeats the purpose of the code: you
want it to fall through in order to update the control bits, as seen
from the value of the auxiliary control register above. With the break,
0x76470001 shows bit 23 (0x800000) cleared, whereas it should be set in
order to disable fetch-on-write upon write misses. FWIW, I don't see
any issue in the original code, although the logic should be made more
obvious.
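
To make that concrete, decoding bit 23 in the two AUX_CTRL values seen
in this thread gives the following; a minimal user-space sketch, where
the macro name is made up here and only the 0x800000 bit and the
register values come from the logs:

#include <stdio.h>

/* PL310 AUX_CTRL bit 23: "force no write allocate" (0x800000). */
#define L310_FORCE_NO_WA	0x800000u

static void decode(unsigned int aux)
{
	printf("AUX_CTRL 0x%08x: write-allocate %s\n", aux,
	       (aux & L310_FORCE_NO_WA) ? "forced off (bit 23 set)"
					 : "left enabled (bit 23 clear)");
}

int main(void)
{
	decode(0x76470001);	/* value logged after adding the break */
	decode(0x76c70001);	/* value from the WA=0 boot logs in this thread */
	return 0;
}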

> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
<snip>
> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665

So these figures above were actually obtained with write_allocate=1;
they are consistent with the results obtained with cache units older
than r3p2. Your cache unit advertises itself as r3p1-50rel0, where
enabling write-allocate leads to poor, even ugly, performance.

If you leave the test running for several hours under a proper load, I
believe you should see the latency figures skyrocket way above 100 us.

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Xenomai]  i.MX6q memory write causes high latency
  2018-07-06  8:32       ` Philippe Gerum
@ 2018-07-08 13:45         ` Philippe Gerum
  2018-07-16  9:05           ` Federico Sbalchiero
  0 siblings, 1 reply; 16+ messages in thread
From: Philippe Gerum @ 2018-07-08 13:45 UTC (permalink / raw)
  To: Federico Sbalchiero, xenomai

On 07/06/2018 10:32 AM, Philippe Gerum wrote:
> On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
>> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
>> write allocate:
>>
>> [    0.000000] L2C-310 errata 752271 769419 enabled
>> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
>> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
>> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
>> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>>
>>
>> latency under load (four memwrite instances) is better but still high.
>>
>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat
>> best|--lat worst
>> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
>> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
>> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
>> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
>> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
>> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
>> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
>> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
>> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
>> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
>> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
>> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
>> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
>> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
>> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
>> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
>> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
>> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
>> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
>> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
>> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
>>
> 
> According to my latest tests, waiting for operations to complete in the
> cache unit induces most of the delay. I'm under the impression that the
> way we deal with the outer L2 cache is obsolete, based on past
> assumptions which may not be valid anymore. Typically, some of them
> would involve events that might occur with VIVT caches, which we don't
> support in 4.14.
> 
> The whole logic requires a fresh review. I'll follow up on this.
> 

I ran extensive tests on two SoCs equipped with PL310 L2 caches: an
i.MX6QP (sabresd) and a VIA pico-ITX, which is also a quad-core i.MX6Q
(although much older).

Background stress load while sampling latencies:

- while :; do dd if=/dev/zero of=/dev/null bs=16M; done&
- switchtest -s 200
- ethernet bandwidth testing with iperf to and from the SoC, only for
the purpose of hammering the system with lots of DMA transfers via the
FEC driver, which in turn causes a continuous flow of l2 cache
maintenance operations for cleaning / invalidating ranges of DMA-ed
cachelines.

With the very same Xenomai 3.0.7 over kernel 4.14.36 configuration
(I-pipe/4.14 commit [1]), SMP, all debug switches and tracers disabled,
only toggling l2x0_write_allocate on/off, the results were:

==========================================================
VIA (996 Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P1_50REL0)
==========================================================

* kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1

L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTD|      4.212|      9.852|     57.243|       0|     0|
07:07:43/07:07:43

-----------------------------------------------------------------------------

* kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1

L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.996|     16.472|     93.579|       0|     0|
03:12:12/03:12:12

=====================================================
IMX6QP (996Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P2)
=====================================================

* kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1

L2C: I-pipe: revision >= L310-r3p2 detected, forcing WA.
L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
L2C-310 erratum 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76470001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.516|     14.070|     71.581|       0|     0|
03:28:03/03:28:03

-----------------------------------------------------------------------------

* kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1

L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
L2C-310 erratum 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76c70001

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.332|     14.991|     77.969|       0|     0|
09:55:24/09:55:24

Some (partial) conclusions drawn from what I have been seeing here:

1. The latency penalty of enabling write-allocate on PL310 caches seems
to have decreased since R3P2. Conversely, R3P1_50REL0 and earlier have
better latency figures when write-allocate is disabled (I seem to
remember that early sabrelite boards would even show pathological
figures, in the 300 us range, with WA=1). However, disabling WA for
PL310 cache revs >= R3P2 actually seems counter-productive, since it
slows down memory accesses uselessly with no upside. It might even
cause cache coherency issues on SMP, as observed with some SoCs which
died while booting the kernel in this configuration.

2. The spinlock defined in the PL2xx/3xx L2 cache driver serializes
non-atomic maintenance operation requests on the cache unit. Converting
it to a hard lock is not required, as we cannot run any outer cache
maintenance operation from primary mode; all callers belong to the root
stage (including handle_pte_fault()). There is no issue in being
preempted by out-of-band code while performing such an operation, so
virtually disabling interrupts has to be enough. Conversely, hard
locking increases latency, since it hard-disables IRQs. The contexts
that would require hard locking are infrequently seen these days,
though:

- all operations on PL220 caches
- PL310 with errata 588369 && id < R2P0
- PL310 with errata 727915 && id >= R2P0 && id < R3P1

>> [    0.000000] L2C-310 errata 752271 769419 enabled

According to the errata advertised by your hardware, you should not be
affected by such a locking issue.

All other combinations of cache types, revisions and errata do run
locklessly already, so they can't be affected by such hard locking. The
change introducing the hard lock in arch/arm/mm/cache-l2x0.c should be
reverted. As a consequence of this, limiting the bulk operations to 512
cache lines at a time would not be needed anymore either, since those
potentially lengthy operations could still be preempted by real-time
activities.

Tip #1: mind the CPU frequency when comparing tests on similar hardware.
If CPU_FREQ is off, some processors may be left running by the bootloader
at a lower speed than they are capable of. CONFIG_ARM_IMX6Q_CPUFREQ can
be enabled with the "performance" CPU_FREQ governor to make sure the
highest speed is picked and remains stable over time, so as not to
confuse Xenomai timings. Disabling CPU_FREQ entirely has been an ancient
mantra for configuring Xenomai in the past; maybe the message should
evolve, because things may not be that clear-cut depending on the SoC.
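
A quick way to double-check is to read the current frequency from the
cpufreq sysfs interface before and during a run; a minimal sketch,
assuming the usual scaling_cur_freq path is exposed by your kernel:

#include <stdio.h>

/* Print the current CPU0 clock as reported by cpufreq (kHz in sysfs). */
int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
	unsigned long khz;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu", &khz) != 1) {
		fprintf(stderr, "unexpected content in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("cpu0 running at %lu MHz\n", khz / 1000);
	return 0;
}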

Tip #2: any test which does not run for hours under significant load is
unlikely to deliver any meaningful figure, at least not on my SoCs. On a
couple of occasions, the worst-case latency was reached only after 2h+
of runtime.

So, I would assume that a ~70 us worst case should be achievable under
high load on a typical i.MX6Q with WA=1 on L2 cache revision >= R3P2,
running at 1 GHz. For earlier revisions, WA=0 may be required to reach
this target, possibly with even lower/better figures, but there is no
guarantee that the results seen on the VIA SoC with such settings can
be generalized.

If anyone has different or similar results/conclusions about the impact
of l2 with the i.MX6Q series, please let us know.

Thanks,

[1]
https://git.xenomai.org/ipipe-arm/commit/a1bd0cc70391df28ad6678a86c77ce0bf275b5cd

-- 
Philippe.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Xenomai] i.MX6q memory write causes high latency
  2018-07-08 13:45         ` Philippe Gerum
@ 2018-07-16  9:05           ` Federico Sbalchiero
  0 siblings, 0 replies; 16+ messages in thread
From: Federico Sbalchiero @ 2018-07-16  9:05 UTC (permalink / raw)
  To: xenomai

On 08/07/2018 15:45, Philippe Gerum wrote:
> On 07/06/2018 10:32 AM, Philippe Gerum wrote:
>> On 07/06/2018 10:07 AM, Federico Sbalchiero wrote:
>>> adding a break at line 837 in file /arch/arm/mm/cache-l2x0.c enables L2
>>> write allocate:
>>>
>>> [    0.000000] L2C-310 errata 752271 769419 enabled
>>> [    0.000000] L2C-310 enabling early BRESP for Cortex-A9
>>> [    0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>>> [    0.000000] L2C-310 ID prefetch enabled, offset 16 lines
>>> [    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>>> [    0.000000] L2C-310 cache controller enabled, 16 ways, 1024 kB
>>> [    0.000000] L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>>>
>>>
>>> latency under load (four memwrite instances) is better but still high.
>>>
>>> RTT|  00:00:01  (periodic user-mode task, 1000 us period, priority 99)
>>> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|---lat best|--lat worst
>>> RTD|     42.667|     58.521|     87.667|       0|     0| 42.667|     87.667
>>> RTD|     42.000|     58.935|     89.000|       0|     0| 42.000|     89.000
>>> RTD|     36.666|     58.707|     90.333|       0|     0| 36.666|     90.333
>>> RTD|     38.333|     58.439|     92.666|       0|     0| 36.666|     92.666
>>> RTD|     41.666|     58.595|     84.999|       0|     0| 36.666|     92.666
>>> RTD|     42.666|     58.698|     89.666|       0|     0| 36.666|     92.666
>>> RTD|     40.999|     58.999|     95.665|       0|     0| 36.666|     95.665
>>> RTD|     42.665|     58.823|     88.665|       0|     0| 36.666|     95.665
>>> RTD|     42.665|     58.570|     84.665|       0|     0| 36.666|     95.665
>>> RTD|     41.331|     58.599|     86.998|       0|     0| 36.666|     95.665
>>> RTD|     37.664|     58.596|     92.331|       0|     0| 36.666|     95.665
>>> RTD|     35.331|     58.893|     85.997|       0|     0| 35.331|     95.665
>>> RTD|     41.997|     58.704|     86.997|       0|     0| 35.331|     95.665
>>> RTD|     40.997|     58.723|     94.997|       0|     0| 35.331|     95.665
>>> RTD|     41.330|     58.710|     88.997|       0|     0| 35.331|     95.665
>>> RTD|     41.330|     59.080|     92.663|       0|     0| 35.331|     95.665
>>> RTD|     38.330|     58.733|     85.996|       0|     0| 35.331|     95.665
>>> RTD|     39.996|     59.095|     90.663|       0|     0| 35.331|     95.665
>>> RTD|     41.662|     58.967|     86.662|       0|     0| 35.331|     95.665
>>> RTD|     42.662|     58.884|     86.995|       0|     0| 35.331|     95.665
>>> RTD|     42.662|     58.852|     88.329|       0|     0| 35.331|     95.665
>>>
>> According to my latest tests, waiting for operations to complete in the
>> cache unit induces most of the delay. I'm under the impression that the
>> way we deal with the outer L2 cache is obsolete, based on past
>> assumptions which may not be valid anymore. Typically, some of them
>> would involve events that might occur with VIVT caches, which we don't
>> support in 4.14.
>>
>> The whole logic requires a fresh review. I'll follow up on this.
>>
> I ran extensive tests on two SoCs equipped with PL310 l2 caches,
> i.MX6QP (sabresd) and a VIA pico-ITX which is also a quad-core i.MX6Q
> (much older though).
>
> Background stress load while sampling latencies:
>
> - while :; do dd if=/dev/zero of=/dev/null bs=16M; done&
> - switchtest -s 200
> - ethernet bandwidth testing with iperf to and from the SoC, only for
> the purpose of hammering the system with lots of DMA transfers via the
> FEC driver, which in turn causes a continuous flow of l2 cache
> maintenance operations for cleaning / invalidating ranges of DMA-ed
> cachelines.
>
> With the very same Xenomai 3.0.7 over kernel 4.14.36 configuration
> (I-pipe/4.14 commit [1]), SMP, all debug switches and tracers disabled,
> only toggling l2x0_write_allocate on/off, the results were:
>
> ==========================================================
> VIA (996 Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P1_50REL0)
> ==========================================================
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTD|      4.212|      9.852|     57.243|       0|     0|
> 07:07:43/07:07:43
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 errata 752271 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      0.996|     16.472|     93.579|       0|     0|
> 03:12:12/03:12:12
>
> =====================================================
> IMX6QP (996Mhz, L2 cache rev: L310_CACHE_ID_RTL_R3P2)
> =====================================================
>
> * kernel 4.14.36, WA=1, ipipe-core-4.14.36-arm-1
>
> L2C: I-pipe: revision >= L310-r3p2 detected, forcing WA.
> L2C: I-pipe: write-allocate enabled, induces high latencies.
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32470000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76470001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      2.516|     14.070|     71.581|       0|     0|
> 03:28:03/03:28:03
>
> -----------------------------------------------------------------------------
>
> * kernel 4.14.36, WA=0, ipipe-core-4.14.36-arm-1
>
> L2C: DT/platform modifies aux control register: 0x32070000 -> 0x32c70000
> L2C-310 erratum 769419 enabled
> L2C-310 enabling early BRESP for Cortex-A9
> L2C-310 full line of zeros enabled for Cortex-A9
> L2C-310 ID prefetch enabled, offset 16 lines
> L2C-310 dynamic clock gating enabled, standby mode enabled
> L2C-310 cache controller enabled, 16 ways, 1024 kB
> L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x76c70001
>
> RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
> RTS|      2.332|     14.991|     77.969|       0|     0|
> 09:55:24/09:55:24
>
> Some (partial) conclusions drawn from what I have been seeing here:
>
> 1. The penalty with enabling write-allocate on PL310 caches with respect
> to latency seems to have decreased since R3P2. Conversely, R3P1_50REL0
> and earlier have better latency figures when write-allocate is disabled
> (I seem to remember than early sabrelite boards would even have
> pathological figures, in the 300 us range with WA=1). However, disabling
> WA for PL310 cache revs >= R3P2 seems actually counter-productive since
> it slows down memory accesses uselessly with no upside. This might even
> cause cache coherency issues on SMP as observed with some SoCs which
> died at booting the kernel in this configuration.
>
> 2. The spinlock defined in the PL2xx/3xx L2 cache driver serializes
> non-atomic maintenance operation requests on the cache unit. Converting
> it to a hard lock is not required, as we cannot run any outer cache
> maintenance operation from primary mode, all callers belong to the root
> stage (including handle_pte_fault()). There is no issue in being
> preempted by out-of-band code while performing such operation, virtually
> disabling interrupt has to be enough. Conversely, hard locking increases
> latency since it hard disables IRQs. Contexts that would be involved by
> hard locking are infrequently seen these days though:
>
> - all operations on PL220 caches
> - PL310 with errata 588369 && id < R2P0
> - PL310 with errata 727915 && id >= R2P0 && id < R3P1
>
>>> [    0.000000] L2C-310 errata 752271 769419 enabled
> According to the errata advertised by your hardware, you should not be
> affected by such locking issue.
>
> All other combinations of cache types, revisions and errata do run
> locklessly already, so they can't be affected by such hard locking. The
> change introducing the hard lock in arch/arm/mm/cache-l2x0.c should be
> reverted. As a consequence of this, limiting the bulk operations to 512
> cache lines at a time would not be needed anymore either, since those
> potentially lengthy operations could still be preempted by real-time
> activities.
>
> Tip #1: care for CPU frequency when comparing tests on similar hardware.
> If CPU_FREQ is off, some processors may be left running at a lower speed
> by the bootloader than they are capable of. CONFIG_ARM_IMX6Q_CPUFREQ can
> be enabled with the "performance" CPU_FREQ governor to make sure the
> highest speed is picked, and remain stable over time not to confuse
> Xenomai timings. Disabling CPU_FREQ entirely has been an ancient mantra
> for configuring Xenomai in the past, maybe the message should evolve
> because things may not be that clear-cut depending on the SoC.
>
> Tip #2: any test which does not run for hours under significant load is
> unlikely to deliver any meaningful figure, at least not on my SoCs. In a
> couple of occasions, the worst-case latency was reached after 2h+ of
> runtime.
>
> So, I would assume that ~70 us worst-case should be achievable under
> high load on a typical i.MX6Q with WA=1 on l2 cache revision >= R3P2,
> running at 1Ghz. For earlier revisions, WA=0 may be required for
> reaching this target, possibly even less/better, but there is no
> guarantee that the results seen on the VIA SoC with such settings could
> be generalized though.
>
> If anyone has different or similar results/conclusions about the impact
> of l2 with the i.MX6Q series, please let us know.
>
> Thanks,
>
> [1]
> https://git.xenomai.org/ipipe-arm/commit/a1bd0cc70391df28ad6678a86c77ce0bf275b5cd
>
I did some more testing on the sabresd, comparing WA=0 to WA=1 and the
hard vs. soft lock.

I'm not very familiar with kernel locking, but I made the following
changes in arch/arm/mm/cache-l2x0.c to convert the hard lock into a
soft lock (roughly as in the sketch below):
- defined l2x0_lock using IPIPE_DEFINE_SPINLOCK
- replaced all raw_spin_lock functions with their spin_lock equivalents
- removed the CACHE_RANGE_ATOMIC_MAX definition and the related
unlock/lock sequences in the cache bulk operations
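
The resulting pattern looks roughly like this; a sketch only, not the
actual driver code: the helper name and line-operation callback are
placeholders, and only l2x0_lock, IPIPE_DEFINE_SPINLOCK and
CACHE_RANGE_ATOMIC_MAX are names taken from the discussion above.

/*
 * Sketch of the change described above: the lock is declared with
 * IPIPE_DEFINE_SPINLOCK and taken with the plain spin_lock_irqsave()
 * variants instead of raw_spin_lock_irqsave(), and the periodic
 * unlock/relock every CACHE_RANGE_ATOMIC_MAX lines is dropped, so a
 * bulk range operation runs as a single critical section that
 * real-time activity may still preempt.
 */
#define CACHE_LINE_SIZE		32	/* PL310 cache line size */

static IPIPE_DEFINE_SPINLOCK(l2x0_lock);

static void l2c_range_op_sketch(unsigned long start, unsigned long end,
				void (*line_op)(unsigned long pa))
{
	unsigned long flags;

	spin_lock_irqsave(&l2x0_lock, flags);
	while (start < end) {
		line_op(start);			/* clean/invalidate one line */
		start += CACHE_LINE_SIZE;
	}
	spin_unlock_irqrestore(&l2x0_lock, flags);
}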

Base configuration used for tests:
- Freescale sabresd board, i.MX6Q at 966 MHz
- kernel 4.14.36, ipipe-core-4.14.36-arm-1 [1]
- CPU_FREQ governor "performance", debug and trace disabled

dmesg with WA=0:
L2C: I-pipe: l2x0_write_allocate= not specified, defaults to 0 (disabled).
L2C: DT/platform modifies aux control register: 0x32470000 -> 0x32c70000
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76c70001

dmesg with WA=1:
L2C: I-pipe: write-allocate enabled, induces high latencies.
L2C-310 errata 752271 769419 enabled
L2C-310 enabling early BRESP for Cortex-A9
L2C-310 full line of zeros enabled for Cortex-A9
L2C-310 ID prefetch enabled, offset 16 lines
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 16 ways, 1024 kB
L2C-310: CACHE_ID 0x410000c7, AUX_CTRL 0x76470001

--------------------------------------------------------
     TEST 1
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.811|     13.025|     72.505|       0|     0|
03:38:14/03:38:14


--------------------------------------------------------
     TEST 2
--------------------------------------------------------
WA=1, hard l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      2.292|     13.216|     66.989|       0|     0|
04:08:19/04:08:19


--------------------------------------------------------
     TEST 3
--------------------------------------------------------
WA=0, hard l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      4.612|     43.343|    101.798|       0|     0|
02:20:11/00:20:11


--------------------------------------------------------
     TEST 4
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|     -0.146|      9.265|     60.964|       0|     0|
08:11:29/08:11:29

--------------------------------------------------------
     TEST 5
--------------------------------------------------------
WA=1, soft l2x0_lock
stress load: dd loop + switchtest + iperf

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      0.001|      9.508|     64.094|       0|     0|
05:02:38/05:02:38

--------------------------------------------------------
     TEST 6
--------------------------------------------------------
WA=0, soft l2x0_lock
stress load: 4 dd loops (no switchtest, no iperf)

RTH|----lat min|----lat avg|----lat max|-overrun|---msw|
RTS|      5.093|     28.399|     74.389|       0|     0|
06:41:53/06:41:53


My conclusions:

1) Overall I see little correlation between WA settings and latency
on my hardware, possibly just noise.

2) Making cache bulk operations preemptible (as Philippe suggested)
improves latency and is a big win in the "4 dd stress" test.

3) This also confirms that this test case is very good at exposing L2
cache issues.

What's next?
Any comments/suggestions are welcome. If there are other ideas to
improve Xenomai/ipipe on i.MX6, I will try to explore them.

If someone is interested, I can post a "make l2x0 cache lock
preemptible" patch for further testing.



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2018-07-04 17:06 [Xenomai] i.MX6q memory write causes high latency Federico Sbalchiero
2018-07-04 23:14 ` Tom Evans
2018-07-05  9:14 ` Philippe Gerum
2018-07-05  9:50   ` Federico Sbalchiero
2018-07-06  8:41   ` Federico Sbalchiero
2018-07-06  9:10     ` Philippe Gerum
2018-07-06  9:09   ` Federico Sbalchiero
2018-07-06  9:12     ` Philippe Gerum
2018-07-06  9:14       ` Philippe Gerum
     [not found] ` <CALLqZ8SuY4K=PFP6LeCHXU9PXoXnmDv4UCLnqHmnz-8xc3CacA@mail.gmail.com>
     [not found]   ` <CALLqZ8Q3uZm8jTmtJLHR=Yz3Cw5g+CGTrXaj-oUFPT=1Rh0JbA@mail.gmail.com>
2018-07-06  8:07     ` Federico Sbalchiero
2018-07-06  8:32       ` Philippe Gerum
2018-07-08 13:45         ` Philippe Gerum
2018-07-16  9:05           ` Federico Sbalchiero
2018-07-08  9:41       ` Philippe Gerum
2018-07-06  9:04 ` Philippe Gerum
2018-07-06 10:35   ` Federico Sbalchiero
