* [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
       [not found] <543E4B9F.60602@cgglobal.com>
@ 2014-10-15 10:59 ` ZIV-Alberto Ozalla Cantabrana
  2014-10-15 11:11   ` Gilles Chanteperdrix
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: ZIV-Alberto Ozalla Cantabrana @ 2014-10-15 10:59 UTC (permalink / raw)
  To: xenomai


Dear colleagues,

I am seeing an unexpected switch to secondary mode right after the very first call to rt_timer_tsc() inside a real-time task created by Xenomai.
The switch only occurs on that very first call to rt_timer_tsc().
Calling rt_timer_tsc() once before creating the task avoids the switch.
I cannot figure out where the problem is.
Any help would be welcome.

Thanks in advance.
Alberto Ozalla


Simplest possible self-contained test case:

RT_TASK Test_Task_descriptor;

void Test_Task(void *arg)
{
    // Arguments: &task (NULL=self), start time, period (here: 1.001 ms)
    rt_task_set_periodic(NULL, TM_NOW, 1001000);

    while(1) {

        // Wait for the next periodic release point.
        rt_task_wait_period(NULL);

        rt_timer_tsc();    // Very first call causes an unexpected switch to secondary mode! - Removing this call works OK.
    }
}

int main()
{
    ...
        // rt_timer_tsc();  // Calling rt_timer_tsc() before creating the task, solves the unexpected switch to secondary mode.

        // Real-time function creation.
        if (int err = rt_task_create(&Test_Task_descriptor, "Test_Task", 0, 90, 0)) {
            log(CU_LOG_ERR, "Error creating Test_Task (%d)", err);
            return FALSE;
        }

        // Real-time function started.
        if(rt_task_start(&Test_Task_descriptor, &Test_Task, NULL)) {
            log(CU_LOG_ERR, "Error starting Test_Task");
            return FALSE;
        }
    ...
}

Additional information:


  *   Xenomai version:

/proc/xenomai # cat version
2.6.3

  *   Configuration knobs passed to the configure script, used in building the Xenomai libraries:

:~/pro/npcp/output/beagleboneblack-npcptest/build/xenomai-2.6.git_09072014$ grep configure config.status
# Generated by configure.
# Compiler output produced by configure, useful for debugging
# configure, is in config.log if it exists.
configured by ./configure, generated by GNU Autoconf 2.69,
ac_configure_extra_args=
  ac_configure_extra_args="$ac_configure_extra_args --silent"
  set X /bin/bash './configure'  '--prefix=/home/aozalla/pro/npcp/output/beagleboneblack-npcptest/stage' '--build=x86_64-pc-linux-gnu' '--host=arm-linux-uclibcgnueabihf' '--target=arm-linux-uclibcgnueabihf' '--disable-rpath' '--disable-shared' '--enable-static' '--disable-devel' '--disable-ipv6' '--disable-libipq' '--with-kernel=/home/aozalla/pro/npcp/output/beagleboneblack-npcptest/build/linux-3.14.9' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=arm-linux-uclibcgnueabihf' 'target_alias=arm-linux-uclibcgnueabihf' 'CC=/home/aozalla/pro/npcp/output/beagleboneblack-npcptest/stage/bin/arm-linux-uclibcgnueabihf-gcc' 'CFLAGS=-Os -pipe -D_REENTRANT -fomit-frame-pointer -march=armv7-a -mtune=cortex-a8  -mfpu=vfpv3-d16 -finline-functions -finline-limit=50 -ffast-math -include /home/aozalla/pro/npcp/toolchain/include/builtin_redefines.h   -DMAX_PCI_SLOTS=8 -fPIC -DPIC -DNO_LARGEFILE_SOURCE -U_LARGEFILE_SOURCE -U_LARGE_FILES -U_FILE_OFFSET_BITS' 'LDFLAGS=' 'CPP=/home/aozalla/pro/npcp/output/beagleboneblack-npcptest/stage/bin/arm-linux-uclibcgnueabihf-cpp' $ac_configure_extra_args --no-create --no-recursion
    # on some systems where configure will not decide to define it.
    # Let's still pretend it is `configure' which instantiates (i.e., don't
    configure_input='Generated from '`
    `' by configure.'
      configure_input="$ac_file.  $configure_input"
    case $configure_input in #(
       ac_sed_conf_input=`$as_echo "$configure_input" |
    *) ac_sed_conf_input=$configure_input;;
s|@configure_input@|$ac_sed_conf_input|;t t
      $as_echo "/* $configure_input  */" \
    $as_echo "/* $configure_input  */" \
# Libtool was configured on host `(hostname || uname -n) 2>/dev/null | sed 1q`:



  *   Booting logs:

U-Boot SPL 2013.04-dirty (Jul 10 2013 - 14:02:53)
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Peripheral mode controller at 47401000 using PIO, IRQ 0
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Host mode controller at 47401800 using PIO, IRQ 0
OMAP SD/MMC: 0
mmc_send_cmd : timeout: No status update
reading u-boot.img
reading u-boot.img


U-Boot 2013.04-dirty (Jul 10 2013 - 14:02:53)

I2C:   ready
DRAM:  512 MiB
WARNING: Caches not enabled
NAND:  No NAND device found!!!
0 MiB
MMC:   OMAP SD/MMC: 0, OMAP SD/MMC: 1
*** Warning - readenv() failed, using default environment

musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Peripheral mode controller at 47401000 using PIO, IRQ 0
musb-hdrc: ConfigData=0xde (UTMI-8, dyn FIFOs, HB-ISO Rx, HB-ISO Tx, SoftConn)
musb-hdrc: MHDRC RTL version 2.0
musb-hdrc: setup fifo_mode 4
musb-hdrc: 28/31 max ep, 16384/16384 memory
USB Host mode controller at 47401800 using PIO, IRQ 0
Net:   <ethaddr> not set. Validating first E-fuse MAC
cpsw, usb_ether
Hit any key to stop autoboot:  0
gpio: pin 53 (gpio 53) value is 1
Card did not respond to voltage select!
mmc0(part 0) is current device
Card did not respond to voltage select!
No micro SD card found, setting mmcdev to 1
mmc_send_cmd : timeout: No status update
mmc1(part 0) is current device
mmc_send_cmd : timeout: No status update
gpio: pin 54 (gpio 54) value is 1
SD/MMC found on device 1
reading uEnv.txt
26 bytes read in 3 ms (7.8 KiB/s)
Loaded environment from uEnv.txt
Importing environment from mmc ...
gpio: pin 55 (gpio 55) value is 1
2544928 bytes read in 371 ms (6.5 MiB/s)
gpio: pin 56 (gpio 56) value is 1
31371 bytes read in 50 ms (612.3 KiB/s)
Booting from mmc ...
## Booting kernel from Legacy Image at 80007fc0 ...
   Image Name:   Linux-3.14.9
   Image Type:   ARM Linux Kernel Image (uncompressed)
   Data Size:    2544864 Bytes = 2.4 MiB
   Load Address: 80008000
   Entry Point:  80008000
   Verifying Checksum ... OK
## Flattened Device Tree blob at 80f80000
   Booting using the fdt blob at 0x80f80000
   XIP Kernel Image ... OK
OK
   Using Device Tree in place at 80f80000, end 80f8aa8a

Starting kernel ...

[    0.280790] omap_init_mbox: hwmod doesn't have valid attrs
[    1.110098] musb-hdrc musb-hdrc.0.auto: Failed to request rx1.
[    1.116533] musb-hdrc musb-hdrc.0.auto: musb_init_controller failed with status -517
[    1.129661] musb-hdrc musb-hdrc.1.auto: Failed to request rx1.
[    1.136052] musb-hdrc musb-hdrc.1.auto: musb_init_controller failed with status -517
[    1.273438] cpu cpu0: cpu0 regulator not ready, retry
starting pid 1361, tty '/dev/ttyO0': '/etc/rcS'

2000/01/06,18:01:27  kern.notice kernel: klogd started: BusyBox v1.22.1 ()
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Booting Linux on physical CPU 0x0
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Initializing cgroup subsys cpuset
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Initializing cgroup subsys cpu
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Initializing cgroup subsys cpuacct
2000/01/06,18:01:27  kern.notice kernel: [    0.000000] Linux version 3.14.9 (gcc version 4.7.3 (GCC) ) #1 SMP 201
2000/01/06,18:01:27  kern.info kernel: [    0.000000] CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7), cr=10c5387d
2000/01/06,18:01:27  kern.info kernel: [    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Machine model: TI AM335x BeagleBone
2000/01/06,18:01:27  kern.info kernel: [    0.000000] cma: CMA: reserved 16 MiB at 9e800000
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Memory policy: Data cache writeback
2000/01/06,18:01:27  kern.debug kernel: [    0.000000] On node 0 totalpages: 130816
2000/01/06,18:01:27  kern.debug kernel: [    0.000000]   Normal zone: 1024 pages used for memmap
2000/01/06,18:01:27  kern.debug kernel: [    0.000000]   Normal zone: 0 pages reserved
2000/01/06,18:01:27  kern.debug kernel: [    0.000000]   Normal zone: 130816 pages, LIFO batch:31
2000/01/06,18:01:27  kern.info kernel: [    0.000000] CPU: All CPU(s) started in SVC mode.
2000/01/06,18:01:27  kern.info kernel: [    0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
2000/01/06,18:01:27  kern.notice kernel: [    0.000000] Memory: 492048K/523264K available (4919K kernel code, 450K rwdata, 1832K rodata, 467K init, 1418K bss, 31216K reserved, 0K highmem)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000] Virtual kernel memory layout:
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     vmalloc : 0xe0800000 - 0xff000000   ( 488 MB)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
2000/01/06,18:01:27  kern.notice kernel: [    0.000000]     modules : 0xbf000000 - 0xbfe00000   (  14 MB)
2000/01/06,18:01:27  kern.info kernel: [    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
2000/01/06,18:01:27  kern.info kernel: [    0.000000] NR_IRQS:16 nr_irqs:16 16
2000/01/06,18:01:27  kern.info kernel: [    0.000000] IRQ: Found an INTC at 0xfa200000 (revision 5.0) with 128 interrupts
2000/01/06,18:01:27  kern.info kernel: [    0.000000] Total of 128 interrupts on 1 active controller
2000/01/06,18:01:27  kern.info kernel: [    0.000000] OMAP clockevent source: timer2 at 24000000 Hz
2000/01/06,18:01:27  kern.info kernel: [    0.000017] sched_clock: 32 bits at 24MHz, resolution 41ns, wraps every 178956969942ns
2000/01/06,18:01:27  kern.info kernel: [    0.000044] I-pipe, 24.000 MHz clocksource
2000/01/06,18:01:27  kern.info kernel: [    0.000075] OMAP clocksource: timer1 at 24000000 Hz
2000/01/06,18:01:27  kern.info kernel: [    0.132442] CPU: Testing write buffer coherency: ok
2000/01/06,18:01:27  kern.info kernel: [    0.132991] CPU0: thread -1, cpu 0, socket -1, mpidr 0
2000/01/06,18:01:27  kern.info kernel: [    0.133072] Setting up static identity map for 0x804a4920 - 0x804a4978
2000/01/06,18:01:27  kern.info kernel: [    0.134831] Brought up 1 CPUs
2000/01/06,18:01:27  kern.info kernel: [    0.134852] SMP: Total of 1 processors activated.
2000/01/06,18:01:27  kern.info kernel: [    0.134864] CPU: All CPU(s) started in SVC mode.
2000/01/06,18:01:27  kern.info kernel: [    0.366597] pps_core: LinuxPPS API ver. 1 registered
2000/01/06,18:01:27  kern.info kernel: [    0.366616] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
2000/01/06,18:01:27  kern.info kernel: [    0.366887] PTP clock support registered
2000/01/06,18:01:27  kern.info kernel: [    0.372800] Switched to clocksource ipipe_tsc
2000/01/06,18:01:27  kern.info kernel: [    0.442248] NET: Registered protocol family 2
2000/01/06,18:01:28  kern.info kernel: [    0.443539] TCP established hash table entries: 4096 (order: 2, 16384 bytes)
2000/01/06,18:01:28  kern.info kernel: [    0.447153] hw perfevents: enabled with ARMv7 Cortex-A8 PMU driver, 5 counters available
2000/01/06,18:01:28  kern.info kernel: [    0.452330] futex hash table entries: 256 (order: 2, 16384 bytes)
2000/01/06,18:01:28  kern.info kernel: [    0.708933] I-pipe: head domain Xenomai registered.
2000/01/06,18:01:28  kern.info kernel: [    0.708983] Xenomai: hal/arm started.
2000/01/06,18:01:28  kern.info kernel: [    0.710767] Xenomai: scheduling class idle registered.
2000/01/06,18:01:28  kern.info kernel: [    0.710794] Xenomai: scheduling class rt registered.
2000/01/06,18:01:28  kern.info kernel: [    0.727536] Xenomai: real-time nucleus v2.6.3 (Lies and Truths) loaded.
2000/01/06,18:01:28  kern.info kernel: [    0.767422] io scheduler noop registered
2000/01/06,18:01:28  kern.info kernel: [    0.767439] io scheduler deadline registered
2000/01/06,18:01:28  kern.info kernel: [    0.768054] io scheduler cfq registered (default)


Thanks in advance.

Alberto Ozalla


CG DISCLAIMER: This email contains confidential information. It is intended exclusively for the addressees. If you are not an addressee, you must not store, transmit or disclose its contents. Instead please notify the sender immediately; and permanently delete this e-mail from your computer systems. We have taken reasonable precautions to ensure that no viruses are present. However, you must check this email and the attachments, for viruses. We accept no liability whatsoever, for any detriment caused by any transmitted virus.


* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 10:59 ` [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode ZIV-Alberto Ozalla Cantabrana
@ 2014-10-15 11:11   ` Gilles Chanteperdrix
  2014-10-15 13:12     ` ZIV-Alberto Ozalla Cantabrana
  2014-10-15 13:19   ` Gilles Chanteperdrix
  2014-10-16  7:16   ` Gilles Chanteperdrix
  2 siblings, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-15 11:11 UTC (permalink / raw)
  To: alberto.ozalla, xenomai

On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> 
> Dear colleagues,
> 
> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().

Hi,

I see two possible reasons for that:
- either on-demand mapping of library text page
- or first access to the hardware register.

To find out which one it is, could you replace the call to
rt_timer_tsc() with a call to __xn_rdtsc(), after including
asm/xenomai/tsc.h?


-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 11:11   ` Gilles Chanteperdrix
@ 2014-10-15 13:12     ` ZIV-Alberto Ozalla Cantabrana
  2014-10-15 13:16       ` Gilles Chanteperdrix
  2014-10-15 20:03       ` Gilles Chanteperdrix
  0 siblings, 2 replies; 23+ messages in thread
From: ZIV-Alberto Ozalla Cantabrana @ 2014-10-15 13:12 UTC (permalink / raw)
  To: gilles.chanteperdrix; +Cc: xenomai


On 15/10/14 13:11, Gilles Chanteperdrix wrote:
> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>> Dear colleagues,
>>
>> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
>> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().
> Hi,
>
> I see two possible reasons for that:
> - either on-demand mapping of library text page
> - or first access to the hardware register.
>
> In order to find which is the reason.
>
> Could you replace the call to rt_timer_tsc() with a call to __xn_rdtsc()
> after including asm/xenomai/tsc.h ?
>
>
Hello,

Thanks for the fast response.

Unfortunately, replacing the call to rt_timer_tsc() with a call to
__xn_rdtsc() does not solve the problem.

Another clue is that the PF value increases at the same time as MSW.


CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
   0  0      0          11594571   0     00500080   96.3 ROOT/0
   0  1690   1          5339       1     00300184    1.0 Test_Task

It seems to be a first access to the hardware register.

Note: I use a call to mlockall(MCL_CURRENT|MCL_FUTURE) before creating 
the task.
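
To watch those counters from a test harness rather than by eye, the MSW and PF columns can be parsed out of one line of the stat output. This is only a sketch: the struct and the helper parse_stat_line() are hypothetical names, not part of Xenomai, and the column order is assumed to be the one shown in the table above.

```c
#include <stdio.h>

/* Hypothetical container for the counters of interest (not a Xenomai type). */
struct stat_counters {
	int cpu;
	int pid;
	unsigned long msw;	/* mode switches */
	unsigned long csw;	/* context switches */
	unsigned long pf;	/* page faults */
};

/*
 * Parse one data line laid out as in the table above
 * (CPU PID MSW CSW PF ...). Returns 0 on success, -1 otherwise.
 */
static int parse_stat_line(const char *line, struct stat_counters *out)
{
	int n = sscanf(line, "%d %d %lu %lu %lu",
		       &out->cpu, &out->pid, &out->msw, &out->csw, &out->pf);

	return n == 5 ? 0 : -1;
}
```

Comparing two snapshots taken before and after the first rt_timer_tsc() call would then show whether MSW and PF moved together.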

-- 
Saludos,
Alberto Ozalla


* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 13:12     ` ZIV-Alberto Ozalla Cantabrana
@ 2014-10-15 13:16       ` Gilles Chanteperdrix
  2014-10-15 20:03       ` Gilles Chanteperdrix
  1 sibling, 0 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-15 13:16 UTC (permalink / raw)
  To: alberto.ozalla; +Cc: xenomai

On 10/15/2014 03:12 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> 
> On 15/10/14 13:11, Gilles Chanteperdrix wrote:
>> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>>> Dear colleagues,
>>>
>>> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
>>> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().
>> Hi,
>>
>> I see two possible reasons for that:
>> - either on-demand mapping of library text page
>> - or first access to the hardware register.
>>
>> In order to find which is the reason.
>>
>> Could you replace the call to rt_timer_tsc() with a call to __xn_rdtsc()
>> after including asm/xenomai/tsc.h ?
>>
>>
> Hello,
> 
> Thanks for the fast response.
> 
> Unfortunately replacing the call to rt_timer_tsc() with a call to 
> __xn_rdtsc() does not solve the problem.

The intent was not to solve the problem, but to investigate it.

> 
> Another clue is that PF value increases at the same time as MSW.
> 
> 
> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>    0  0      0          11594571   0     00500080   96.3 ROOT/0
>    0  1690   1          5339       1     00300184    1.0 Test_Task
> 
> It seems to be a first access to the hardware register.

It would seem so. I will prepare a piece of code tonight that does not
rely on any external symbol, just to be sure.

> 
> Note: I use a call to mlockall(MCL_CURRENT|MCL_FUTURE) before creating 
> the task.

I do not believe mlockall applies to registers mapped from /dev/mem,
only to memory mappings.


-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 10:59 ` [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode ZIV-Alberto Ozalla Cantabrana
  2014-10-15 11:11   ` Gilles Chanteperdrix
@ 2014-10-15 13:19   ` Gilles Chanteperdrix
  2014-10-15 13:34     ` ZIV-Alberto Ozalla Cantabrana
  2014-10-16  7:16   ` Gilles Chanteperdrix
  2 siblings, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-15 13:19 UTC (permalink / raw)
  To: alberto.ozalla, xenomai

On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> CG DISCLAIMER: This email contains confidential information. It is
> intended exclusively for the addressees. If you are not an addressee,
> you must not store, transmit or disclose its contents. Instead please
> notify the sender immediately; and permanently delete this e-mail
> from your computer systems. We have taken reasonable precautions to
> ensure that no viruses are present. However, you must check this
> email and the attachments, for viruses. We accept no liability
> whatsoever, for any detriment caused by any transmitted virus.

Note that sending a mail to the xenomai mailing list will get that mail
stored, transmitted and disclosed. So the only way for me to comply
with this disclaimer is to unsubscribe you from the mailing list.

-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 13:19   ` Gilles Chanteperdrix
@ 2014-10-15 13:34     ` ZIV-Alberto Ozalla Cantabrana
  0 siblings, 0 replies; 23+ messages in thread
From: ZIV-Alberto Ozalla Cantabrana @ 2014-10-15 13:34 UTC (permalink / raw)
  To: gilles.chanteperdrix; +Cc: xenomai


On 15/10/14 15:19, Gilles Chanteperdrix wrote:
> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>> CG DISCLAIMER: This email contains confidential information. It is
>> intended exclusively for the addressees. If you are not an addressee,
>> you must not store, transmit or disclose its contents. Instead please
>> notify the sender immediately; and permanently delete this e-mail
>> from your computer systems. We have taken reasonable precautions to
>> ensure that no viruses are present. However, you must check this
>> email and the attachments, for viruses. We accept no liability
>> whatsoever, for any detriment caused by any transmitted virus.
> Note that sending a mail to the xenomail mailing list will get this mail
> stored, transmitted and disclosed. So, the only way for me to comply
> with this disclaimer is to unsubscribe you from the mailing list.
>
Sorry.
That clause is automatically appended to every mail. Please disregard
it.
It only applies if you are not the intended addressee.

Please keep me subscribed to the mailing list.
Thanks.

-- 
Saludos,
Alberto Ozalla


* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 13:12     ` ZIV-Alberto Ozalla Cantabrana
  2014-10-15 13:16       ` Gilles Chanteperdrix
@ 2014-10-15 20:03       ` Gilles Chanteperdrix
  2014-10-17 16:33         ` ZIV-Alberto Ozalla Cantabrana
  1 sibling, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-15 20:03 UTC (permalink / raw)
  To: ZIV-Alberto Ozalla Cantabrana; +Cc: xenomai

On Wed, Oct 15, 2014 at 01:12:27PM +0000, ZIV-Alberto Ozalla Cantabrana wrote:
> 
> On 15/10/14 13:11, Gilles Chanteperdrix wrote:
> > On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> >> Dear colleagues,
> >>
> >> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
> >> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().
> > Hi,
> >
> > I see two possible reasons for that:
> > - either on-demand mapping of library text page
> > - or first access to the hardware register.
> >
> > In order to find which is the reason.
> >
> > Could you replace the call to rt_timer_tsc() with a call to __xn_rdtsc()
> > after including asm/xenomai/tsc.h ?
> >
> >
> Hello,
> 
> Thanks for the fast response.
> 
> Unfortunately replacing the call to rt_timer_tsc() with a call to 
> __xn_rdtsc() does not solve the problem.
> 
> Another clue is that PF value increases at the same time as MSW.
> 
> 
> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>    0  0      0          11594571   0     00500080   96.3 ROOT/0
>    0  1690   1          5339       1     00300184    1.0 Test_Task
> 
> It seems to be a first access to the hardware register.
> 
> Note: I use a call to mlockall(MCL_CURRENT|MCL_FUTURE) before creating 
> the task.
> 

I just tried on omap3, which is pretty close to beagle bone, and did
not observe the same issue. Could you try and add the following code
to your test:

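#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <asm/xenomai/syscall.h>

/* Globals used below; their types are assumed, they were not in the original snippet. */
static volatile unsigned *tsc_vaddr;
static unsigned long long (*rdtsc)(volatile unsigned *vaddr);

void tsc_init(void)
{
	struct __xn_tscinfo tscinfo;
	unsigned long phys_addr;
	unsigned page_size;
	int fd;

	/* Retrieve the tsc information, including the counter address. */
	XENOMAI_SYSCALL2(__xn_sys_arch, XENOMAI_SYSARCH_TSCINFO, &tscinfo);
	fd = open("/dev/mem", O_RDONLY | O_SYNC);
	page_size = sysconf(_SC_PAGESIZE);
	phys_addr = (unsigned long) tscinfo.counter;
	/* Map the page containing the counter, then add the in-page offset. */
	tsc_vaddr = (volatile unsigned *)((char *)mmap(NULL, page_size,
		    PROT_READ, MAP_SHARED,
		    fd, phys_addr & ~(page_size - 1))
		+ (phys_addr & (page_size - 1)));
	close(fd);

	/* Locate the tsc read routine from the value stored at 0xffff0ffc. */
	rdtsc = (typeof(rdtsc))(0xffff1004 -
				((*(unsigned *)(0xffff0ffc) + 3) << 5));
}

static inline unsigned long long my_rdtsc(void)
{
	return rdtsc(tsc_vaddr);
}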

Call tsc_init at the beginning of your program, then replace the
first call to rt_timer_tsc() with a call to my_rdtsc(), and see if
you still get the page fault?
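
The mmap() call in tsc_init() splits the counter's physical address in two, because mmap() only accepts page-aligned offsets: the page base is passed to mmap(), and the remainder is added back to the returned pointer. A small sketch of that arithmetic follows; the helper names are hypothetical and the address in the note below is chosen only for illustration.

```c
/* Hypothetical helpers illustrating the address split done in tsc_init(). */

/* Start of the page containing addr; page_size must be a power of two. */
static unsigned long page_base(unsigned long addr, unsigned long page_size)
{
	return addr & ~(page_size - 1);
}

/* Offset of addr within its page. */
static unsigned long page_offset(unsigned long addr, unsigned long page_size)
{
	return addr & (page_size - 1);
}
```

With 4096-byte pages, an address such as 0x4804003c splits into a base of 0x48040000 and an offset of 0x3c, and the two parts always add back up to the original address.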

-- 
					    Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 10:59 ` [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode ZIV-Alberto Ozalla Cantabrana
  2014-10-15 11:11   ` Gilles Chanteperdrix
  2014-10-15 13:19   ` Gilles Chanteperdrix
@ 2014-10-16  7:16   ` Gilles Chanteperdrix
  2014-10-16  8:16     ` Gilles Chanteperdrix
  2 siblings, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-16  7:16 UTC (permalink / raw)
  To: alberto.ozalla, xenomai

On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> 'CFLAGS=-Os -pipe -D_REENTRANT -fomit-frame-pointer -march=armv7-a
> -mtune=cortex-a8  -mfpu=vfpv3-d16 -finline-functions
> -finline-limit=50 -ffast-math -include
> /home/aozalla/pro/npcp/toolchain/include/builtin_redefines.h
> -DMAX_PCI_SLOTS=8 -fPIC -DPIC -DNO_LARGEFILE_SOURCE
> -U_LARGEFILE_SOURCE -U_LARGE_FILES -U_FILE_OFFSET_BITS'

As a side note, if you have a modern toolchain which compiles for
thumb2 by default, passing -march=armv7-a disables compilation for
thumb, and in my experience thumb2 code results in better latencies
than ARM code. Also, -Os -finline-functions is a bit strange: the
advantage of -Os is that it inlines fewer functions, so this looks like
a roundabout way to ask for -O2. Note also that, according to my tests,
you get better latencies on ARM with -Os than with -O2. Finally, I do
not know what builtin_redefines.h does, but beware: xenomai now uses
gcc builtin functions for atomic instructions by default.

-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16  7:16   ` Gilles Chanteperdrix
@ 2014-10-16  8:16     ` Gilles Chanteperdrix
  2014-10-16  8:33       ` ZIV-Alberto Ozalla Cantabrana
  2014-10-16 18:17       ` Lennart Sorensen
  0 siblings, 2 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-16  8:16 UTC (permalink / raw)
  To: alberto.ozalla, xenomai

On 10/16/2014 09:16 AM, Gilles Chanteperdrix wrote:
> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>> 'CFLAGS=-Os -pipe -D_REENTRANT -fomit-frame-pointer -march=armv7-a
>> -mtune=cortex-a8  -mfpu=vfpv3-d16 -finline-functions
>> -finline-limit=50 -ffast-math -include
>> /home/aozalla/pro/npcp/toolchain/include/builtin_redefines.h
>> -DMAX_PCI_SLOTS=8 -fPIC -DPIC -DNO_LARGEFILE_SOURCE
>> -U_LARGEFILE_SOURCE -U_LARGE_FILES -U_FILE_OFFSET_BITS'
> 
> As a side note, if you have a modern toolchain which compiles by default
> for thumb2, by passing -march=armv7-a you disable compilation for thumb,
> and from my experience thumb2 code results in better latencies than ARM
> code. Also -Os -finline-functions is a bit strange: the advantage of
> using -Os is that it inline less functions, so it looks like a strange
> way to ask for -O2. Also note that according to my tests, you get better
> latencies on ARM with -Os than with -O2. Finally, I do not know what
> builtin_redefines.h does but beware: xenomai uses gcc builtin functions
> for atomic instructions by default now.
> 

Also, vfpv3-d16 looks suspicious to me: you may be uselessly limiting
the number of vfp registers available to the compiler. If cat
/proc/cpuinfo tells you vfpv3 and not vfpv3-d16, you can use
-mfpu=vfpv3, if it tells you neon, you can use -mfpu=neon. I am almost
sure the beaglebone processor has 32 double vfp registers, not 16.

This page:
http://www.ti.com/product/AM3358/datasheet

Seems to imply that the beaglebone processor has NEON.

And this page:
http://en.wikipedia.org/wiki/ARM_architecture#Floating-point_.28VFP.29

Seems to imply that since it is a cortex a8, it has 32 double vfp
registers and not 16.

It would seem among arm processors of the "A" family and the armv7
architecture, only cortex A5 and cortex A7 may have 16 double vfp
registers (the atmel sama5d3 for instance is in this case).
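
The /proc/cpuinfo check described above can be sketched as a small helper that maps the "Features" list to a -mfpu value. The function names are hypothetical. As far as I know the kernel spells the 16-register variant "vfpv3d16" in that list, so the sketch accepts both that spelling and "vfpv3-d16"; treat that as an assumption to verify on the target.

```c
#include <string.h>

/* Return 1 if the space-separated list `features` contains the token `name`. */
static int has_feature(const char *features, const char *name)
{
	size_t len = strlen(name);
	const char *p = features;

	while ((p = strstr(p, name)) != NULL) {
		int starts = (p == features) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;
	}
	return 0;
}

/* Hypothetical helper: suggest a -mfpu value from a cpuinfo "Features" list. */
static const char *suggest_mfpu(const char *features)
{
	if (has_feature(features, "neon"))
		return "neon";
	/* Checked before plain vfpv3, since both may be listed together. */
	if (has_feature(features, "vfpv3d16") || has_feature(features, "vfpv3-d16"))
		return "vfpv3-d16";
	if (has_feature(features, "vfpv3"))
		return "vfpv3";
	return NULL;	/* nothing recognised: keep the toolchain default */
}
```

Matching whole tokens matters here: a plain substring search for "vfpv3" would also hit "vfpv3d16" and suggest the wrong register count.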

-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16  8:16     ` Gilles Chanteperdrix
@ 2014-10-16  8:33       ` ZIV-Alberto Ozalla Cantabrana
  2014-10-16  8:39         ` Gilles Chanteperdrix
  2014-10-16 18:17       ` Lennart Sorensen
  1 sibling, 1 reply; 23+ messages in thread
From: ZIV-Alberto Ozalla Cantabrana @ 2014-10-16  8:33 UTC (permalink / raw)
  To: gilles.chanteperdrix; +Cc: xenomai


On 16/10/14 10:16, Gilles Chanteperdrix wrote:
> On 10/16/2014 09:16 AM, Gilles Chanteperdrix wrote:
>> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>>> 'CFLAGS=-Os -pipe -D_REENTRANT -fomit-frame-pointer -march=armv7-a
>>> -mtune=cortex-a8  -mfpu=vfpv3-d16 -finline-functions
>>> -finline-limit=50 -ffast-math -include
>>> /home/aozalla/pro/npcp/toolchain/include/builtin_redefines.h
>>> -DMAX_PCI_SLOTS=8 -fPIC -DPIC -DNO_LARGEFILE_SOURCE
>>> -U_LARGEFILE_SOURCE -U_LARGE_FILES -U_FILE_OFFSET_BITS'
>>
>> As a side note, if you have a modern toolchain which compiles by default
>> for thumb2, by passing -march=armv7-a you disable compilation for thumb,
>> and from my experience thumb2 code results in better latencies than ARM
>> code. Also -Os -finline-functions is a bit strange: the advantage of
>> using -Os is that it inline less functions, so it looks like a strange
>> way to ask for -O2. Also note that according to my tests, you get better
>> latencies on ARM with -Os than with -O2. Finally, I do not know what
>> builtin_redefines.h does but beware: xenomai uses gcc builtin functions
>> for atomic instructions by default now.
>>
> Also, vfpv3-d16 looks suspicious to me: you may be uselessly limiting
> the number of vfp registers available to the compiler. If cat
> /proc/cpuinfo tells you vfpv3 and not vfpv3-d16, you can use
> -mfpu=vfpv3, if it tells you neon, you can use -mfpu=neon. I am almost
> sure the beaglebone processor has 32 double vfp registers, not 16.
>
> This page:
> http://www.ti.com/product/AM3358/datasheet
>
> Seems to imply that the beaglebone processor has NEON.
>
> And this page:
> http://en.wikipedia.org/wiki/ARM_architecture#Floating-point_.28VFP.29
>
> Seems to imply that since it is a cortex a8, it has 32 double vfp
> registers and not 16.
>
> It would seem among arm processors of the "A" family and the armv7
> architecture, only cortex A5 and cortex A7 may have 16 double vfp
> registers (the atmel sama5d3 for instance is in this case).



Thanks for your support.

We moved one step ahead:

  *   Xenomai updated from 2.6.3.git to 2.6.4
  *   Linux updated from 3.14.9 to 3.14.17

The problem has gone away, I hope for good.

Anyway, this evening I will restore the former setup (the former versions of Xenomai and Linux) and try the test you suggested (using tsc_init and my_rdtsc).


--
Saludos,
Alberto Ozalla

CG DISCLAIMER: This email contains confidential information. It is intended exclusively for the addressees. If you are not an addressee, you must not store, transmit or disclose its contents. Instead please notify the sender immediately; and permanently delete this e-mail from your computer systems. We have taken reasonable precautions to ensure that no viruses are present. However, you must check this email and the attachments, for viruses. We accept no liability whatsoever, for any detriment caused by any transmitted virus.


* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16  8:33       ` ZIV-Alberto Ozalla Cantabrana
@ 2014-10-16  8:39         ` Gilles Chanteperdrix
  0 siblings, 0 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-16  8:39 UTC (permalink / raw)
  To: alberto.ozalla; +Cc: xenomai

On 10/16/2014 10:33 AM, ZIV-Alberto Ozalla Cantabrana wrote:
> We moved one step ahead:
> 
> *   Xenomai updated from 2.6.3.git to 2.6.4 *   Linux updated from
> 3.14.9 to 3.14.17
> 
> The problem has gone away. And I hope for good.
> 
> Anyway, during this evening I will recover former situation (former
> versions of Xenomai and Linux) and try the test you pointed out
> (using tsc_init and my_rdtsc).

Xenomai 2.6.3 does not support Linux 3.14. To get things (context
switches) to work properly on ARM, you need at least this patch:
http://git.xenomai.org/xenomai-2.6.git/commit/?id=41cb1f73814d1094e0ea75ccbbd23ff01280787e

-- 
                                                                Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16  8:16     ` Gilles Chanteperdrix
  2014-10-16  8:33       ` ZIV-Alberto Ozalla Cantabrana
@ 2014-10-16 18:17       ` Lennart Sorensen
  2014-10-16 18:58         ` Gilles Chanteperdrix
  1 sibling, 1 reply; 23+ messages in thread
From: Lennart Sorensen @ 2014-10-16 18:17 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

On Thu, Oct 16, 2014 at 10:16:18AM +0200, Gilles Chanteperdrix wrote:
> Also, vfpv3-d16 looks suspicious to me: you may be uselessly limiting
> the number of vfp registers available to the compiler. If cat
> /proc/cpuinfo tells you vfpv3 and not vfpv3-d16, you can use
> -mfpu=vfpv3, if it tells you neon, you can use -mfpu=neon. I am almost
> sure the beaglebone processor has 32 double vfp registers, not 16.
> 
> This page:
> http://www.ti.com/product/AM3358/datasheet
> 
> Seems to imply that the beaglebone processor has NEON.

Neon is mandatory on Cortex-A8.  Only Cortex-A9 made it optional
(unfortunately) and I believe nvidia was pretty much the only one to
not include Neon.  Of course non Cortex designs don't have it either
(Marvell JP4 cores for example), but those are pretty rare.  I have no
idea if qualcomm's designs have it or not.

> And this page:
> http://en.wikipedia.org/wiki/ARM_architecture#Floating-point_.28VFP.29
> 
> Seems to imply that since it is a cortex a8, it has 32 double vfp
> registers and not 16.
> 
> It would seem among arm processors of the "A" family and the armv7
> architecture, only cortex A5 and cortex A7 may have 16 double vfp
> registers (the atmel sama5d3 for instance is in this case).

-- 
Len Sorensen



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16 18:17       ` Lennart Sorensen
@ 2014-10-16 18:58         ` Gilles Chanteperdrix
  2014-10-16 20:56           ` Lennart Sorensen
  0 siblings, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-16 18:58 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: xenomai

On Thu, Oct 16, 2014 at 02:17:11PM -0400, Lennart Sorensen wrote:
> On Thu, Oct 16, 2014 at 10:16:18AM +0200, Gilles Chanteperdrix wrote:
> > Also, vfpv3-d16 looks suspicious to me: you may be uselessly limiting
> > the number of vfp registers available to the compiler. If cat
> > /proc/cpuinfo tells you vfpv3 and not vfpv3-d16, you can use
> > -mfpu=vfpv3, if it tells you neon, you can use -mfpu=neon. I am almost
> > sure the beaglebone processor has 32 double vfp registers, not 16.
> > 
> > This page:
> > http://www.ti.com/product/AM3358/datasheet
> > 
> > Seems to imply that the beaglebone processor has NEON.
> 
> Neon is mandatory on Cortex-A8.  Only Cortex-A9 made it optional
> (unfortunately) and I believe nvidia was pretty much the only one to
> not include Neon.  Of course non Cortex designs don't have it either
> (Marvell JP4 cores for example), but those are pretty rare.  I have no
> idea if qualcomm's designs have it or not.

Well, I toyed with NEON a bit, and was surprised to discover that it
has no instruction to accelerate 64-bit math despite the fact that
it has 128-bit vectors, which for me means that you cannot do any
serious math with it. After implementing a routine to average pixels
from a Bayer pattern on Cortex-A8 (where I could use NEON), I got a
gain of a factor of 2 or 3, far from what could have been expected
from processing 16 pixels at once, and I got a bigger gain by
inserting the non-NEON "pld" instruction at key points (which I could
do in the non-NEON code as well). I also do not really understand how
NEON accelerates memcpy: why is a NEON multiple-register load/store
faster than ldm/stm? Is it not a problem in ldm/stm rather than a
virtue of NEON?

All this to say, is NEON that useful?

-- 
					    Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16 18:58         ` Gilles Chanteperdrix
@ 2014-10-16 20:56           ` Lennart Sorensen
  2014-10-16 23:14             ` Tom Evans
  0 siblings, 1 reply; 23+ messages in thread
From: Lennart Sorensen @ 2014-10-16 20:56 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

On Thu, Oct 16, 2014 at 08:58:19PM +0200, Gilles Chanteperdrix wrote:
> Well, I toyed with NEON a bit, and was surprised to discover that it
> has no instruction to accelerate 64 bits maths despite the fact that
> it has 128bits vectors, which for me means that you can not do any
> serious math with it. After implementing a routine to average pixels
> from a bayer pattern on cortex A8 (where I could use NEON) I got a
> factor gain of 2 or 3, far from what could have been expected from
> processing 16 pixels at once, and I got a biggest gain by inserting
> the non-NEON "pld" instruction at key points (which I could do in
> the non NEON code as well). I also do not really understand how NEON
> accelerates memcpy, why is a NEON multiple registers load/store
> faster than ldm/stm, is not it a problem in ldm/stm rather than a
> virtue of NEON?

Not sure why it is faster on some designs.

> All this to say, is NEON that useful?

Well the vfp on the Cortex-A8 is rather slow, so neon is usually much
faster. On the A9 and A15, however, the vfp is much faster, and neon
often brings no gain at all, running at the same speed.

-- 
Len Sorensen



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16 20:56           ` Lennart Sorensen
@ 2014-10-16 23:14             ` Tom Evans
  2014-10-17  5:34               ` Gilles Chanteperdrix
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Evans @ 2014-10-16 23:14 UTC (permalink / raw)
  To: xenomai

On 17/10/14 07:56, Lennart Sorensen wrote:
> On Thu, Oct 16, 2014 at 08:58:19PM +0200, Gilles Chanteperdrix wrote:
>> ... After implementing a routine to average pixels
>> from a bayer pattern on cortex A8 (where I could use NEON) I got a
>> factor gain of 2 or 3, far from what could have been expected from
>> processing 16 pixels at once,

How big is your data-set? You are probably breaking the L2 cache.

Work out how many pixels per second you're processing and then compare it to 
the memory bandwidth. You may be surprised at how slow the memory system is.

Download, compile and run this program:

http://www.cwi.nl/~manegold/Calibrator/

   root@triton1:/tmp# nice --20 ./calibrator 800 1700k report

   caches:
   level  size    linesize   miss-latency        replace-time
     1     32 KB  128 bytes   12.70 ns =  10 cy   13.40 ns =  11 cy
     2    256 KB   64 bytes  191.21 ns = 153 cy  194.37 ns = 155 cy

   TLBs:
   level #entries  pagesize  miss-latency
     1       32       4 KB    57.65 ns =  46 cy

Miss L1 and wait 10 clocks. Miss L2 and wait 153 clocks! Step through memory 
4k at a time and wait 46 clocks for the TLB to reload.

>> and I got a biggest gain by inserting the non-NEON "pld"
>> instruction at key points (which I could do in the non NEON
>> code as well).

With a 153-clock latency on an L2 miss, PLD will have a large effect if you 
can get them in early enough. You should preload multiple cache lines ahead 
and not just a few words.
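
As an illustration of preloading several cache lines ahead, here is a minimal C sketch. It relies on the GCC/Clang __builtin_prefetch() builtin (which normally compiles to PLD on ARM); the function name, the 64-byte line size and the distance of 4 lines are assumptions to be tuned by measurement, not values from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64   /* assumed line size in bytes */
#define LINES_AHEAD 4    /* assumed prefetch distance, tune on the target */

/* Sum a byte buffer, issuing one prefetch per cache line a few lines ahead. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    const size_t ahead = (size_t)LINES_AHEAD * CACHE_LINE;
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        /* Only prefetch addresses that are still inside the buffer. */
        if (i % CACHE_LINE == 0 && i + ahead < len)
            __builtin_prefetch(buf + i + ahead, 0 /* read */, 0 /* streaming */);
        sum += buf[i];
    }
    return sum;
}
```

The right distance depends on how much work is done per line; too short and the data is not there yet, too long and it may already have been evicted.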

>> I also do not really understand how NEON accelerates memcpy,
>> why is a NEON multiple registers load/store faster than
>> ldm/stm, is not it a problem in ldm/stm rather than a
>> virtue of NEON?

The following should be a good reference, but doesn't answer this question. It 
says there is no difference, but that's not what we're seeing.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/kihAsZfdS5wTMO.html

The faster Neon copy indicates a problem with the ARM architecture itself. 
Whenever the ARM CPU performs a memcpy(), the sequence is (read(src); 
read(dst); write(dst)). The cache design means that the destination cache line 
is READ before being written, so the memcpy() speed is 1/3 of the basic memory 
speed.

The PPC architecture provides DCBZ and friends. During a memcpy() you perform 
a DCBZ on the destination which is a "promise" to the CPU that you're going to 
write the entire cache line so it doesn't have to be read first.

Neon performs the operations a cache line at a time and gets rid of the 
redundant read operation, so it runs faster by 3/2. The previous link implies 
this might require the correct CPU configuration (Neon bypassing L1).

>> All this to say, is NEON that useful?

We're performing alpha blending with 32-bit pixels and our Neon code is able 
to do that at the same speed as a CPU-driven memcpy(). It is a lot faster than 
my poor attempts at alpha-blending 4 bytes per pixel in C. Our Neon memcpy() 
(copying 800x480 32-bit pixels at 20Hz to /dev/fb0) is 50% faster than the 
alternative.

I also had an instance when performing an affine transform (rotation) where 
the speed dropped to 29% at a specific rotation angle where a 300-pixel "walk" 
through memory aliased in the caches. That was fixed by performing the 
transform in 48 by 48 pixel "tiles".
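
A minimal sketch of the tiling idea in C, using a fixed 90-degree clockwise rotation as a stand-in for the general affine transform (the real code is not shown in this thread). The function name and the row-major 32-bit pixel layout are assumptions; only the 48-pixel tile size comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 48   /* tile edge in pixels */

/*
 * Rotate a w x h row-major image 90 degrees clockwise into dst (h x w),
 * visiting the source one TILE x TILE block at a time so that the strided
 * destination accesses stay within a small, cache-friendly region.
 */
void rotate90_tiled(const uint32_t *src, uint32_t *dst, size_t w, size_t h)
{
    size_t tx, ty, x, y;

    assert(src != NULL && dst != NULL);
    for (ty = 0; ty < h; ty += TILE) {
        for (tx = 0; tx < w; tx += TILE) {
            size_t y_end = ty + TILE < h ? ty + TILE : h;
            size_t x_end = tx + TILE < w ? tx + TILE : w;

            for (y = ty; y < y_end; y++)
                for (x = tx; x < x_end; x++)
                    dst[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```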

> Well the vfp on the Cortex-A8 is rather slow,

10 CPU clocks per instruction instead of 1 clock on the other chips.

Tom




* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-16 23:14             ` Tom Evans
@ 2014-10-17  5:34               ` Gilles Chanteperdrix
  2014-10-17  6:47                 ` Tom Evans
  0 siblings, 1 reply; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-17  5:34 UTC (permalink / raw)
  To: Tom Evans; +Cc: xenomai

On Fri, Oct 17, 2014 at 10:14:44AM +1100, Tom Evans wrote:
> On 17/10/14 07:56, Lennart Sorensen wrote:
> >On Thu, Oct 16, 2014 at 08:58:19PM +0200, Gilles Chanteperdrix wrote:
> >>... After implementing a routine to average pixels
> >>from a bayer pattern on cortex A8 (where I could use NEON) I got a
> >>factor gain of 2 or 3, far from what could have been expected from
> >>processing 16 pixels at once,
> 
> How big is your data-set? You are probably breaking the L2 cache.

1600x1200, I was definitely breaking the L2 cache (hence the fact
that pld improves things).

> 
> Work out how many pixels per second you're processing and then
> compare it to the memory bandwidth. You may be surprised at how slow
> the memory system is.

The memory was a DDR3 running at 533/1066 MHz. I would not call that
slow. Given the fact that:
- there were two interleaved banks
- each bank processes 2 bytes at every half tick
that would be 4 Gbytes/sec.

Since the processor was running at 1GHz too, if it had been limited
by memory, it should have been able to process 2 pixels every
processor tick (2 reads, 2 writes), that is process the whole image
in 960us. The process took milliseconds, so, I would say the memory
definitely was not the limit. I do not think latency was an issue
either, because the memory was accessed sequentially.
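
The estimate above can be written out as a small C sketch. The function name is hypothetical, and it assumes one byte per pixel with one read and one write per pixel (2 bytes of traffic), which is what "2 pixels every processor tick (2 reads, 2 writes)" at 4 Gbytes/sec and 1 GHz implies.

```c
#include <assert.h>

/*
 * Lower bound, in microseconds, on the time needed to move an image
 * through memory when the memory bandwidth is the only limit.
 */
double min_transfer_us(unsigned long pixels, double bytes_per_pixel,
                       double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return (double)pixels * bytes_per_pixel / bytes_per_sec * 1e6;
}
```

With 1600 * 1200 pixels, 2 bytes of traffic per pixel and 4e9 bytes/sec, this gives the 960 us quoted above.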

An FPGA acting as master on the PCI bus had absolutely no problem
DMAing the 1600x1200 pixels at 60 fps.

In my case, the NEON code was written to process two quads per
instruction, that would be 32 pixels at once. After having written
the NEON code, I rewrote the plain C version to work with 32 bits
integer registers, and process 4 pixels at once, and to use
pld. In the end, the NEON version was only performing twice as fast
as the plain C version, whereas it was processing 8 times the number
of pixels at each instruction.

> 
> Download, compile and run this program:
> 
> http://www.cwi.nl/~manegold/Calibrator/
> 
>   root@triton1:/tmp# nice --20 ./calibrator 800 1700k report
> 
>   caches:
>   level  size    linesize   miss-latency        replace-time
>     1     32 KB  128 bytes   12.70 ns =  10 cy   13.40 ns =  11 cy
>     2    256 KB   64 bytes  191.21 ns = 153 cy  194.37 ns = 155 cy
> 
>   TLBs:
>   level #entries  pagesize  miss-latency
>     1       32       4 KB    57.65 ns =  46 cy
> 
> Miss L1 and wait 10 clocks. Miss L2 and wait 153 clocks! Step
> through memory 4k at a time and wait 46 clocks for the TLB to
> reload.

That does not prove that the memory system is slow, that proves that
the processor access to memory is slow. But why is that?

> 
> >> and I got a biggest gain by inserting the non-NEON "pld"
> >> instruction at key points (which I could do in the non NEON
> >> code as well).
> 
> With a 153-clock latency on an L2 miss, PLD will have a large effect
> if you can get them in early enough. You should preload multiple
> cache lines ahead and not just a few words.

Yes, I adjusted the parameters of preload (how many iterations
ahead) and preloaded all the data I needed. In my case, the best
place to put the pld was right before the first vld, I guess because
pld was able to do its job during the vld stall.

> 
> >>I also do not really understand how NEON accelerates memcpy,
> >> why is a NEON multiple registers load/store faster than
> >> ldm/stm, is not it a problem in ldm/stm rather than a
> >>virtue of NEON?
> 
> The following should be a good reference, but doesn't answer this
> question. It says there is no difference, but that's not what we're
> seeing.
> 
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/kihAsZfdS5wTMO.html
> 
> The faster Neon copy indicates a problem with the ARM architecture
> itself. Whenever the ARM CPU performs a memcpy(), the sequence is
> (read(src); read(dst); write(dst)). The cache design means that the
> destination cache line is READ before being written, so the memcpy()
> speed is 1/3 of the basic memory speed.

Ah, thanks for the explanation, I had found this page and was rather
puzzled by this result.

> 
> The PPC architecture provides DCBZ and friends. During a memcpy()
> you perform a DCBZ on the destination which is a "promise" to the
> CPU that you're going to write the entire cache line so it doesn't
> have to be read first.
> 
> Neon performs the operations a cache line at a time and gets rid of
> the redundant read operation, so it runs faster by 3/2. The previous
> link implies this might require the correct CPU configuration (Neon
> bypassing L1).
> 
> >>All this to say, is NEON that useful?
> 
> We're performing alpha blending with 32-bit pixels and our Neon code
> is able to do that at the same speed as a CPU-driven memcpy(). It is
> a lot faster than my poor attempts at alpha-blending 4 bytes per
> pixel in C. Our Neon memcpy() (copying 800x480 32-bit pixels at 20Hz
> to /dev/fb0) is 50% faster than the alternative.

I am sorry, I do not want to criticize your work, only to question the
power of NEON: do you really find this impressive? People want to handle
2M-pixel images at 60 Hz now, and soon 4K. If you look at x264
performance for instance:

http://x264dev.multimedia.cx/archives/142

They announce that they can encode CIF resolution with very low
quality (ultrafast setting) at 30 fps with NEON on cortex A8. Once
again, I do not want to criticize people's work, only the hardware:
common x86 hardware can encode several 1080p30 streams concurrently
with a normal quality.

-- 
					    Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17  5:34               ` Gilles Chanteperdrix
@ 2014-10-17  6:47                 ` Tom Evans
  2014-10-17  7:02                   ` Gilles Chanteperdrix
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Evans @ 2014-10-17  6:47 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

On 17/10/14 16:34, Gilles Chanteperdrix wrote:
> On Fri, Oct 17, 2014 at 10:14:44AM +1100, Tom Evans wrote:

I think we're way off topic here. Should we stop?

>> Work out how many pixels per second you're processing and then
>> compare it to the memory bandwidth. You may be surprised at how slow
>> the memory system is.
>
> The memory was a DDR3 running at 533/1066 MHZ. I would not call that
> slow. Given the fact that:
> - there were two interleaved banks
> - each bank processes 2 bytes at every half tick
> that would be 4 Gbytes/sec.

That has to be slow. Measure your memcpy() speed and see how many MBytes/sec 
you're getting.
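
A minimal sketch of such a measurement in C, assuming a POSIX system with clock_gettime() and a GCC/Clang compiler (the empty asm statement is a compiler barrier). The function name is hypothetical, and the result depends heavily on the buffer size relative to the caches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Copy `len` bytes `reps` times and return the rate in MBytes/sec
 * (1 MByte = 1e6 bytes), or -1.0 on failure.
 */
double memcpy_mb_per_sec(size_t len, unsigned reps)
{
    struct timespec t0, t1;
    unsigned char *src = malloc(len);
    unsigned char *dst = malloc(len);
    double secs, rate = -1.0;
    unsigned i;

    if (src && dst && len && reps) {
        /* Touch every page first so page faults stay out of the timing. */
        memset(src, 0x5a, len);
        memset(dst, 0, len);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < reps; i++) {
            memcpy(dst, src, len);
            /* Barrier so the compiler cannot drop or merge the copies. */
            __asm__ volatile("" : : "r"(dst) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs > 0.0)
            rate = (double)len * reps / secs / 1e6;
    }
    free(src);
    free(dst);
    return rate;
}
```

Use a `len` several times larger than the L2 cache to measure the memory system rather than the cache.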

You're working your way through memory, possibly linearly, which SHOULD keep 
the memory pages open (and give you some speed), but can't.

What is happening at "the code level" for a memcpy() is:

1 - Read a word or 16 into the CPU from one address,
2 - Write them out to another address,
3 - Repeat until done.

What is actually happening is:

1 - Read a word or 16 into the CPU from one address,
1a - Pick a RANDOM cache line to evict to make room,
1b - Write the data from that cache line to memory,
1c - Whoops, wrong DDR3 page, close the previous page and open THAT one,
1d - Read the data into that cache line.
1e - Whoops, wrong DDR3 page, close the previous page and open THAT one,
1f - Read from the cache into the CPU,
2 - Write them out to another address,
2a - Pick a RANDOM cache line to evict to make room
2b - Whoops, wrong DDR3 page, close the page and open THAT one,
2c - Write the data from that cache line to memory,
2d - Read the data into that cache line.
2e - Whoops, wrong DDR3 page, close the previous page and open THAT one,
2f - Write from the CPU into that cache line
3 - Repeat until done.

The DDR3 can't stay on the same page and that slows it down. Opening the new 
page takes hugely longer than the double-clocked burst transfer.

Using Neon gets rid of one redundant read, but the writes still have to evict 
cache lines.

It might be better to FLUSH the entire cache, perform a L2-sized transfer and 
then flush it again. The flushes *might* be to linear addresses in open pages.

Otherwise it might be worth burst-reading to static RAM inside the CPU and 
then burst-writing that, again possibly with full (or specific) cache flushes.

I got my fastest memcpy() speed on an MCF5329 by reading 2k to the stack (in 
static ram in the CPU) and then writing that back out. Copying twice was a LOT 
faster than any other method.
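
A minimal sketch of that two-step copy in C. The function name is hypothetical; the 2k buffer size is the one mentioned above. Whether it helps depends entirely on the stack living in fast on-chip RAM, as it did on the MCF5329; on a system where the stack is in ordinary cached DRAM this is just a slower memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_BYTES 2048   /* size of the on-stack staging buffer */

/* Copy through a small stack buffer: a burst of reads, then a burst of writes. */
void memcpy_bounce(void *dst, const void *src, size_t len)
{
    unsigned char bounce[BOUNCE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len == 0 || (dst != NULL && src != NULL));
    while (len > 0) {
        size_t chunk = len < BOUNCE_BYTES ? len : BOUNCE_BYTES;

        memcpy(bounce, s, chunk);   /* external RAM -> internal buffer */
        memcpy(d, bounce, chunk);   /* internal buffer -> external RAM */
        s += chunk;
        d += chunk;
        len -= chunk;
    }
}
```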

Tom





* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17  6:47                 ` Tom Evans
@ 2014-10-17  7:02                   ` Gilles Chanteperdrix
  2014-10-17 14:08                     ` Tom Evans
  2014-10-17 14:32                     ` Anders Blomdell
  0 siblings, 2 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-17  7:02 UTC (permalink / raw)
  To: Tom Evans; +Cc: xenomai

On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
> On 17/10/14 16:34, Gilles Chanteperdrix wrote:
> >On Fri, Oct 17, 2014 at 10:14:44AM +1100, Tom Evans wrote:
> 
> I think we're way off topic here. Should we stop?

The Xenomai mailing list is low traffic; I believe we can have such
an interesting digression from time to time.

> 
> >>Work out how many pixels per second you're processing and then
> >>compare it to the memory bandwidth. You may be surprised at how slow
> >>the memory system is.
> >
> >The memory was a DDR3 running at 533/1066 MHZ. I would not call that
> >slow. Given the fact that:
> >- there were two interleaved banks
> >- each bank processes 2 bytes at every half tick
> >that would be 4 Gbytes/sec.
> 
> That has to be slow. Measure your memcpy() speed and see how many
> MBytes/sec you're getting.
> 
> You're working your way through memory, possibly linearly, which
> SHOULD keep the memory pages open (and give you some speed), but
> can't.
> 
> What is happening at "the code level" for a memcpy()is:
> 
> 1 - Read a word or 16 into the CPU from one address,
> 2 - Write them out to another address,
> 3 - Repeat until done.
> 
> What is happening is:
> 
> 1 - Read a word or 16 into the CPU from one address,
> 1a - Pick a RANDOM cache line to evict to make room,
> 1b - Write the data from that cache line to memory,
> 1c - Whoops, wrong DDR3 page, close the previous page and open THAT one,
> 1d - Read the data into that cache line.
> 1e - Whoops, wrong DDR3 page, close the previous page and open THAT one,
> 1f - Read from the cache into the CPU,
> 2 - Write them out to another address,
> 2a - Pick a RANDOM cache line to evict to make room
> 2b - Whoops, wrong DDR3 page, close the page and open THAT one,
> 2c - Write the data from that cache line to memory,
> 2d - Read the data into that cache line.
> 2e - Whoops, wrong DDR3 page, close the previous page and open THAT one,
> 2f - Write from the CPU into that cache line
> 3 - Repeat until done.
> 
> The DDR3 can't keep on the same page and that slows it down. Opening
> the new page takes hugely longer than the double-clocked burst
> transfer.
> 
> Using Neon gets rid of one redundant read, but the writes still have
> to evict cache lines.
> 
> It might be better to FLUSH the entire cache, perform a L2-sized
> transfer and then flush it again. The flushes *might* be to linear
> addresses in open pages.
> 
> Otherwise it might be worth burst-reading to static RAM inside the
> CPU and then burst-writing that, again possibly with full (or
> specific) cache flushes.
> 
> I got my fastest memcpy() speed on an MCF5329 by reading 2k to the
> stack (in static ram in the CPU) and then writing that back out.
> Copying twice was a LOT faster than any other method.


Actually, I am wrong, I was only reading the image, not writing to
it, simply computing a very reduced averaged image, so there were
some writes from time to time, but very rarely. I agree that you can
not assume that you read or write at full speed, and that the DDR
controller has to change page from time to time, but that should be
negligible if you do not alternate reads and writes.

But I see your point, the problem is not NEON, but the way the
processor handles memory and cache.

-- 
					    Gilles.



* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17  7:02                   ` Gilles Chanteperdrix
@ 2014-10-17 14:08                     ` Tom Evans
  2014-10-17 19:36                       ` Gilles Chanteperdrix
  2014-10-17 14:32                     ` Anders Blomdell
  1 sibling, 1 reply; 23+ messages in thread
From: Tom Evans @ 2014-10-17 14:08 UTC (permalink / raw)
  To: Gilles Chanteperdrix; +Cc: xenomai

On 17/10/2014 6:02 PM, Gilles Chanteperdrix wrote:
> On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
>> On 17/10/14 16:34, Gilles Chanteperdrix wrote:
>>>> Work out how many pixels per second you're processing and then
>>>> compare it to the memory bandwidth.

That would still be an interesting number to measure and quote.

>> It might be better to FLUSH the entire cache, perform a L2-sized
>> transfer and then flush it again. The flushes *might* be to linear
>> addresses in open pages.

Thinking more about this, it would be better to flush the entire cache, 
then perform a preload-and-read pass (to load the cache from complete 
open rows in on-page RAM), then loop reading and writing (from cache to 
cache), and finally loop flushing the destination cache lines into open 
rows of the RAM.

This would be easy on the PPC. It has six "User level cache 
instructions". Even the Coldfire has a "CPUSHL" user-mode instruction.

This seems to be impossible from user-space on the ARM: as far as I 
can tell, all of the 13 "Cache and branch predictor maintenance 
operations, VMSA" instructions "can be executed only by software 
executing at PL1 or higher". The only user-space ones are PLD, PLDW and 
PLI. So I'd have to write a kernel driver to copy user memory and worry 
about the page translation.

>> I got my fastest memcpy() speed on an MCF5329 by reading 2k to the
>> stack (in static ram in the CPU) and then writing that back out.
>> Copying twice was a LOT faster than any other method.

240MHz Coldfire with 80MHz 32-bit SDR memory. It started out at 33MB/s, 
got to 39MB/s by using the multiple register move instruction, and 
peaked at 55 MB/s copying via the stack in internal SRAM. Memcpy() from 
internal RAM to internal RAM managed 304MB/s!

RAM could be read at 87MB/s (due to the lack of pipelining in this CPU) 
but could be written at 207MB/s.

Function             kB/s   Memclk/cache line
=============================================
memcpy_gcc_4_4       30883  41.45
memcpy_gcc_4_3_O1    33382  38.34
memcpy_gcc_4_3_O2    33385  38.34
memcpy_gcc_2         33390  38.33

memcpy(132096)       33379  38.35
memcpy_moveml        39752  32.20
memcpy_dma           43709  29.28
memcpy_moveml_32     49618  25.80
memcpy_stack         52912  24.19
memcpy_moveml_192    54052  23.68
memcpy_moveml_48     54093  23.66
memcpy_stack_48      54997  23.27
memcpy_stack_32_mis  55079  23.24
memcpy_stack_32      55125  23.22
memcpy_stack_192     55736  22.97
memcpy_moveml_96_ps  56739  22.56
memRead_stack_32     85017  15.06
memRead_moveml_32    87141  14.69
memWrite_stack_32   196864   6.50
memWrite_moveml_32  207535   6.17
memcpy_stack_stack  304368   4.21 (12.62 CPU clocks)
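
The "Memclk/cache line" column can be reproduced from the kB/s column. The sketch below is an inference from the numbers, not something stated in the table: it assumes 1 kB = 1000 bytes, the 80MHz memory clock mentioned above, and a 16-byte cache line. The function name is hypothetical.

```c
#include <assert.h>

/* Memory clocks spent per cache line at a given transfer rate. */
double memclk_per_line(double kb_per_sec, double memclk_hz, double line_bytes)
{
    double lines_per_sec;

    assert(kb_per_sec > 0.0 && line_bytes > 0.0);
    lines_per_sec = kb_per_sec * 1000.0 / line_bytes;
    return memclk_hz / lines_per_sec;
}
```

For example, 33379 kB/s gives about 38.35 memory clocks per line, and 304368 kB/s gives about 4.21, or 12.62 CPU clocks at the 3:1 ratio between the 240MHz CPU and the 80MHz memory.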

> Actually, I am wrong, I was only reading the image,
 > not writing to it, simply computing a very reduced
 > averaged image, so there were some writes from time
 > to time, but very rarely.

That should get the best performance. That's not a common operation in 
what I'm doing which is alpha-blending graphics over each other.

 > > Miss L1 and wait 10 clocks. Miss L2 and wait 153 clocks! Step
 > > through memory 4k at a time and wait 46 clocks for the TLB to
 > > reload.
 >
 > That does not prove that the memory system is slow, that
 > proves that the processor access to memory is slow. But
 > why is that?

The memory controller may not be capable of keeping multiple banks open, 
or even rows open. It takes a long time to close an open row with the 
precharge and to then open the next one.

Don't even think about reading or writing peripheral pins. I worked on 
an ARM chip (PXA) that took 200 CPU Clocks to read or write a port 
register. It was actually recommended to program the DMA controller to 
read and write the ports and to interrupt the CPU when done!

That previously-quoted ARM FAQ on memory copying suggests the same thing 
(DMA or Preload Engine) for copying memory while the CPU goes and does 
something else.

> But I see your point, the problem is not NEON, but the
 > way the processor handles memory and cache.

The frustrating thing is the missing user-mode cache control instructions.

Tom





* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17  7:02                   ` Gilles Chanteperdrix
  2014-10-17 14:08                     ` Tom Evans
@ 2014-10-17 14:32                     ` Anders Blomdell
  1 sibling, 0 replies; 23+ messages in thread
From: Anders Blomdell @ 2014-10-17 14:32 UTC (permalink / raw)
  To: Gilles Chanteperdrix, Tom Evans; +Cc: xenomai

On 2014-10-17 09:02, Gilles Chanteperdrix wrote:
> On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
>> On 17/10/14 16:34, Gilles Chanteperdrix wrote:
>>> On Fri, Oct 17, 2014 at 10:14:44AM +1100, Tom Evans wrote:
>>
>> I think we're way off topic here. Should we stop?
> 
> the xenomai mailing list is low traffic, I believe we can have such
> an interesting digression from time to time.
Yes, please let these interesting snippets of knowledge trickle
down to us mere mortals?

/Anders

-- 
Anders Blomdell                  Email: anders.blomdell@control.lth.se
Department of Automatic Control
Lund University                  Phone:    +46 46 222 4625
P.O. Box 118                     Fax:      +46 46 138118
SE-221 00 Lund, Sweden




* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-15 20:03       ` Gilles Chanteperdrix
@ 2014-10-17 16:33         ` ZIV-Alberto Ozalla Cantabrana
  2014-10-17 16:38           ` Gilles Chanteperdrix
  0 siblings, 1 reply; 23+ messages in thread
From: ZIV-Alberto Ozalla Cantabrana @ 2014-10-17 16:33 UTC (permalink / raw)
  To: gilles.chanteperdrix; +Cc: xenomai


On 15/10/14 22:03, Gilles Chanteperdrix wrote:
> On Wed, Oct 15, 2014 at 01:12:27PM +0000, ZIV-Alberto Ozalla Cantabrana wrote:
>> On 15/10/14 13:11, Gilles Chanteperdrix wrote:
>>> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>>>> Dear colleagues,
>>>>
>>>> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
>>>> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().
>>> Hi,
>>>
>>> I see two possible reasons for that:
>>> - either on-demand mapping of library text page
>>> - or first access to the hardware register.
>>>
>>> In order to find which is the reason.
>>>
>>> Could you replace the call to rt_timer_tsc() with a call to __xn_rdtsc()
>>> after including asm/xenomai/tsc.h ?
>>>
>>>
>> Hello,
>>
>> Thanks for the fast response.
>>
>> Unfortunately replacing the call to rt_timer_tsc() with a call to
>> __xn_rdtsc() does not solve the problem.
>>
>> Another clue is that PF value increases at the same time as MSW.
>>
>>
>> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>>     0  0      0          11594571   0     00500080   96.3 ROOT/0
>>     0  1690   1          5339       1     00300184    1.0 Test_Task
>>
>> It seems to be a first access to the hardware register.
>>
>> Note: I use a call to mlockall(MCL_CURRENT|MCL_FUTURE) before creating
>> the task.
>>
> I just tried on omap3, which is pretty close to beagle bone, and did
> not observe the same issue. Could you try and add the following code
> to your test:
>
> #include <fcntl.h>
> #include <sys/mman.h>
> #include <asm/xenomai/syscall.h>
>
> void tsc_init(void)
> {
> 	struct __xn_tscinfo tscinfo;
> 	unsigned long phys_addr;
> 	unsigned page_size;
> 	int fd;
>
> 	XENOMAI_SYSCALL2(__xn_sys_arch, XENOMAI_SYSARCH_TSCINFO, &tscinfo);
> 	fd = open("/dev/mem", O_RDONLY | O_SYNC);
> 	page_size = sysconf(_SC_PAGESIZE);
> 	phys_addr = (unsigned long) tscinfo.counter;
> 	tsc_vaddr = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
> 		    fd, phys_addr & ~(page_size - 1))
> 		+ (phys_addr & (page_size - 1));
> 	close(fd);
>
> 	rdtsc = (typeof(rdtsc))(0xffff1004 -
> 				((*(unsigned *)(0xffff0ffc) + 3) << 5));
> }
>
> static inline unsigned long long my_rdtsc(void)
> {
> 	return rdtsc(tsc_vaddr);
> }
>
> Call tsc_init at the beginning of your program, then replace the
> first call to rt_timer_tsc() with a call to my_rdtsc(), and see if
> you still get the page fault?
>
Hi all,

It works fine.
There are neither mode switches, nor page faults.

I will spend the weekend thinking about it.

Thanks a lot, Gilles.

-- 
Saludos,
Alberto Ozalla



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17 16:33         ` ZIV-Alberto Ozalla Cantabrana
@ 2014-10-17 16:38           ` Gilles Chanteperdrix
  0 siblings, 0 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-17 16:38 UTC (permalink / raw)
  To: alberto.ozalla; +Cc: xenomai

On 10/17/2014 06:33 PM, ZIV-Alberto Ozalla Cantabrana wrote:
> 
> On 15/10/14 22:03, Gilles Chanteperdrix wrote:
>> On Wed, Oct 15, 2014 at 01:12:27PM +0000, ZIV-Alberto Ozalla Cantabrana wrote:
>>> On 15/10/14 13:11, Gilles Chanteperdrix wrote:
>>>> On 10/15/2014 12:59 PM, ZIV-Alberto Ozalla Cantabrana wrote:
>>>>> Dear colleagues,
>>>>>
>>>>> I face an unexpected switch to secondary mode after the very first call to rt_timer_tsc() into a real-time task created by Xenomai.
>>>>> This unexpected switch  only arises if it is the very first call to rt_timer_tsc().
>>>> Hi,
>>>>
>>>> I see two possible reasons for that:
>>>> - either on-demand mapping of library text page
>>>> - or first access to the hardware register.
>>>>
>>>> In order to find which is the reason.
>>>>
>>>> Could you replace the call to rt_timer_tsc() with a call to __xn_rdtsc()
>>>> after including asm/xenomai/tsc.h ?
>>>>
>>>>
>>> Hello,
>>>
>>> Thanks for the fast response.
>>>
>>> Unfortunately replacing the call to rt_timer_tsc() with a call to
>>> __xn_rdtsc() does not solve the problem.
>>>
>>> Another clue is that PF value increases at the same time as MSW.
>>>
>>>
>>> CPU  PID    MSW        CSW        PF    STAT       %CPU  NAME
>>>     0  0      0          11594571   0     00500080   96.3 ROOT/0
>>>     0  1690   1          5339       1     00300184    1.0 Test_Task
>>>
>>> It seems to be a first access to the hardware register.
>>>
>>> Note: I use a call to mlockall(MCL_CURRENT|MCL_FUTURE) before creating
>>> the task.
>>>
>> I just tried on omap3, which is pretty close to beagle bone, and did
>> not observe the same issue. Could you try and add the following code
>> to your test:
>>
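>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>> #include <asm/xenomai/syscall.h>
>>
>> /* Not declared in the snippet as posted; types assumed from their use. */
>> static volatile unsigned *tsc_vaddr;
>> static unsigned long long (*rdtsc)(volatile unsigned *);
>>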
>> void tsc_init(void)
>> {
>> 	struct __xn_tscinfo tscinfo;
>> 	unsigned long phys_addr;
>> 	unsigned page_size;
>> 	int fd;
>>
>> 	XENOMAI_SYSCALL2(__xn_sys_arch, XENOMAI_SYSARCH_TSCINFO, &tscinfo);
>> 	fd = open("/dev/mem", O_RDONLY | O_SYNC);
>> 	page_size = sysconf(_SC_PAGESIZE);
>> 	phys_addr = (unsigned long) tscinfo.counter;
>> 	tsc_vaddr = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
>> 		    fd, phys_addr & ~(page_size - 1))
>> 		+ (phys_addr & (page_size - 1));
>> 	close(fd);
>>
>> 	rdtsc = (typeof(rdtsc))(0xffff1004 -
>> 				((*(unsigned *)(0xffff0ffc) + 3) << 5));
>> }
>>
>> static inline unsigned long long my_rdtsc(void)
>> {
>> 	return rdtsc(tsc_vaddr);
>> }
>>
>> Call tsc_init at the beginning of your program, then replace the
>> first call to rt_timer_tsc() with a call to my_rdtsc(), and see if
>> you still get the page fault?
>>
> Hi all,
> 
> It works fine.
> There are neither mode switches, nor page faults.
> 
> I will spend the weekend thinking about it.
> 
> Thanks a lot, Gilles.
> 
It is exactly the same thing as what libxenomai does, the only
difference being that libxenomai may be mapped on demand, but this is
doubtful since libxenomai initializations are done before main runs, so
it should have been mapped already. Are you sure the kernel you are
running still has the issue when using rt_timer_tsc()?
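
For reference, the /dev/mem mapping in the test code boils down to the
following split, since mmap() only accepts a page-aligned offset. This is
a minimal sketch: split_addr() and struct page_split are names made up
for illustration, they are not part of libxenomai, and the page size is
assumed to be a power of two.

```c
#include <stdint.h>

/* Hypothetical helper, not a Xenomai API: split an address into the
 * page-aligned base that mmap() needs as its offset argument and the
 * in-page remainder to add back to the pointer mmap() returns.
 * page_size is assumed to be a power of two. */
struct page_split {
	uintptr_t base;
	uintptr_t offset;
};

static struct page_split split_addr(uintptr_t addr, uintptr_t page_size)
{
	struct page_split s;

	s.base = addr & ~(page_size - 1);   /* round down to a page boundary */
	s.offset = addr & (page_size - 1);  /* position inside that page */
	return s;
}
```

With a made-up address of 0x12345678 and 4096-byte pages, this gives a
base of 0x12345000 and an offset of 0x678; the base goes to mmap() and
the offset is added to the returned pointer.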

-- 
                                                                Gilles.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode.
  2014-10-17 14:08                     ` Tom Evans
@ 2014-10-17 19:36                       ` Gilles Chanteperdrix
  0 siblings, 0 replies; 23+ messages in thread
From: Gilles Chanteperdrix @ 2014-10-17 19:36 UTC (permalink / raw)
  To: Tom Evans; +Cc: xenomai

On Sat, Oct 18, 2014 at 01:08:57AM +1100, Tom Evans wrote:
> On 17/10/2014 6:02 PM, Gilles Chanteperdrix wrote:
> >On Fri, Oct 17, 2014 at 05:47:07PM +1100, Tom Evans wrote:
> >>On 17/10/14 16:34, Gilles Chanteperdrix wrote:
> >>>>Work out how many pixels per second you're processing and then
> >>>>compare it to the memory bandwidth.
> 
> That would still be an interesting number to measure and quote.

I no longer have access to either the hardware (a TI DM8148) or the
software, so that is going to be a bit hard.

> 
> >>It might be better to FLUSH the entire cache, perform a L2-sized
> >>transfer and then flush it again. The flushes *might* be to linear
> >>addresses in open pages.
> 
> Thinking more about this it would be better to flush the entire
> cache and then perform a preload-and-read pass (to load the cache
> from complete open rows in on-page RAM) then loop reading and
> writing (from cache to cache) and then loop flushing the destination
> cache lines into open rows of the RAM.
> 
> This would be easy on the PPC. It has six "User level cache
> instructions". Even the Coldfire has a "CPUSHL" user-mode
> instruction.
> 
> This seems to be impossible from user-space on the ARM, for as far
> as I can tell all of the 13 "Cache and branch predictor maintenance
> operations, VMSA" instructions "can be executed only by software
> executing at PL1 or higher". The only user-space ones are PLD, PLDW
> and PLI. So I'd have to write a kernel driver to copy user memory
> and worry about the page translation.
> 
> >>I got my fastest memcpy() speed on an MCF5329 by reading 2k to the
> >>stack (in static ram in the CPU) and then writing that back out.
> >>Copying twice was a LOT faster than any other method.
> 
> 240MHz Coldfire with 80MHz 32-bit SDR memory. It started out at
> 33MB/s, got to 39MB/s by using the multiple register move
> instruction, and peaked at 55 MB/s copying via the stack in internal
> SRAM. Memcpy() from internal RAM to internal RAM managed 304MB/s!
> 
> RAM could be read at 87MB/s (due to the lack of pipelining in this
> CPU) but could be written at 207MB/s.
> 
> Function             kB/s   Memclk/cache line
> =============================================
> memcpy_gcc_4_4       30883  41.45
> memcpy_gcc_4_3_O1    33382  38.34
> memcpy_gcc_4_3_O2    33385  38.34
> memcpy_gcc_2         33390  38.33
> 
> memcpy(132096)       33379  38.35
> memcpy_moveml        39752  32.20
> memcpy_dma           43709  29.28
> memcpy_moveml_32     49618  25.80
> memcpy_stack         52912  24.19
> memcpy_moveml_192    54052  23.68
> memcpy_moveml_48     54093  23.66
> memcpy_stack_48      54997  23.27
> memcpy_stack_32_mis  55079  23.24
> memcpy_stack_32      55125  23.22
> memcpy_stack_192     55736  22.97
> memcpy_moveml_96_ps  56739  22.56
> memRead_stack_32     85017  15.06
> memRead_moveml_32    87141  14.69
> memWrite_stack_32   196864   6.50
> memWrite_moveml_32  207535   6.17
> memcpy_stack_stack  304368   4.21 (12.62 CPU clocks)
> 
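
As far as I can tell, the "Memclk/cache line" column follows from the
kB/s column if one assumes 16-byte cache lines, kB = 1000 bytes and the
80MHz memory clock mentioned above. The line size and the kB convention
are assumptions on my side, they are not stated in the table. A quick
sketch of the conversion, with memclk_per_line() being a made-up name:

```c
/* Assumptions, not stated in the quoted table: 80 MHz memory clock,
 * 16-byte cache lines, and kB meaning 1000 bytes. */
#define MEM_CLOCK_HZ 80000000.0
#define LINE_BYTES   16.0

/* Memory clock cycles spent per cache line at a given throughput. */
static double memclk_per_line(double kbytes_per_s)
{
	double lines_per_s = kbytes_per_s * 1000.0 / LINE_BYTES;

	return MEM_CLOCK_HZ / lines_per_s;
}
```

Under these assumptions, 33379 kB/s gives about 38.35 memory clocks per
line and 304368 kB/s about 4.21, which agrees with the table. The 12.62
CPU clocks figure is then 4.21 multiplied by the 240/80 clock ratio.
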
> >Actually, I am wrong, I was only reading the image,
> > not writing to it, simply computing a very reduced
> > averaged image, so there were some writes from time
> > to time, but very rarely.
> 
> That should get the best performance. That's not a common operation
> in what I'm doing which is alpha-blending graphics over each other.

The procedure was used during a calibration phase of an optical
system, to compute tables that were later used for real-time image
corrections by an FPGA. Optimizing it only made the calibration
shorter; it did not run in real-time, so I spent some time
optimizing it, but could not spend a lot.

-- 
					    Gilles.


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2014-10-17 19:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <543E4B9F.60602@cgglobal.com>
2014-10-15 10:59 ` [Xenomai] First call to rt_timer_tsc() causes an unexpected switch to secondary mode ZIV-Alberto Ozalla Cantabrana
2014-10-15 11:11   ` Gilles Chanteperdrix
2014-10-15 13:12     ` ZIV-Alberto Ozalla Cantabrana
2014-10-15 13:16       ` Gilles Chanteperdrix
2014-10-15 20:03       ` Gilles Chanteperdrix
2014-10-17 16:33         ` ZIV-Alberto Ozalla Cantabrana
2014-10-17 16:38           ` Gilles Chanteperdrix
2014-10-15 13:19   ` Gilles Chanteperdrix
2014-10-15 13:34     ` ZIV-Alberto Ozalla Cantabrana
2014-10-16  7:16   ` Gilles Chanteperdrix
2014-10-16  8:16     ` Gilles Chanteperdrix
2014-10-16  8:33       ` ZIV-Alberto Ozalla Cantabrana
2014-10-16  8:39         ` Gilles Chanteperdrix
2014-10-16 18:17       ` Lennart Sorensen
2014-10-16 18:58         ` Gilles Chanteperdrix
2014-10-16 20:56           ` Lennart Sorensen
2014-10-16 23:14             ` Tom Evans
2014-10-17  5:34               ` Gilles Chanteperdrix
2014-10-17  6:47                 ` Tom Evans
2014-10-17  7:02                   ` Gilles Chanteperdrix
2014-10-17 14:08                     ` Tom Evans
2014-10-17 19:36                       ` Gilles Chanteperdrix
2014-10-17 14:32                     ` Anders Blomdell
