* ARM cortex A9 performance issue
@ 2011-07-07  9:18 rd bairva
  2011-07-07 15:27 ` Dave Martin
  0 siblings, 1 reply; 6+ messages in thread
From: rd bairva @ 2011-07-07  9:18 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

We are trying to benchmark ARM Cortex-A9 dual-core behaviour against
single-core performance, and also to compare it with x86 dual-core/single-core.
Attached to this mail is the C app we are using for benchmarking.

Simple overview of C application:
- It creates a shared memory area using shm_open().
  In this shm area it declares two process-shared pthread mutexes, let's say L1 and L2.
- It then forks to create a server task and a client task.
- server_task takes the L1 lock, touches the shm area, and unlocks L2, in a loop.
- client_task takes the L2 lock, touches the shm area, and unlocks L1, in a loop.
- This loop runs N times, which is what we measure.

Here are the results for N/sec for different CPUs.

Platform      Up cores    req/sec     cpuload
Cortex-A9     2           ~44000      100%
Cortex-A9     1           ~18000      100%
x86           2           ~64000      35%
x86           1           ~458886     100%   (1 cpu down by sysfs)

We are not able to understand these results:
- Why do we get 100% CPU usage on Cortex-A9 for both dual and single core?
- Why, in the case of x86, is N/sec much higher for single core than for dual core?
- Why does single-core N/s drop to roughly 1/3 in the case of Cortex-A9?

Thanks,
Ramdayal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shmem-test-20110421.tgz
Type: application/x-gzip
Size: 2648 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110707/6f2e4b29/attachment.tgz>


* ARM cortex A9 performance issue
  2011-07-07  9:18 ARM cortex A9 performance issue rd bairva
@ 2011-07-07 15:27 ` Dave Martin
  2011-07-08 11:38   ` rd bairva
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Martin @ 2011-07-07 15:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 07, 2011 at 02:48:05PM +0530, rd bairva wrote:
> Hi,
> 
> We are trying to benchmark ARM cortex A9 dual core behavior with
> respect to single core
> performance and also measuring it with respect to x86 dual core/single core.
> Please find attached with the mail is c app which we are using for benchmarking.
> 
> Simple overview of C application:
> - It creates a shared memory area using shm_open().
>   in this shm area, it declares 2 Process shared pthread mutex, lets say L1, L2.
> - then it forks to create a server_task and client task.
> - server_task takes L1 lock, touches a shm area, unlock L2, in a loop.
> - client_task takes L2 lock, touches a shm area, unlock L1, in a loop.
> - This loop runs for N number of times that we measure.

The behaviour of pthread_mutex_unlock is unspecified if an attempt is made
to unlock a mutex from a thread which doesn't currently own that mutex.
You probably need to re-code your test to avoid this incorrect use of
the ABI before an interpretation can be placed on the results.

See pthread_mutexattr_init(3) for details.



Some people have reported issues related to process-shared mutexes
on ARM recently:

https://bugs.launchpad.net/ubuntu/+source/apr/+bug/604753

I'm not sure of the current status of that though, and I don't know
whether it would affect your test or not.

Cheers
---Dave

> 
> Here are the results for N/sec for different CPUs.
> 
> Platform      Up cores    req/sec     cpuload
> Cortexa9      2           ~44000      100%
> Cortexa9      1           ~18000      100%
> X86           2           ~64000      35%
> X86           1           ~458886     100%   (1 cpu down by sysfs)
> 
> we are not able to understand the results.
> - why for coretx A9 both dual/single core we are getting 100% cpu usage.
> - why in case of x86, N/sec is very high for single core than dual core.
> - why single core N/s is 1/3 in case of CortexA9.
> 
> Thanks,
> Ramdayal


> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* ARM cortex A9 performance issue
  2011-07-07 15:27 ` Dave Martin
@ 2011-07-08 11:38   ` rd bairva
  2011-07-08 13:33     ` Dave Martin
  0 siblings, 1 reply; 6+ messages in thread
From: rd bairva @ 2011-07-08 11:38 UTC (permalink / raw)
  To: linux-arm-kernel

Now I have modified the source a little. I am now doing a ping-pong
using msgsnd and msgrcv; with this I am getting 40000 req/sec and 55%
CPU usage.
In another version I have taken the same lock in both processes, to
ensure the same thread that locks the mutex also unlocks it. But CPU
usage is 100%. Shouldn't the behaviour be 50%?

Algorithm (msgsnd/msgrcv version):

Process 1:
shared memory counter++
msgsnd 1
msgrcv 2

Process 2:
msgrcv 1
shared memory counter--
msgsnd 2

Mutex version:

Process 1:
lock mutex1
counter++
unlock mutex1

Process 2:
lock mutex1
counter++
unlock mutex1


Regards,
Ramdayal


On Thu, Jul 7, 2011 at 8:57 PM, Dave Martin <dave.martin@linaro.org> wrote:
> On Thu, Jul 07, 2011 at 02:48:05PM +0530, rd bairva wrote:
>> Hi,
>>
>> We are trying to benchmark ARM cortex A9 dual core behavior with
>> respect to single core
>> performance and also measuring it with respect to x86 dual core/single core.
>> Please find attached with the mail is c app which we are using for benchmarking.
>>
>> Simple overview of C application:
>> - It creates a shared memory area using shm_open().
>>   in this shm area, it declares 2 Process shared pthread mutex, lets say L1, L2.
>> - then it forks to create a server_task and client task.
>> - server_task takes L1 lock, touches a shm area, unlock L2, in a loop.
>> - client_task takes L2 lock, touches a shm area, unlock L1, in a loop.
>> - This loop runs for N number of times that we measure.
>
> The behaviour of pthread_mutex_unlock is unspecified if an attempt is made
> to ulock a mutex from a thread which doesn't currently own that mutex.
> You probably need to re-code your test to avoid this incorrect use of
> the ABI before an interpretation can be placed on the results.
>
> See pthread_mutexattr_init(3) for details.
>
>
>
> Some people have been reported issues related to process shared mutexes
> on ARM recently:
>
> https://bugs.launchpad.net/ubuntu/+source/apr/+bug/604753
>
> I'm not sure of the current status of that though, and I don't know
> whether it would affect your test or not.
>
> Cheers
> ---Dave
>
>>
>> Here are the results for N/sec for different CPUs.
>>
>> Platform      Up cores    req/sec     cpuload
>> Cortexa9      2           ~44000      100%
>> Cortexa9      1           ~18000      100%
>> X86           2           ~64000      35%
>> X86           1           ~458886     100%   (1 cpu down by sysfs)
>>
>> we are not able to understand the results.
>> - why for coretx A9 both dual/single core we are getting 100% cpu usage.
>> - why in case of x86, N/sec is very high for single core than dual core.
>> - why single core N/s is 1/3 in case of CortexA9.
>>
>> Thanks,
>> Ramdayal
>
>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
>


* ARM cortex A9 performance issue
  2011-07-08 11:38   ` rd bairva
@ 2011-07-08 13:33     ` Dave Martin
  2011-07-11  8:26       ` rd bairva
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Martin @ 2011-07-08 13:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jul 08, 2011 at 05:08:38PM +0530, rd bairva wrote:
> Now i have modified a source a little bit. Now I am doing a pingpong
> using msgsnd and msgrcv. using this I am getting 40000req/sec and 55%
> CPU usage.
> In another version I have taken the same lock in both the processes to
> ensure same thread is unlocking the mutex. But CPU usage is 100%.
> Shouldn't be the behaviour 50%.

Possibly.  Do you see that behaviour on all platforms, or just A9?

> 
> algo
> msgsnd/msgrcv version.
> 
> Process1
> shared memory counter++
> msgsnd1
> msgrcv 2
> 
> Process2
> msgrcv 1
> shared memory counter--
> msgsnd2
> 
> Mutex version:
> Process 1
> 
> mutex1
> c++
> unlock_mutex1
> 
> Process 2
> mutex1
> c++
> unlock_mutex1

If there is some overhead outside the critical section, then the two threads
are likely to end up synchronised in such a way that this hides some or all of
the latency of acquiring the lock.  So I'd expect a CPU load somewhere between
50% and 100%, though I'd be a bit surprised if all the latency is hidden.

If you don't check or wait for the counter increment, the same thread may
repeatedly take the lock of course, without ever waiting.  If that happens,
you would see 100% load.  This probably can't happen with the msgsnd/msgrcv
version.

Your results from msgsnd/msgrcv also suggest that the hidden message
receive latency and other system overheads account for something like  
of the total CPU load for that code, which sounds plausible.

Can you attach your new code?

Cheers
---Dave


* ARM cortex A9 performance issue
  2011-07-08 13:33     ` Dave Martin
@ 2011-07-11  8:26       ` rd bairva
  2011-07-11 17:00         ` Dave Martin
  0 siblings, 1 reply; 6+ messages in thread
From: rd bairva @ 2011-07-11  8:26 UTC (permalink / raw)
  To: linux-arm-kernel

Attaching the new code.

On Fri, Jul 8, 2011 at 7:03 PM, Dave Martin <dave.martin@linaro.org> wrote:
> On Fri, Jul 08, 2011 at 05:08:38PM +0530, rd bairva wrote:
>> Now i have modified a source a little bit. Now I am doing a pingpong
>> using msgsnd and msgrcv. using this I am getting 40000req/sec and 55%
>> CPU usage.
>> In another version I have taken the same lock in both the processes to
>> ensure same thread is unlocking the mutex. But CPU usage is 100%.
>> Shouldn't be the behaviour 50%.
>
> Possibly.  Do you see that behaviour on all platforms, or just A9?
On a single processor it is always 100%, on ARM as well as x86; but on
x86 the CPU is >50% free.
>
>>
>> algo
>> msgsnd/msgrcv version.
>>
>> Process1
>> shared memory counter++
>> msgsnd1
>> msgrcv 2
>>
>> Process2
>> msgrcv 1
>> shared memory counter--
>> msgsnd2
>>
>> Mutex version:
>> Process 1
>>
>> mutex1
>> c++
>> unlock_mutex1
>>
>> Process 2
>> mutex1
>> c++
>> unlock_mutex1
>
> If there is some overhead outside the critical section, then the two threads
> are likely to end up synchronised in such a way that this hides some or all of
> the latency of acquiring the lock.  So I'd expect a CPU load somewhere between
> 50% and 100%, though I'd be a bit surprised if all the latency is hidden.
>
> If you don't check or wait for the counter increment, the same thread may
> repeatedly take the lock of course, without ever waiting.  If that happens,
> you would see 100% load.  This probably can't happen with the msgsnd/msgrcv
> version.
>
> Your results from msgsnd/msgrcv also suggest that the hidden message
> receive latency and other system overheads account for something like
> of the total CPU load for that code, which sounds plausible.
>
> Can you attach your new code?
>
> Cheers
> ---Dave
>
>
Thanks and regards,
Ramdayal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: shmemtest.zip
Type: application/zip
Size: 3414 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110711/3367f4db/attachment.zip>


* ARM cortex A9 performance issue
  2011-07-11  8:26       ` rd bairva
@ 2011-07-11 17:00         ` Dave Martin
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Martin @ 2011-07-11 17:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jul 11, 2011 at 01:56:35PM +0530, rd bairva wrote:
> Attaching the new code.
> 
> On Fri, Jul 8, 2011 at 7:03 PM, Dave Martin <dave.martin@linaro.org> wrote:
> > On Fri, Jul 08, 2011 at 05:08:38PM +0530, rd bairva wrote:
> >> Now i have modified a source a little bit. Now I am doing a pingpong
> >> using msgsnd and msgrcv. using this I am getting 40000req/sec and 55%
> >> CPU usage.
> >> In another version I have taken the same lock in both the processes to
> >> ensure same thread is unlocking the mutex. But CPU usage is 100%.
> >> Shouldn't be the behaviour 50%.
> >
> > Possibly.  Do you see that behaviour on all platforms, or just A9?
> On single processor it is always 100% on ARM as well as X86. but on
> X86, cpu is >50% free.

For me, on x86:

With WITH_MSG, the test consumes about 56-57% on two CPUs.
Without WITH_MSG, the test consumes about 59-60% on two CPUs.

(as reported by running top during the test)

For me, this is as expected: there is some latency involved in signalling
between CPUs, so there should be a bit less than 50% of useful work per
CPU.  However, there is overhead in both processes, and this can parallelise
against the signalling latency, giving an overall load of a bit more than
50% per CPU.  The total is about the same on each CPU, since both processes
are doing essentially the same thing.

I quickly hacked up a completely different implementation using pthreads
condition variables to signal between threads and got similar results (though
with a bit more CPU load).

Someone borrowed my pandaboard, but I'll try it on there when I get a chance...

Cheers
---Dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pingpong.c
Type: text/x-csrc
Size: 3602 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20110711/18ac98eb/attachment.bin>


end of thread, other threads:[~2011-07-11 17:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-07  9:18 ARM cortex A9 performance issue rd bairva
2011-07-07 15:27 ` Dave Martin
2011-07-08 11:38   ` rd bairva
2011-07-08 13:33     ` Dave Martin
2011-07-11  8:26       ` rd bairva
2011-07-11 17:00         ` Dave Martin
