linux-kernel.vger.kernel.org archive mirror
* Re: Enquiry on unbalanced memory throughput for dual-Cortex A9 core.
       [not found] <5F1105621EDF844291AF8B109E27C06D34C4BEBD@PGSMSX109.gar.corp.intel.com>
@ 2018-07-20 10:36 ` Russell King - ARM Linux
  2018-07-23  3:39   ` Ooi, Tzy Way
  2018-07-24  2:22   ` Ooi, Tzy Way
  0 siblings, 2 replies; 4+ messages in thread
From: Russell King - ARM Linux @ 2018-07-20 10:36 UTC (permalink / raw)
  To: Ooi, Tzy Way
  Cc: linux-kernel, See, Chin Liang, Tan, Ley Foon, Nguyen, Dinh, Aw,
	Khai Liang

On Fri, Jul 20, 2018 at 08:49:47AM +0000, Ooi, Tzy Way wrote:
> Hi Russell,
> 
> I am trying the memory write operation with the LM benchmark test. I
> tried to execute the memory write operation here
> <http://lmbench.sourceforge.net/cgi-bin/man?section=8&keyword=bw_mem>
> twice so that each Cortex A9 core works on one of the processes.
> Both processors then perform the write operation to memory at almost
> the same time.
> 
> As shown in the pictures below, the memory throughput from one of the
> cores is about double the throughput of the other core, i.e. 377MB/s vs
> 728MB/s.
> 
> [cid:image001.png@01D42049.5A7D0070]
> 
> I have tested this operation across a few dual-core Cortex A9 boards and
> all of them show the same result. The test was run on kernel version 4.9
> and on the newest Linux kernel, version 4.18.0-rc2.

Here's how 4.14 behaves on an iMX6D SoC (also dual core Cortex A9):

$ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000 1M fwr
[1] 21799
1.00 521.10
1.00 497.27
[1]+  Done                    taskset -c 0 ./bw_mem -N 1000 1M fwr
$ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000 1M fwr
[1] 21803
1.00 520.83
1.00 496.44

which shows some asymmetry but nowhere near yours.

I'm using taskset to force each to be locked to a particular CPU - you'll
see why further down.  Even without it, I get similar results to those I
mention above.
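
As an aside, taskset just sets the scheduler's CPU affinity mask for the
process; the same pinning can be done from inside a program via
sched_setaffinity().  A minimal sketch (not something bw_mem does itself):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);		/* allow this process on CPU0 only */
	if (sched_setaffinity(0, sizeof(set), &set) < 0)
		perror("sched_setaffinity");
	printf("running on CPU%d\n", sched_getcpu());
	return 0;
}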

Now, playing around with this, so we can identify which bw_mem output is
which:

$ taskset -c 0 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 1 ./bw_mem -N 1000 1M fwr 2>&1); echo "c1: $c1"
[1] 21876
1.00 521.92
c1: 1.00 496.69
$ taskset -c 1 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 0 ./bw_mem -N 1000 1M fwr 2>&1); echo "c0: $c1"
[1] 21881
c0: 1.00 521.83
1.00 496.20

CPU0 is always the slightly faster of the two.  If we use /usr/bin/time
to time these:

CPU0:
6.10user 0.25system 0:06.56elapsed 96%CPU (0avgtext+0avgdata 1664maxresident)k
0inputs+0outputs (0major+407minor)pagefaults 0swaps

CPU1:
6.36user 0.24system 0:06.77elapsed 97%CPU (0avgtext+0avgdata 1600maxresident)k
0inputs+0outputs (0major+399minor)pagefaults 0swaps

So, CPU1 takes slightly longer in userspace, and has fewer resident pages
and fewer minor faults, which is rather odd.  Repeatedly running just one
instance gives different results each time... disabling virtual address
space randomisation solves that:

  echo 0 >/proc/sys/kernel/randomize_va_space

which then gives me:

CPU0: 1.00 520.20
6.18user 0.20system 0:06.59elapsed 96%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
CPU1: 1.00 496.61
6.46user 0.14system 0:06.77elapsed 97%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

CPU0: 1.00 521.10
6.13user 0.21system 0:06.57elapsed 96%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
CPU1: 1.00 498.01
6.40user 0.18system 0:06.75elapsed 97%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

which is rather more stable as far as resource usage goes between the
two CPUs, but still an asymmetry in the reported bandwidths and times.
So, this has ruled out differences in VA layout.

Now for the interesting bit... it's important to understand what and
how stuff is being measured.  Looking at the bw_mem.c and associated
source code, it measures the performance against the wall clock, which
includes everything that the system is doing on each particular CPU.
So, if a CPU is interrupted by another thread wanting to run, it'll
affect the results.  Hence, it's best to run on an otherwise quiet
system, eg, without an init daemon (eg, booted with init=/bin/sh on
the kernel command line - but note there won't be any job control,
so ^C won't work!)
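
To make that concrete, the wall-clock approach boils down to roughly the
following pattern (a simplified sketch, not bw_mem's actual code - the
buffer size and iteration count are just placeholders matching the runs
above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
	size_t size = 1024 * 1024;	/* 1M working set */
	int i, iterations = 1000;
	char *buf = malloc(size);
	struct timeval start, end;
	double secs;

	if (!buf)
		return 1;

	gettimeofday(&start, NULL);
	for (i = 0; i < iterations; i++)
		memset(buf, 0x5a, size);	/* the "work" being measured */
	gettimeofday(&end, NULL);

	/*
	 * Everything between the two gettimeofday() calls is charged to
	 * the benchmark - interrupts, other runnable tasks, the lot.
	 */
	secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
	printf("%.2f MB/s\n", (double)size * iterations / secs / 1e6);
	free(buf);
	return 0;
}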

However, continuing on...

If I run bw_mem on just one CPU:

CPU1: 1.00 2617.31
5.74user 0.18system 0:06.03elapsed 98%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

Same number of iterations, same memory size, but notice that bw_mem
reports it as a lot faster, even though the elapsed time is about the
same.  cpufreq comes to mind, but that's disabled on this system.

So, it brings up a rather obvious question: what exactly is bw_mem
measuring, and is it measuring it correctly?

$ /usr/bin/time taskset -c 1 ./bw_mem -P 1 -N 1000 1M fwr
1.00 2601.26
5.80user 0.16system 0:06.06elapsed 98%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
$ /usr/bin/time ./bw_mem -P 2 -N 1000 1M fwr
^CCommand terminated by signal 2
5.54user 0.13system 1:12.20elapsed 7%CPU (0avgtext+0avgdata 1696maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps

so requesting a parallelism of 2 results in the program seemingly never
ending in a reasonable period of time, which suggests a bug somewhere.
Are we sure that bw_mem is actually working as intended?

Maybe if Larry is reading this, he could share some thoughts.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up


* RE: Enquiry on unbalanced memory throughput for dual-Cortex A9 core.
  2018-07-20 10:36 ` Enquiry on unbalanced memory throughput for dual-Cortex A9 core Russell King - ARM Linux
@ 2018-07-23  3:39   ` Ooi, Tzy Way
  2018-07-24  2:22   ` Ooi, Tzy Way
  1 sibling, 0 replies; 4+ messages in thread
From: Ooi, Tzy Way @ 2018-07-23  3:39 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: linux-kernel, See, Chin Liang, Tan, Ley Foon, Nguyen, Dinh, Aw,
	Khai Liang

[-- Attachment #1: Type: text/plain, Size: 9350 bytes --]

Hi Russell,

Thanks for the detailed explanation.

I am running the bw_mem test provided in LMbench to measure the throughput of writing data to memory, since it seems to be a general benchmark program.

Initially, I ran my own memory test program and encountered the same unbalanced memory throughput whenever two threads run on different cores. I ran the test program on either one core or two cores, and the imbalance only shows up when running on two cores. Hence, I tried out the bw_mem test, and it behaves much like my own test case.

Attached are the memory test program memtest.c and the Linux executable. memtest -a1 forces the two threads to run on different cores, while memtest -a2 forces both threads onto one core.

Would it be possible for you to try out my test program on your iMX6D SoC (also dual-core Cortex A9) board?

Comparison of the test results, one core vs two cores, on my board with Linux 4.9 and 4.18-rc2:
One core:

========= Multi Thread =========

Thread 3067511920 - data size 1 MB, runs = 1000
Thread 3059123312 - data size 1 MB, runs = 1000
Thread :3059123312: Datarate: 974.887201 MB/s
Thread :3067511920: Datarate: 960.289834 MB/s
Thread :3067511920: Datarate: 1083.249741 MB/s
Thread :3059123312: Datarate: 1055.545769 MB/s
Thread :3067511920: Datarate: 1085.555446 MB/s
Thread :3059123312: Datarate: 1084.503430 MB/s
Thread :3067511920: Datarate: 1063.379303 MB/s
Thread :3059123312: Datarate: 1070.705338 MB/s
Thread :3067511920: Datarate: 1050.933243 MB/s
Thread :3059123312: Datarate: 1050.153330 MB/s
Thread :3067511920: Datarate: 1085.489144 MB/s
Thread :3059123312: Datarate: 1071.774560 MB/s
Thread :3067511920: Datarate: 1084.506795 MB/s
Thread :3059123312: Datarate: 1060.260066 MB/s
Thread :3067511920: Datarate: 1074.058027 MB/s
Thread :3059123312: Datarate: 1069.279388 MB/s
Thread :3067511920: Datarate: 1073.924924 MB/s
Thread :3059123312: Datarate: 1080.818992 MB/s
Thread :3067511920: Datarate: 1081.871683 MB/s
Thread :3067511920: Average Datarate: 1064.325814 MB/s
Thread :3059123312: Datarate: 1097.549768 MB/s
Thread :3059123312: Average Datarate: 1061.547784 MB/s
Finished!

Two Cores:
========= Multi Thread =========

Thread 3067954288 - data size 1 MB, runs = 1000
Thread 3059565680 - data size 1 MB, runs = 1000
Thread :3067954288: Datarate: 741.930805 MB/s
Thread :3059565680: Datarate: 377.979641 MB/s
Thread :3067954288: Datarate: 741.976479 MB/s
Thread :3067954288: Datarate: 740.548015 MB/s
Thread :3059565680: Datarate: 376.706463 MB/s
Thread :3067954288: Datarate: 740.313260 MB/s
Thread :3067954288: Datarate: 740.363440 MB/s
Thread :3059565680: Datarate: 376.129877 MB/s
Thread :3067954288: Datarate: 740.056194 MB/s
Thread :3067954288: Datarate: 740.219191 MB/s
Thread :3059565680: Datarate: 376.114092 MB/s
Thread :3067954288: Datarate: 740.152311 MB/s
Thread :3067954288: Datarate: 724.094688 MB/s
Thread :3059565680: Datarate: 388.118117 MB/s
Thread :3067954288: Datarate: 740.556383 MB/s
Thread :3067954288: Average Datarate: 739.021077 MB/s
Thread :3059565680: Datarate: 1323.735631 MB/s
Thread :3059565680: Datarate: 2072.948256 MB/s
Thread :3059565680: Datarate: 2069.817984 MB/s
Thread :3059565680: Datarate: 2069.295149 MB/s
Thread :3059565680: Datarate: 2040.932474 MB/s
Thread :3059565680: Average Datarate: 1147.177768 MB/s
Finished!

Thank you for your comments and help.

Best regards,
Tzy Way

-----Original Message-----
From: Russell King - ARM Linux <linux@armlinux.org.uk> 
Sent: Friday, July 20, 2018 6:36 PM
To: Ooi, Tzy Way <tzy.way.ooi@intel.com>
Cc: linux-kernel@vger.kernel.org; See, Chin Liang <chin.liang.see@intel.com>; Tan, Ley Foon <ley.foon.tan@intel.com>; Nguyen, Dinh <dinh.nguyen@intel.com>; Aw, Khai Liang <khai.liang.aw@intel.com>
Subject: Re: Enquiry on unbalanced memory throughput for dual-Cortex A9 core.

On Fri, Jul 20, 2018 at 08:49:47AM +0000, Ooi, Tzy Way wrote:
> Hi Russell,
> 
> I am trying the memory write operation with the LM benchmark test. I 
> tried to execute the memory write operation here 
> <http://lmbench.sourceforge.net/cgi-bin/man?section=8&keyword=bw_mem>
> twice to get both Cortex A9 core processor to work on each processes.
> Both processors is going to perform write operation at almost the same 
> time to the memory.
> 
> As shown in the pictures below, the memory throughput from one of the 
> cores is about double the throughput of another core. i.e. 377MB/s VS 
> 728MB/s
> 
> [cid:image001.png@01D42049.5A7D0070]
> 
> I have tested this operation across few dual cores Cortex A9 boards 
> and all the board is having the same result. The test is tested on 
> kernel version 4.9 and newest Linux kernel version 4.18.0-rc2

Here's how 4.14 behaves on an iMX6D SoC (also dual core Cortex A9):

$ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000 1M fwr
[1] 21799
1.00 521.10
1.00 497.27
[1]+  Done                    taskset -c 0 ./bw_mem -N 1000 1M fwr
$ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000 1M fwr
[1] 21803
1.00 520.83
1.00 496.44

which shows some asymmetry but nowhere near yours.

I'm using taskset to force each to be locked to a particular CPU - you'll see why further down.  Even without it, I get similar results to those I mention above.

Now, playing around with this, so we can identify which bw_mem output is
which:

$ taskset -c 0 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 1 ./bw_mem -N 1000 1M fwr 2>&1); echo "c1: $c1"
[1] 21876
1.00 521.92
c1: 1.00 496.69
$ taskset -c 1 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 0 ./bw_mem -N 1000 1M fwr 2>&1); echo "c0: $c1"
[1] 21881
c0: 1.00 521.83
1.00 496.20

CPU0 is always the slightly faster of the two.  If we use /usr/bin/time to time these:

CPU0:
6.10user 0.25system 0:06.56elapsed 96%CPU (0avgtext+0avgdata 1664maxresident)k
0inputs+0outputs (0major+407minor)pagefaults 0swaps

CPU1:
6.36user 0.24system 0:06.77elapsed 97%CPU (0avgtext+0avgdata 1600maxresident)k
0inputs+0outputs (0major+399minor)pagefaults 0swaps

So, CPU1 takes slightly longer in userspace, has less resident pages and less minor faults which is rather odd.  Repeatedly running just one instance gives different results each time... disabling virtual address space randomisation solves that:

  echo 0 >/proc/sys/kernel/randomize_va_space

which then gives me:

CPU0: 1.00 520.20
6.18user 0.20system 0:06.59elapsed 96%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
CPU1: 1.00 496.61
6.46user 0.14system 0:06.77elapsed 97%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

CPU0: 1.00 521.10
6.13user 0.21system 0:06.57elapsed 96%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
CPU1: 1.00 498.01
6.40user 0.18system 0:06.75elapsed 97%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

which is rather more stable as far as resource usage goes between the two CPUs, but still an asymmetry in the reported bandwidths and times.
So, this has ruled out differences in VA layout.

Now for the interesting bit... it's important to understand what and how stuff is being measured.  Looking at the bw_mem.c and associated source code, it measures the performance against the wall clock, which includes everything that the system is doing on each particular CPU.
So, if a CPU is interrupted by another thread wanting to run, it'll affect the results.  Hence, it's best to run on an otherwise quiet system, eg, without an init daemon (eg, booted with init=/bin/sh on the kernel command line - but note there won't be any job control, so ^C won't work!)

However, continuing on...

If I run bw_mem on just one CPU:

CPU1: 1.00 2617.31
5.74user 0.18system 0:06.03elapsed 98%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps

Same number of iterations, same memory size, but notice that it appears to be a lot faster reported by bw_mem, but the time taken is about the same.  cpufreq comes to mind, but that's disabled on this system.

So, it brings up a rather obvious question: what exactly is bw_mem measuring, and is it measuring it correctly?

$ /usr/bin/time taskset -c 1 ./bw_mem -P 1 -N 1000 1M fwr
1.00 2601.26
5.80user 0.16system 0:06.06elapsed 98%CPU (0avgtext+0avgdata 1700maxresident)k
0inputs+0outputs (0major+403minor)pagefaults 0swaps
$ /usr/bin/time ./bw_mem -P 2 -N 1000 1M fwr
^CCommand terminated by signal 2
5.54user 0.13system 1:12.20elapsed 7%CPU (0avgtext+0avgdata 1696maxresident)k
0inputs+0outputs (0major+365minor)pagefaults 0swaps

so requesting a parallelism of 2 results in the program never seemingly ending in a reasonable period of time, which suggests a bug somewhere.
Are we sure that bw_mem is actually working as intended?

Maybe if Larry is reading this, he could share some thoughts.

--
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps up
According to speedtest.net: 13Mbps down 490kbps up

[-- Attachment #2: memtest.c --]
[-- Type: text/plain, Size: 6496 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <malloc.h>
#include <pthread.h>
#include <semaphore.h>
#include <string.h>
#include <sys/time.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/mman.h>

#define MB	(1024*1024)
#define QUAD_CORE	0

# define handle_error_en(en, msg) \
	do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0);

unsigned long l_megabytes = 1;
unsigned long l_repeat = 1000;

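/*
 * Single-threaded baseline: memset() an l_megabytes MB buffer l_repeat
 * times per pass, time ten passes with gettimeofday() and report the
 * average data rate.
 */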
void SingleFct(void)
{
	unsigned long l_size = l_megabytes*MB;
	unsigned char * l_memBuffer = memalign(32, l_size);
	int i,j;
	unsigned long l_time;
	double l_datarate;
	double t_datarate = 0.0;
	struct timeval  start, end;

	printf("Single Thread - data size %ld MB, runs = %ld\n", l_megabytes, l_repeat);
	for(j=0; j<10; j++)
	{
                gettimeofday(&start, NULL);
		for(i=0; i<l_repeat; i++)
		{
			memset(l_memBuffer, 0x5a, l_size);
		}
                gettimeofday(&end, NULL);

		/* subtract first to avoid overflowing tv_sec*1000000 on 32-bit time_t */
		l_time = (end.tv_sec - start.tv_sec)*1000000UL + (end.tv_usec - start.tv_usec);
		// printf("end time = %u.%06u, start time = %u.%06u\n", end.tv_sec, end.tv_usec, start.tv_sec, start.tv_usec);
		// printf("l_time = %u\n", l_time);
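		/* l_size is in bytes and l_time in microseconds, so this works
		 * out directly in (decimal) MB/s */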
		l_datarate = (double)l_size*(double)l_repeat/(double)l_time;
		printf("Single Thread Datarate: %f MB/s\n", l_datarate);
		t_datarate += l_datarate;
	}
	printf("Single Thread Average Datarate: %f MB/s\n", t_datarate/10.0f);
	free(l_memBuffer);
}

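/*
 * Per-thread worker: the same memset()/gettimeofday() loop as SingleFct(),
 * tagged with the pthread id so the two threads' rates can be told apart
 * in the output.  The thread argument is unused.
 */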
void *Fct(void *arg)
{
	unsigned long l_size = l_megabytes*MB;
	unsigned char * l_memBuffer = memalign(32, l_size);
	unsigned long l_time;
	double l_datarate;
	double t_datarate = 0.0;
	int i,j;
	struct timeval  start, end;

	printf("Thread %u - data size %ld MB, runs = %ld\n", (unsigned int)pthread_self(), l_megabytes, l_repeat);
	for(j=0; j<10; j++)
	{
                gettimeofday(&start, NULL);
		for(i=0; i<l_repeat; i++)
		{
			memset(l_memBuffer, 0x5a, l_size);
		}
                gettimeofday(&end, NULL);
		/* subtract first to avoid overflowing tv_sec*1000000 on 32-bit time_t */
		l_time = (end.tv_sec - start.tv_sec)*1000000UL + (end.tv_usec - start.tv_usec);
		l_datarate = (double)l_size*(double)l_repeat/(double)l_time;
		printf("Thread :%u: Datarate: %lf MB/s\n", (unsigned int)pthread_self(), l_datarate);
		t_datarate += l_datarate;

	}

	printf("Thread :%u: Average Datarate: %lf MB/s\n", (unsigned int)pthread_self(), t_datarate/10.0f);
	free(l_memBuffer);

	pthread_exit((void *) 0);
}

int main(int argc, char** argv) {

	pthread_t t1, t2, t3, t4;

	pthread_attr_t t1_attr, t2_attr, t3_attr, t4_attr;
	cpu_set_t t1_cpuset, t2_cpuset, t3_cpuset, t4_cpuset;

	int rv;
	int affinity = 0;
	int iarg = 1;

	while (iarg < argc) {
	    if (!strcmp(argv[iarg],"-a1")) {
                affinity = 1;
            }
            else if (!strcmp(argv[iarg],"-a2")) {
                affinity = 2;
            } 
            else if (!strcmp(argv[iarg],"-s1")) {
                l_megabytes = 1;
            }
            else if (!strcmp(argv[iarg],"-s2")) {
                l_megabytes = 10;
            }
            else if (!strcmp(argv[iarg],"-s3")) {
                l_megabytes = 100;
            }
            else {
                printf("Usage: %s [-a1|-a2 ] [-s1|-s2|-s3]\n",argv[0]);
                exit(0);
            }
            iarg++;
	}
	printf("========= Single Thread =========\n\n");

	SingleFct();

	printf("\n\n========= Multi Thread =========\n\n");

	rv = pthread_attr_init(&t1_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #1");

	rv = pthread_attr_init(&t2_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #2");

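	/*
	 * -a1: pin thread 1 to CPU0 and thread 2 to CPU1 (different cores).
	 * -a2: pin both threads to CPU0 (same core).
	 * Without -a1/-a2 the scheduler is left to place the threads itself.
	 */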
	if (affinity == 1) 
	{
	    CPU_ZERO(&t1_cpuset);
	    CPU_SET(0,&t1_cpuset);
	    rv = pthread_attr_setaffinity_np(&t1_attr,sizeof(t1_cpuset),&t1_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #1");

	    CPU_ZERO(&t2_cpuset);
	    CPU_SET(1,&t2_cpuset);
	    rv = pthread_attr_setaffinity_np(&t2_attr,sizeof(t2_cpuset),&t2_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #2");
	}
	else if (affinity == 2) 
	{
	    CPU_ZERO(&t1_cpuset);
	    CPU_SET(0,&t1_cpuset);
	    rv = pthread_attr_setaffinity_np(&t1_attr,sizeof(t1_cpuset),&t1_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #1");

	    CPU_ZERO(&t2_cpuset);
	    CPU_SET(0,&t2_cpuset);
	    rv = pthread_attr_setaffinity_np(&t2_attr,sizeof(t2_cpuset),&t2_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #2");
	}

	rv = pthread_create(&t1, &t1_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #1");

	rv = pthread_create(&t2, &t2_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #2");

#if QUAD_CORE
	rv = pthread_attr_init(&t3_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #3");

	rv = pthread_attr_init(&t4_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #4");

	if (affinity == 1) 
	{
	    CPU_ZERO(&t3_cpuset);
	    CPU_SET(2,&t3_cpuset);
	    rv = pthread_attr_setaffinity_np(&t3_attr,sizeof(t3_cpuset),&t3_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #3");

	    CPU_ZERO(&t4_cpuset);
	    CPU_SET(3,&t4_cpuset);
	    rv = pthread_attr_setaffinity_np(&t4_attr,sizeof(t4_cpuset),&t4_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #4");
	}
	else if (affinity == 2) 
	{
	    CPU_ZERO(&t3_cpuset);
	    CPU_SET(0,&t3_cpuset);
	    rv = pthread_attr_setaffinity_np(&t3_attr,sizeof(t3_cpuset),&t3_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #3");

	    CPU_ZERO(&t4_cpuset);
	    CPU_SET(0,&t4_cpuset);
	    rv = pthread_attr_setaffinity_np(&t4_attr,sizeof(t4_cpuset),&t4_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #4");
	}

	rv = pthread_create(&t3, &t3_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #3");

	rv = pthread_create(&t4, &t4_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #4");

#endif

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
#if QUAD_CORE
    pthread_join(t3, NULL);
    pthread_join(t4, NULL);
#endif

	printf("Finished!\n");
	return 0;
}

[-- Attachment #3: memtest_2 --]
[-- Type: application/octet-stream, Size: 10384 bytes --]


* RE: Enquiry on unbalanced memory throughput for dual-Cortex A9 core.
  2018-07-20 10:36 ` Enquiry on unbalanced memory throughput for dual-Cortex A9 core Russell King - ARM Linux
  2018-07-23  3:39   ` Ooi, Tzy Way
@ 2018-07-24  2:22   ` Ooi, Tzy Way
  2018-07-27  6:51     ` Ooi, Tzy Way
  1 sibling, 1 reply; 4+ messages in thread
From: Ooi, Tzy Way @ 2018-07-24  2:22 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: linux-kernel, See, Chin Liang, Tan, Ley Foon, Nguyen, Dinh, Aw,
	Khai Liang

[-- Attachment #1: Type: text/plain, Size: 9542 bytes --]

> -----Original Message-----
> From: Russell King - ARM Linux <linux@armlinux.org.uk>
> Sent: Friday, July 20, 2018 6:36 PM
> To: Ooi, Tzy Way <tzy.way.ooi@intel.com>
> Cc: linux-kernel@vger.kernel.org; See, Chin Liang
> <chin.liang.see@intel.com>; Tan, Ley Foon <ley.foon.tan@intel.com>;
> Nguyen, Dinh <dinh.nguyen@intel.com>; Aw, Khai Liang
> <khai.liang.aw@intel.com>
> Subject: Re: Enquiry on unbalanced memory throughput for dual-Cortex A9
> core.
> 
> On Fri, Jul 20, 2018 at 08:49:47AM +0000, Ooi, Tzy Way wrote:
> > Hi Russell,
> >
> > I am trying the memory write operation with the LM benchmark test. I
> > tried to execute the memory write operation here
> > <http://lmbench.sourceforge.net/cgi-
> bin/man?section=8&keyword=bw_mem>
> > twice to get both Cortex A9 core processor to work on each processes.
> > Both processors is going to perform write operation at almost the same
> > time to the memory.
> >
> > As shown in the pictures below, the memory throughput from one of the
> > cores is about double the throughput of another core. i.e. 377MB/s VS
> > 728MB/s
> >
> > [cid:image001.png@01D42049.5A7D0070]
> >
> > I have tested this operation across few dual cores Cortex A9 boards
> > and all the board is having the same result. The test is tested on
> > kernel version 4.9 and newest Linux kernel version 4.18.0-rc2
> 
> Here's how 4.14 behaves on an iMX6D SoC (also dual core Cortex A9):
> 
> $ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000
> 1M fwr [1] 21799
> 1.00 521.10
> 1.00 497.27
> [1]+  Done                    taskset -c 0 ./bw_mem -N 1000 1M fwr
> $ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000
> 1M fwr [1] 21803
> 1.00 520.83
> 1.00 496.44
> 
> which shows some asymmetry but nowhere near yours.
> 
> I'm using taskset to force each to be locked to a particular CPU - you'll see
> why further down.  Even without it, I get similar results to those I mention
> above.
> 
> Now, playing around with this, so we can identify which bw_mem output is
> which:
> 
> $ taskset -c 0 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 1 ./bw_mem -N
> 1000 1M fwr 2>&1); echo "c1: $c1"
> [1] 21876
> 1.00 521.92
> c1: 1.00 496.69
> $ taskset -c 1 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 0 ./bw_mem -N
> 1000 1M fwr 2>&1); echo "c0: $c1"
> [1] 21881
> c0: 1.00 521.83
> 1.00 496.20
> 
> CPU0 is always the slightly faster of the two.  If we use /usr/bin/time to time
> these:
> 
> CPU0:
> 6.10user 0.25system 0:06.56elapsed 96%CPU (0avgtext+0avgdata
> 1664maxresident)k
> 0inputs+0outputs (0major+407minor)pagefaults 0swaps
> 
> CPU1:
> 6.36user 0.24system 0:06.77elapsed 97%CPU (0avgtext+0avgdata
> 1600maxresident)k
> 0inputs+0outputs (0major+399minor)pagefaults 0swaps
> 
> So, CPU1 takes slightly longer in userspace, has less resident pages and less
> minor faults which is rather odd.  Repeatedly running just one instance gives
> different results each time... disabling virtual address space randomisation
> solves that:
> 
>   echo 0 >/proc/sys/kernel/randomize_va_space
> 
> which then gives me:
> 
> CPU0: 1.00 520.20
> 6.18user 0.20system 0:06.59elapsed 96%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> CPU1: 1.00 496.61
> 6.46user 0.14system 0:06.77elapsed 97%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> 
> CPU0: 1.00 521.10
> 6.13user 0.21system 0:06.57elapsed 96%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> CPU1: 1.00 498.01
> 6.40user 0.18system 0:06.75elapsed 97%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> 
> which is rather more stable as far as resource usage goes between the two
> CPUs, but still an asymmetry in the reported bandwidths and times.
> So, this has ruled out differences in VA layout.
> 
> Now for the interesting bit... it's important to understand what and how stuff
> is being measured.  Looking at the bw_mem.c and associated source code, it
> measures the performance against the wall clock, which includes everything
> that the system is doing on each particular CPU.
> So, if a CPU is interrupted by another thread wanting to run, it'll affect the
> results.  Hence, it's best to run on an otherwise quiet system, eg, without an
> init daemon (eg, booted with init=/bin/sh on the kernel command line - but
> note there won't be any job control, so ^C won't work!)
> 
> However, continuing on...
> 
> If I run bw_mem on just one CPU:
> 
> CPU1: 1.00 2617.31
> 5.74user 0.18system 0:06.03elapsed 98%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> 
> Same number of iterations, same memory size, but notice that it appears to
> be a lot faster reported by bw_mem, but the time taken is about the same.
> cpufreq comes to mind, but that's disabled on this system.
> 
> So, it brings up a rather obvious question: what exactly is bw_mem
> measuring, and is it measuring it correctly?
> 
> $ /usr/bin/time taskset -c 1 ./bw_mem -P 1 -N 1000 1M fwr
> 1.00 2601.26
> 5.80user 0.16system 0:06.06elapsed 98%CPU (0avgtext+0avgdata
> 1700maxresident)k
> 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> $ /usr/bin/time ./bw_mem -P 2 -N 1000 1M fwr ^CCommand terminated by
> signal 2 5.54user 0.13system 1:12.20elapsed 7%CPU (0avgtext+0avgdata
> 1696maxresident)k
> 0inputs+0outputs (0major+365minor)pagefaults 0swaps
> 
> so requesting a parallelism of 2 results in the program never seemingly
> ending in a reasonable period of time, which suggests a bug somewhere.
> Are we sure that bw_mem is actually working as intended?
> 
> Maybe if Larry is reading this, he could share some thoughts.
> 
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down 630kbps
> up According to speedtest.net: 13Mbps down 490kbps up

Thanks for the detailed explanation of the LMbench test.

Initially, I ran my own memory test program and encountered the same unbalanced memory throughput whenever two threads run on different cores. I ran the test program on either one core or two cores, and the imbalance only shows up when running on two cores. Hence, I tried out the bw_mem test, since it is a general benchmark program, and it behaves much like my own test case.

Attached are the memory test program memtest.c and the Linux executable. memtest -a1 forces the two threads to run on different cores, while memtest -a2 forces both threads onto one core.
Would it be possible for you to try out my test program on your iMX6D SoC (also dual-core Cortex A9) board?

Below is a comparison of two threads running on one core vs two cores:

One core:
========= Multi Thread =========

Thread 3067511920 - data size 1 MB, runs = 1000 
Thread 3059123312 - data size 1 MB, runs = 1000 
Thread :3059123312: Datarate: 974.887201 MB/s 
Thread :3067511920: Datarate: 960.289834 MB/s 
Thread :3067511920: Datarate: 1083.249741 MB/s 
Thread :3059123312: Datarate: 1055.545769 MB/s 
Thread :3067511920: Datarate: 1085.555446 MB/s
Thread :3059123312: Datarate: 1084.503430 MB/s 
Thread :3067511920: Datarate: 1063.379303 MB/s 
Thread :3059123312: Datarate: 1070.705338 MB/s 
Thread :3067511920: Datarate: 1050.933243 MB/s 
Thread :3059123312: Datarate: 1050.153330 MB/s 
Thread :3067511920: Datarate: 1085.489144 MB/s 
Thread :3059123312: Datarate: 1071.774560 MB/s
Thread :3067511920: Datarate: 1084.506795 MB/s 
Thread :3059123312: Datarate: 1060.260066 MB/s 
Thread :3067511920: Datarate: 1074.058027 MB/s 
Thread :3059123312: Datarate: 1069.279388 MB/s
Thread :3067511920: Datarate: 1073.924924 MB/s 
Thread :3059123312: Datarate: 1080.818992 MB/s 
Thread :3067511920: Datarate: 1081.871683 MB/s 
Thread :3067511920: Average Datarate: 1064.325814 MB/s 
Thread :3059123312: Datarate: 1097.549768 MB/s 
Thread :3059123312: Average Datarate: 1061.547784 MB/s
Finished!

Two cores:
========= Multi Thread =========

Thread 3067954288 - data size 1 MB, runs = 1000 
Thread 3059565680 - data size 1 MB, runs = 1000 
Thread :3067954288: Datarate: 741.930805 MB/s 
Thread :3059565680: Datarate: 377.979641 MB/s 
Thread :3067954288: Datarate: 741.976479 MB/s 
Thread :3067954288: Datarate: 740.548015 MB/s 
Thread :3059565680: Datarate: 376.706463 MB/s 
Thread :3067954288: Datarate: 740.313260 MB/s 
Thread :3067954288: Datarate: 740.363440 MB/s 
Thread :3059565680: Datarate: 376.129877 MB/s 
Thread :3067954288: Datarate: 740.056194 MB/s 
Thread :3067954288: Datarate: 740.219191 MB/s
Thread :3059565680: Datarate: 376.114092 MB/s
Thread :3067954288: Datarate: 740.152311 MB/s 
Thread :3067954288: Datarate: 724.094688 MB/s 
Thread :3059565680: Datarate: 388.118117 MB/s 
Thread :3067954288: Datarate: 740.556383 MB/s 
Thread :3067954288: Average Datarate: 739.021077 MB/s 
Thread :3059565680: Datarate: 1323.735631 MB/s 
Thread :3059565680: Datarate: 2072.948256 MB/s 
Thread :3059565680: Datarate: 2069.817984 MB/s
Thread :3059565680: Datarate: 2069.295149 MB/s
Thread :3059565680: Datarate: 2040.932474 MB/s 
Thread :3059565680: Average Datarate: 1147.177768 MB/s
Finished!

Thanks
Tzy Way

[-- Attachment #2: memtest.c --]
[-- Type: text/plain, Size: 6496 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <malloc.h>
#include <pthread.h>
#include <semaphore.h>
#include <string.h>
#include <sys/time.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/mman.h>

#define MB	(1024*1024)
#define QUAD_CORE	0

# define handle_error_en(en, msg) \
	do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0);

unsigned long l_megabytes = 1;
unsigned long l_repeat = 1000;

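/*
 * Single-threaded baseline: memset() an l_megabytes MB buffer l_repeat
 * times per pass, time ten passes with gettimeofday() and report the
 * average data rate.
 */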
void SingleFct(void)
{
	unsigned long l_size = l_megabytes*MB;
	unsigned char * l_memBuffer = memalign(32, l_size);
	int i,j;
	unsigned long l_time;
	double l_datarate;
	double t_datarate = 0.0;
	struct timeval  start, end;

	printf("Single Thread - data size %ld MB, runs = %ld\n", l_megabytes, l_repeat);
	for(j=0; j<10; j++)
	{
                gettimeofday(&start, NULL);
		for(i=0; i<l_repeat; i++)
		{
			memset(l_memBuffer, 0x5a, l_size);
		}
                gettimeofday(&end, NULL);

		/* subtract first to avoid overflowing tv_sec*1000000 on 32-bit time_t */
		l_time = (end.tv_sec - start.tv_sec)*1000000UL + (end.tv_usec - start.tv_usec);
		// printf("end time = %u.%06u, start time = %u.%06u\n", end.tv_sec, end.tv_usec, start.tv_sec, start.tv_usec);
		// printf("l_time = %u\n", l_time);
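		/* l_size is in bytes and l_time in microseconds, so this works
		 * out directly in (decimal) MB/s */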
		l_datarate = (double)l_size*(double)l_repeat/(double)l_time;
		printf("Single Thread Datarate: %f MB/s\n", l_datarate);
		t_datarate += l_datarate;
	}
	printf("Single Thread Average Datarate: %f MB/s\n", t_datarate/10.0f);
	free(l_memBuffer);
}

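/*
 * Per-thread worker: the same memset()/gettimeofday() loop as SingleFct(),
 * tagged with the pthread id so the two threads' rates can be told apart
 * in the output.  The thread argument is unused.
 */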
void *Fct(void *arg)
{
	unsigned long l_size = l_megabytes*MB;
	unsigned char * l_memBuffer = memalign(32, l_size);
	unsigned long l_time;
	double l_datarate;
	double t_datarate = 0.0;
	int i,j;
	struct timeval  start, end;

	printf("Thread %u - data size %ld MB, runs = %ld\n", (unsigned int)pthread_self(), l_megabytes, l_repeat);
	for(j=0; j<10; j++)
	{
                gettimeofday(&start, NULL);
		for(i=0; i<l_repeat; i++)
		{
			memset(l_memBuffer, 0x5a, l_size);
		}
                gettimeofday(&end, NULL);
		/* subtract first to avoid overflowing tv_sec*1000000 on 32-bit time_t */
		l_time = (end.tv_sec - start.tv_sec)*1000000UL + (end.tv_usec - start.tv_usec);
		l_datarate = (double)l_size*(double)l_repeat/(double)l_time;
		printf("Thread :%u: Datarate: %lf MB/s\n", (unsigned int)pthread_self(), l_datarate);
		t_datarate += l_datarate;

	}

	printf("Thread :%u: Average Datarate: %lf MB/s\n", (unsigned int)pthread_self(), t_datarate/10.0f);
	free(l_memBuffer);

	pthread_exit((void *) 0);
}

int main(int argc, char** argv) {

	pthread_t t1, t2, t3, t4;

	pthread_attr_t t1_attr, t2_attr, t3_attr, t4_attr;
	cpu_set_t t1_cpuset, t2_cpuset, t3_cpuset, t4_cpuset;

	int rv;
	int affinity = 0;
	int iarg = 1;

	while (iarg < argc) {
	    if (!strcmp(argv[iarg],"-a1")) {
                affinity = 1;
            }
            else if (!strcmp(argv[iarg],"-a2")) {
                affinity = 2;
            } 
            else if (!strcmp(argv[iarg],"-s1")) {
                l_megabytes = 1;
            }
            else if (!strcmp(argv[iarg],"-s2")) {
                l_megabytes = 10;
            }
            else if (!strcmp(argv[iarg],"-s3")) {
                l_megabytes = 100;
            }
            else {
                printf("Usage: %s [-a1|-a2 ] [-s1|-s2|-s3]\n",argv[0]);
                exit(0);
            }
            iarg++;
	}
	printf("========= Single Thread =========\n\n");

	SingleFct();

	printf("\n\n========= Multi Thread =========\n\n");

	rv = pthread_attr_init(&t1_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #1");

	rv = pthread_attr_init(&t2_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #2");

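	/*
	 * -a1: pin thread 1 to CPU0 and thread 2 to CPU1 (different cores).
	 * -a2: pin both threads to CPU0 (same core).
	 * Without -a1/-a2 the scheduler is left to place the threads itself.
	 */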
	if (affinity == 1) 
	{
	    CPU_ZERO(&t1_cpuset);
	    CPU_SET(0,&t1_cpuset);
	    rv = pthread_attr_setaffinity_np(&t1_attr,sizeof(t1_cpuset),&t1_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #1");

	    CPU_ZERO(&t2_cpuset);
	    CPU_SET(1,&t2_cpuset);
	    rv = pthread_attr_setaffinity_np(&t2_attr,sizeof(t2_cpuset),&t2_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #2");
	}
	else if (affinity == 2) 
	{
	    CPU_ZERO(&t1_cpuset);
	    CPU_SET(0,&t1_cpuset);
	    rv = pthread_attr_setaffinity_np(&t1_attr,sizeof(t1_cpuset),&t1_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #1");

	    CPU_ZERO(&t2_cpuset);
	    CPU_SET(0,&t2_cpuset);
	    rv = pthread_attr_setaffinity_np(&t2_attr,sizeof(t2_cpuset),&t2_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #2");
	}

	rv = pthread_create(&t1, &t1_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #1");

	rv = pthread_create(&t2, &t2_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #2");

#if QUAD_CORE
	rv = pthread_attr_init(&t3_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #3");

	rv = pthread_attr_init(&t4_attr);
	if (rv != 0) handle_error_en(rv,"pthread_attr_init thread #4");

	if (affinity == 1) 
	{
	    CPU_ZERO(&t3_cpuset);
	    CPU_SET(2,&t3_cpuset);
	    rv = pthread_attr_setaffinity_np(&t3_attr,sizeof(t3_cpuset),&t3_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #3");

	    CPU_ZERO(&t4_cpuset);
	    CPU_SET(3,&t4_cpuset);
	    rv = pthread_attr_setaffinity_np(&t4_attr,sizeof(t4_cpuset),&t4_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #4");
	}
	else if (affinity == 2) 
	{
	    CPU_ZERO(&t3_cpuset);
	    CPU_SET(0,&t3_cpuset);
	    rv = pthread_attr_setaffinity_np(&t3_attr,sizeof(t3_cpuset),&t3_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #3");

	    CPU_ZERO(&t4_cpuset);
	    CPU_SET(0,&t4_cpuset);
	    rv = pthread_attr_setaffinity_np(&t4_attr,sizeof(t4_cpuset),&t4_cpuset);
	    if (rv != 0) handle_error_en(rv,"pthread_attr_setaffinity_np thread #4");
	}

	rv = pthread_create(&t3, &t3_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #3");

	rv = pthread_create(&t4, &t4_attr, Fct, NULL);
	if (rv != 0) handle_error_en(rv,"pthread_create thread #4");

#endif

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
#if QUAD_CORE
    pthread_join(t3, NULL);
    pthread_join(t4, NULL);
#endif

	printf("Finished!\n");
	return 0;
}

[-- Attachment #3: memtest_2 --]
[-- Type: application/octet-stream, Size: 10384 bytes --]


* RE: Enquiry on unbalanced memory throughput for dual-Cortex A9 core.
  2018-07-24  2:22   ` Ooi, Tzy Way
@ 2018-07-27  6:51     ` Ooi, Tzy Way
  0 siblings, 0 replies; 4+ messages in thread
From: Ooi, Tzy Way @ 2018-07-27  6:51 UTC (permalink / raw)
  To: 'Russell King - ARM Linux'
  Cc: 'linux-kernel@vger.kernel.org',
	See, Chin Liang, Tan, Ley Foon, Nguyen, Dinh, Aw, Khai Liang

> -----Original Message-----
> From: Ooi, Tzy Way
> Sent: Tuesday, July 24, 2018 10:23 AM
> To: Russell King - ARM Linux <linux@armlinux.org.uk>
> Cc: linux-kernel@vger.kernel.org; See, Chin Liang
> <chin.liang.see@intel.com>; Tan, Ley Foon <ley.foon.tan@intel.com>;
> Nguyen, Dinh <dinh.nguyen@intel.com>; Aw, Khai Liang
> <khai.liang.aw@intel.com>
> Subject: RE: Enquiry on unbalanced memory throughput for dual-Cortex A9
> core.
> 
> > -----Original Message-----
> > From: Russell King - ARM Linux <linux@armlinux.org.uk>
> > Sent: Friday, July 20, 2018 6:36 PM
> > To: Ooi, Tzy Way <tzy.way.ooi@intel.com>
> > Cc: linux-kernel@vger.kernel.org; See, Chin Liang
> > <chin.liang.see@intel.com>; Tan, Ley Foon <ley.foon.tan@intel.com>;
> > Nguyen, Dinh <dinh.nguyen@intel.com>; Aw, Khai Liang
> > <khai.liang.aw@intel.com>
> > Subject: Re: Enquiry on unbalanced memory throughput for dual-Cortex
> > A9 core.
> >
> > On Fri, Jul 20, 2018 at 08:49:47AM +0000, Ooi, Tzy Way wrote:
> > > Hi Russell,
> > >
> > > I am trying the memory write operation with the LM benchmark test. I
> > > tried to execute the memory write operation here
> > > <http://lmbench.sourceforge.net/cgi-
> > bin/man?section=8&keyword=bw_mem>
> > > twice to get both Cortex A9 core processor to work on each processes.
> > > Both processors is going to perform write operation at almost the
> > > same time to the memory.
> > >
> > > As shown in the pictures below, the memory throughput from one of
> > > the cores is about double the throughput of another core. i.e.
> > > 377MB/s VS 728MB/s
> > >
> > > [cid:image001.png@01D42049.5A7D0070]
> > >
> > > I have tested this operation across few dual cores Cortex A9 boards
> > > and all the board is having the same result. The test is tested on
> > > kernel version 4.9 and newest Linux kernel version 4.18.0-rc2
> >
> > Here's how 4.14 behaves on an iMX6D SoC (also dual core Cortex A9):
> >
> > $ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000
> > 1M fwr [1] 21799
> > 1.00 521.10
> > 1.00 497.27
> > [1]+  Done                    taskset -c 0 ./bw_mem -N 1000 1M fwr
> > $ taskset -c 0 ./bw_mem -N 1000 1M fwr & taskset -c 1 ./bw_mem -N 1000
> > 1M fwr [1] 21803
> > 1.00 520.83
> > 1.00 496.44
> >
> > which shows some asymmetry but nowhere near yours.
> >
> > I'm using taskset to force each to be locked to a particular CPU -
> > you'll see why further down.  Even without it, I get similar results
> > to those I mention above.
> >
> > Now, playing around with this, so we can identify which bw_mem output
> > is
> > which:
> >
> > $ taskset -c 0 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 1 ./bw_mem -N
> > 1000 1M fwr 2>&1); echo "c1: $c1"
> > [1] 21876
> > 1.00 521.92
> > c1: 1.00 496.69
> > $ taskset -c 1 ./bw_mem -N 1000 1M fwr & c1=$(taskset -c 0 ./bw_mem -N
> > 1000 1M fwr 2>&1); echo "c0: $c1"
> > [1] 21881
> > c0: 1.00 521.83
> > 1.00 496.20
> >
> > CPU0 is always the slightly faster of the two.  If we use
> > /usr/bin/time to time
> > these:
> >
> > CPU0:
> > 6.10user 0.25system 0:06.56elapsed 96%CPU (0avgtext+0avgdata
> > 1664maxresident)k
> > 0inputs+0outputs (0major+407minor)pagefaults 0swaps
> >
> > CPU1:
> > 6.36user 0.24system 0:06.77elapsed 97%CPU (0avgtext+0avgdata
> > 1600maxresident)k
> > 0inputs+0outputs (0major+399minor)pagefaults 0swaps
> >
> > So, CPU1 takes slightly longer in userspace, has less resident pages
> > and less minor faults which is rather odd.  Repeatedly running just
> > one instance gives different results each time... disabling virtual
> > address space randomisation solves that:
> >
> >   echo 0 >/proc/sys/kernel/randomize_va_space
> >
> > which then gives me:
> >
> > CPU0: 1.00 520.20
> > 6.18user 0.20system 0:06.59elapsed 96%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> > CPU1: 1.00 496.61
> > 6.46user 0.14system 0:06.77elapsed 97%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> >
> > CPU0: 1.00 521.10
> > 6.13user 0.21system 0:06.57elapsed 96%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> > CPU1: 1.00 498.01
> > 6.40user 0.18system 0:06.75elapsed 97%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> >
> > which is rather more stable as far as resource usage goes between the
> > two CPUs, but still an asymmetry in the reported bandwidths and times.
> > So, this has ruled out differences in VA layout.
> >
> > Now for the interesting bit... it's important to understand what and
> > how stuff is being measured.  Looking at the bw_mem.c and associated
> > source code, it measures the performance against the wall clock, which
> > includes everything that the system is doing on each particular CPU.
> > So, if a CPU is interrupted by another thread wanting to run, it'll
> > affect the results.  Hence, it's best to run on an otherwise quiet
> > system, eg, without an init daemon (eg, booted with init=/bin/sh on
> > the kernel command line - but note there won't be any job control, so
> > ^C won't work!)
> >
> > However, continuing on...
> >
> > If I run bw_mem on just one CPU:
> >
> > CPU1: 1.00 2617.31
> > 5.74user 0.18system 0:06.03elapsed 98%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> >
> > Same number of iterations, same memory size, but notice that it
> > appears to be a lot faster reported by bw_mem, but the time taken is
> about the same.
> > cpufreq comes to mind, but that's disabled on this system.
> >
> > So, it brings up a rather obvious question: what exactly is bw_mem
> > measuring, and is it measuring it correctly?
> >
> > $ /usr/bin/time taskset -c 1 ./bw_mem -P 1 -N 1000 1M fwr
> > 1.00 2601.26
> > 5.80user 0.16system 0:06.06elapsed 98%CPU (0avgtext+0avgdata
> > 1700maxresident)k
> > 0inputs+0outputs (0major+403minor)pagefaults 0swaps
> > $ /usr/bin/time ./bw_mem -P 2 -N 1000 1M fwr ^CCommand terminated
> by
> > signal 2 5.54user 0.13system 1:12.20elapsed 7%CPU (0avgtext+0avgdata
> > 1696maxresident)k
> > 0inputs+0outputs (0major+365minor)pagefaults 0swaps
> >
> > so requesting a parallelism of 2 results in the program never
> > seemingly ending in a reasonable period of time, which suggests a bug
> somewhere.
> > Are we sure that bw_mem is actually working as intended?
> >
> > Maybe if Larry is reading this, he could share some thoughts.
> >
> > --
> > RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> > FTTC broadband for 0.8mile line in suburbia: sync at 13.8Mbps down
> > 630kbps up According to speedtest.net: 13Mbps down 490kbps up
> 
> Thanks for the detail explanation on LM bench test.
> 
> Initially, I tested to run my own's created memory test program and
> encountered the same unbalanced memory throughput whenever two
> threads are running on different cores. I tested to run the memory test
> program on either one core or two cores. The unbalanced memory
> throughput is seen when running on two cores. Hence, I tried out the
> bw_mem test as it is a general benchmark program and it appears to be alike
> to my own test case.
> 
> Attached is the memory test program memtest.c file and the Linux
> executable file. The memtest -a1 will forces the two threads running on
> different cores while memtest -a2 will forces the two threads running on one
> core.
> May I know if is it possible if you could try out my test program on your
> iMX6D SoC (also dual core Cortex A9) board?
> 
> Below show the comparison between two threads running on one core vs
> two cores
> 
> One core:
> ========= Multi Thread =========
> 
> Thread 3067511920 - data size 1 MB, runs = 1000
> Thread 3059123312 - data size 1 MB, runs = 1000
> Thread :3059123312: Datarate: 974.887201 MB/s
> Thread :3067511920: Datarate: 960.289834 MB/s
> Thread :3067511920: Datarate: 1083.249741 MB/s
> Thread :3059123312: Datarate: 1055.545769 MB/s
> Thread :3067511920: Datarate: 1085.555446 MB/s
> Thread :3059123312: Datarate: 1084.503430 MB/s
> Thread :3067511920: Datarate: 1063.379303 MB/s
> Thread :3059123312: Datarate: 1070.705338 MB/s
> Thread :3067511920: Datarate: 1050.933243 MB/s
> Thread :3059123312: Datarate: 1050.153330 MB/s
> Thread :3067511920: Datarate: 1085.489144 MB/s
> Thread :3059123312: Datarate: 1071.774560 MB/s
> Thread :3067511920: Datarate: 1084.506795 MB/s
> Thread :3059123312: Datarate: 1060.260066 MB/s
> Thread :3067511920: Datarate: 1074.058027 MB/s
> Thread :3059123312: Datarate: 1069.279388 MB/s
> Thread :3067511920: Datarate: 1073.924924 MB/s
> Thread :3059123312: Datarate: 1080.818992 MB/s
> Thread :3067511920: Datarate: 1081.871683 MB/s
> Thread :3067511920: Average Datarate: 1064.325814 MB/s
> Thread :3059123312: Datarate: 1097.549768 MB/s
> Thread :3059123312: Average Datarate: 1061.547784 MB/s
> Finished!
> 
> Two cores:
> ========= Multi Thread =========
> 
> Thread 3067954288 - data size 1 MB, runs = 1000
> Thread 3059565680 - data size 1 MB, runs = 1000
> Thread :3067954288: Datarate: 741.930805 MB/s
> Thread :3059565680: Datarate: 377.979641 MB/s
> Thread :3067954288: Datarate: 741.976479 MB/s
> Thread :3067954288: Datarate: 740.548015 MB/s
> Thread :3059565680: Datarate: 376.706463 MB/s
> Thread :3067954288: Datarate: 740.313260 MB/s
> Thread :3067954288: Datarate: 740.363440 MB/s
> Thread :3059565680: Datarate: 376.129877 MB/s
> Thread :3067954288: Datarate: 740.056194 MB/s
> Thread :3067954288: Datarate: 740.219191 MB/s
> Thread :3059565680: Datarate: 376.114092 MB/s
> Thread :3067954288: Datarate: 740.152311 MB/s
> Thread :3067954288: Datarate: 724.094688 MB/s
> Thread :3059565680: Datarate: 388.118117 MB/s
> Thread :3067954288: Datarate: 740.556383 MB/s
> Thread :3067954288: Average Datarate: 739.021077 MB/s
> Thread :3059565680: Datarate: 1323.735631 MB/s
> Thread :3059565680: Datarate: 2072.948256 MB/s
> Thread :3059565680: Datarate: 2069.817984 MB/s
> Thread :3059565680: Datarate: 2069.295149 MB/s
> Thread :3059565680: Datarate: 2040.932474 MB/s
> Thread :3059565680: Average Datarate: 1147.177768 MB/s
> Finished!
> 
> Thanks
> Tzy Way


Hi Russell,

Sorry to interrupt. I would like to check whether you have had a chance to try out the memory test program on your iMX6D SoC board?
Currently, I do not have any other board with me, and I would like to test across other platforms to see whether the same issue shows up elsewhere.

I would really appreciate your help in running the test program on your board. Hope to hear from you soon.

Thank you

Best regards,
Tzy Way


Thread overview: 4+ messages
     [not found] <5F1105621EDF844291AF8B109E27C06D34C4BEBD@PGSMSX109.gar.corp.intel.com>
2018-07-20 10:36 ` Enquiry on unbalanced memory throughput for dual-Cortex A9 core Russell King - ARM Linux
2018-07-23  3:39   ` Ooi, Tzy Way
2018-07-24  2:22   ` Ooi, Tzy Way
2018-07-27  6:51     ` Ooi, Tzy Way
