linux-arm-kernel.lists.infradead.org archive mirror
* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
@ 2017-05-16  4:06 Zhao Yibin
  2017-05-17 20:32 ` Fabio Estevam
  2017-05-28 19:17 ` Pavel Machek
  0 siblings, 2 replies; 8+ messages in thread
From: Zhao Yibin @ 2017-05-16  4:06 UTC (permalink / raw)
  To: linux-arm-kernel

Hi, ARM maintainers,

We have hit a DDR performance issue caused by the ARMv7 cache policy and
hope you can help.
On a single-processor ARMv7 (Cortex-A7) system running an ARM Linux kernel
built without CONFIG_SMP, the cache policy is set to write-back no-allocate,
which leads to very low DRAM throughput:
around 7MB/s for writes and 28MB/s for reads.
If we enable CONFIG_SMP and CONFIG_SMP_ON_UP, the cache policy changes to
write-back read-write-allocate and the DRAM throughput improves a lot,
to around 120MB/s for writes and 160MB/s for reads.

The cache attributes set in arch/arm/mm/proc-v7-2level.S show the difference:
/* PTWs cacheable, inner WB not shareable, outer WB not shareable */
#define TTB_FLAGS_UP    TTB_IRGN_WB|TTB_RGN_OC_WB
#define PMD_FLAGS_UP    PMD_SECT_WB

/* PTWs cacheable, inner WBWA shareable, outer WBWA not shareable */
#define TTB_FLAGS_SMP    TTB_IRGN_WBWA|TTB_S|TTB_NOS|TTB_RGN_OC_WBWA
#define PMD_FLAGS_SMP    PMD_SECT_WBWA|PMD_SECT_S


Is this a bug or the desired behavior? Do we have to enable SMP on a
single-processor system?

2. Test method:
dd if=/dev/zero of=/dev/shm/abc bs=10M count=5
dd if=/dev/shm/abc of=/dev/null bs=10M count=5
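
For readers without dd at hand, the two commands above can be approximated
by a small script. This is only a sketch under stated assumptions: the
function name measure_write_read and its parameters are illustrative and do
not come from the original report, and interpreter overhead means the numbers
will not match dd exactly. It writes zero-filled blocks to a temporary file in
the given directory (for example a tmpfs mount such as /dev/shm), reads them
back, and reports both rates in MB/s (1 MB = 10^6 bytes).

```python
import os
import tempfile
import time


def measure_write_read(directory, block_size=10 * 1024 * 1024, count=5):
    """Return (write_MB_per_s, read_MB_per_s) for count blocks of block_size."""
    block = bytes(block_size)  # zero-filled buffer, standing in for /dev/zero
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        t0 = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(count):
                f.write(block)
        t1 = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_size):  # discard the data, like of=/dev/null
                pass
        t2 = time.perf_counter()
    finally:
        os.unlink(path)
    total_mb = block_size * count / 1e6
    # Guard against a zero interval on very small transfers.
    write_rate = total_mb / max(t1 - t0, 1e-9)
    read_rate = total_mb / max(t2 - t1, 1e-9)
    return write_rate, read_rate


if __name__ == "__main__":
    w, r = measure_write_read(tempfile.gettempdir())
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

Pass the tmpfs mount point as the directory to stay close to the original
test; the default temporary directory may be backed by storage instead of RAM.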

3. ARM Linux kernel version:
Linux version 4.4.52+ (xxx at yyy) (gcc version 5.3.0 (Timesys 20160523)) #8 SMP PREEMPT Tue May 16 10:59:10 CST 2017

4. sh$ sh scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux build1 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30)
x86_64 GNU/Linux

GNU C                   4.9.2
GNU Make                3.81
Binutils                2.25
Util-linux              2.25.2
Mount                   2.25.2
Module-init-tools       18
E2fsprogs               1.42.12
Linux C Library         2.19
Dynamic linker (ldd)    2.23
Procps                  3.3.9
Net-tools               1.60
Kbd                     1.15.5
Console-tools           1.15.5
Sh-utils                8.23
Udev                    215
Wireless-tools          30
Modules Loaded          ablk_helper aesni_intel aes_x86_64 ahci
asus_wmi auth_rpcgss autofs4 button cfg80211 coretemp
cpufreq_conservative cpufreq_powersave cpufreq_stats cpufreq_userspace
crc16 crc32c_intel crc32_pclmul crc_t10dif crct10dif_common
crct10dif_generic crct10dif_pclmul cryptd dm_mod drm drm_kms_helper
e1000e eeepc_wmi ehci_hcd ehci_pci evdev ext4 fscache fuse gf128mul
glue_helper i2c_algo_bit i2c_core i2c_i801 intel_powerclamp intel_rapl
iTCO_vendor_support iTCO_wdt jbd2 kvm kvm_intel libahci libata lockd
lpc_ich lrw mbcache mei mei_me mfd_core mxm_wmi nfs nfs_acl nfsd
nouveau oid_registry pcspkr pps_core processor psmouse ptp rfkill
scsi_mod sd_mod serio_raw sg shpchp snd snd_hda_codec
snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_codec_realtek
snd_hda_controller snd_hda_intel snd_hwdep snd_pcm snd_timer soundcore
sparse_keymap sunrpc thermal_sys tpm tpm_infineon tpm_tis ttm
usb_common usbcore video wmi x86_pkg_temp_thermal xhci_hcd


Thanks
Bob


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-16  4:06 PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance Zhao Yibin
@ 2017-05-17 20:32 ` Fabio Estevam
  2017-05-18  9:35   ` Zhao Yibin
  2017-05-28 19:17 ` Pavel Machek
  1 sibling, 1 reply; 8+ messages in thread
From: Fabio Estevam @ 2017-05-17 20:32 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Zhao,

On Tue, May 16, 2017 at 1:06 AM, Zhao Yibin <ybzhao1989@gmail.com> wrote:
> Hi, ARM maintainers,
>
> We met some DDR performance issue caused by armv7 cache policy, hope
> you can help.
> On a single armv7(Cortex-A7) processor system, the arm linux kernel,
> without CONFIG_SMP, the cache policy is set to write-back no-allocate,
> which lead to very low DRAM speed,
> around 7MB/s for write, 28MB/s for read.

I saw the same behaviour on an mx6ul, which also has a Cortex-A7.

To fix this issue we had to enable the SMP bit in the bootloader.

As a reference you can look at this U-Boot patch:
https://patchwork.ozlabs.org/patch/747074/


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-17 20:32 ` Fabio Estevam
@ 2017-05-18  9:35   ` Zhao Yibin
  0 siblings, 0 replies; 8+ messages in thread
From: Zhao Yibin @ 2017-05-18  9:35 UTC (permalink / raw)
  To: linux-arm-kernel

Hi, Fabio,

Thanks,
Enabling ACTLR.SMP did change the cache behavior; the result is the same as
enabling CONFIG_SMP.

The Cortex-A7 TRM does mention this:

[6] SMP  Enables coherent requests to the processor:
    0  Disables coherent requests to the processor. This is the reset value.
    1  Enables coherent requests to the processor.
When coherent requests are disabled:
- loads to cacheable memory are not cached by the processor.
- Load-Exclusive instructions take a precise abort if the memory attributes are:
  - Inner Write-Back and Outer Shareable.
  - Inner Write-Through and Outer Shareable.
  - Outer Write-Back and Outer Shareable.
  - Outer Write-Through and Outer Shareable.
  - Inner Write-Back and Inner Shareable.
  - Inner Write-Through and Inner Shareable.
  - Outer Write-Back and Inner Shareable.
  - Outer Write-Through and Inner Shareable.
Note
  You must ensure this bit is set to 1 before the caches and MMU are enabled,
  or any cache and TLB maintenance operations are performed. The only time
  this bit is set to 0 is during a processor power-down sequence. See Power
  management on page 2-12.

I think it would be great for the kernel to enable ACTLR.SMP on Cortex-A7
before enabling the caches; otherwise loads to cacheable memory are not
actually cached.
This is really hard to debug unless you have read the TRM carefully.
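
The bit in question can be modelled in a few lines. This is a sketch of the
arithmetic only, not kernel or bootloader code: the helper names are
hypothetical, and on real hardware ACTLR is a CP15 register that has to be
read and written from privileged code before the caches and MMU are enabled,
as the TRM note above requires.

```python
# ACTLR.SMP is bit [6], per the Cortex-A7 TRM excerpt quoted above.
ACTLR_SMP = 1 << 6


def smp_enabled(actlr):
    """Return True if the SMP bit is set in a raw ACTLR value."""
    return bool(actlr & ACTLR_SMP)


def with_smp(actlr):
    """Return the ACTLR value with the SMP bit set, other bits unchanged."""
    return actlr | ACTLR_SMP


if __name__ == "__main__":
    reset_value = 0x0  # assumed example value; SMP is 0 at reset per the TRM
    print(smp_enabled(reset_value))
    print(hex(with_smp(reset_value)))
```

A bootloader or early-boot routine would perform the same read-modify-write
on the real register, which is what the U-Boot patch mentioned earlier in the
thread is about.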

Thanks
Bob


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-16  4:06 PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance Zhao Yibin
  2017-05-17 20:32 ` Fabio Estevam
@ 2017-05-28 19:17 ` Pavel Machek
  1 sibling, 0 replies; 8+ messages in thread
From: Pavel Machek @ 2017-05-28 19:17 UTC (permalink / raw)
  To: linux-arm-kernel

Hi!

> We met some DDR performance issue caused by armv7 cache policy, hope
> you can help.
> On a single armv7(Cortex-A7) processor system, the arm linux kernel,
> without CONFIG_SMP, the cache policy is set to write-back no-allocate,
> which lead to very low DRAM speed,
> around 7MB/s for write, 28MB/s for read.
> If enable CONFIG_SMP and CONFIG_SMP_ON_UP, then the cache policy is
> changed to write-back read-write-allocate, and the DRAM speed is
> improved a lot, around 120MB/s for write, 160MB/s for read.

Just out of curiosity, isn't that still 10x lower than expected? What kind of
system is that?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-18  9:33     ` Russell King - ARM Linux
@ 2017-05-18 10:21       ` Zhao Yibin
  0 siblings, 0 replies; 8+ messages in thread
From: Zhao Yibin @ 2017-05-18 10:21 UTC (permalink / raw)
  To: linux-arm-kernel

Hi, Russell,

Thanks for your explanation,
I did change the PMD_FLAGS.

Our CPU MIDR is 0x410FC075
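
For reference, the MIDR value splits into the standard ARMv7 fields as
sketched below. The function name decode_midr is illustrative; the field
positions are the architectural MIDR layout (implementer [31:24], variant
[23:20], architecture [19:16], primary part number [15:4], revision [3:0]).

```python
def decode_midr(midr):
    """Split a 32-bit ARMv7 MIDR value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # 0x41 is ASCII 'A', i.e. ARM
        "variant": (midr >> 20) & 0xF,        # the "rN" in rNpM
        "architecture": (midr >> 16) & 0xF,   # 0xF: see CPUID registers
        "part": (midr >> 4) & 0xFFF,          # 0xC07 is Cortex-A7
        "revision": midr & 0xF,               # the "pM" in rNpM
    }


if __name__ == "__main__":
    f = decode_midr(0x410FC075)
    print(f"implementer=0x{f['implementer']:02X} part=0x{f['part']:03X} "
          f"r{f['variant']}p{f['revision']}")
```

For 0x410FC075 this gives implementer 0x41, part 0xC07 and r0p5, which is
consistent with the Cortex-A7 MPCore r0p5 TRM cited in this message.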

I tried Fabio's suggestion of enabling ACTLR.SMP, and the cache
behavior did change;
the performance improved a lot.
According to the Cortex-A7 MPCore r0p5 TRM:

[6] SMP  Enables coherent requests to the processor:
    0  Disables coherent requests to the processor. This is the reset value.
    1  Enables coherent requests to the processor.
When coherent requests are disabled:
- loads to cacheable memory are not cached by the processor.
- Load-Exclusive instructions take a precise abort if the memory attributes are:
  - Inner Write-Back and Outer Shareable.
  - Inner Write-Through and Outer Shareable.
  - Outer Write-Back and Outer Shareable.
  - Outer Write-Through and Outer Shareable.
  - Inner Write-Back and Inner Shareable.
  - Inner Write-Through and Inner Shareable.
  - Outer Write-Back and Inner Shareable.
  - Outer Write-Through and Inner Shareable.
Note
  You must ensure this bit is set to 1 before the caches and MMU are enabled,
  or any cache and TLB maintenance operations are performed. The only time
  this bit is set to 0 is during a processor power-down sequence. See Power
  management on page 2-12.

If you can enable ACTLR.SMP for single-processor Cortex-A7 systems in the
kernel, that would be great,
since it is hard to know that SMP needs to be enabled for a single processor.

Thanks
Bob

2017-05-18 17:33 GMT+08:00 Russell King - ARM Linux <linux@armlinux.org.uk>:
> On Thu, May 18, 2017 at 04:25:12PM +0800, Zhao Yibin wrote:
>> Hi, Russell,
>>
>> I traced the page table of TTBR0,  and the map descriptor of the page
>> allocated from share ram,
>> TEX[0]-C-B is 0-1-1, LPAE is not enable, the TRE is 1, so TEX[0]-C-B
>> is mapped to the 3rd index of PRRR and NMRR.
>> PRRR register is 0xFF0A81A8, NMRR register is 0x40E040E0.
>> So the memory type is normal, and IR/OR is "Region is Write-Back, no
>> Write-Allocate." according to armv7 TRM
>>
>> I don't know what read-allocate can be, if cortex-a7 is simliar to
>> cortex-a15, then write-back read-allocate means
>> "Write-Back Read-Allocate  => Write-Back Read-Write-Allocate",
>
> The ARMv7 ARM gives details about how the PRRR and NMRR are decoded,
> giving pseudocode (see B3.19 for ConvertAttrsHints(), and B3.19.9 for
> the TEX remap decode pseudocode.)
>
> The PRRR and NMRR settings give write-back cache policy with a read-
> allocate _hint_.  The key thing here is that it's a _hint_, it doesn't
> mandate what the hardware does.  Different CPUs are free to use the
> hints in different ways.
>
> What this means is that while one CPU may interpret a "read-allocate"
> hint as meaning that it can allocate cache lines on read accesses,
> another CPU may do something different - it may either decide to
> augment that with "write-allocate" as well, or it may decide
> "no-allocate" (which seems to be your case.)
>
> There is no architected requirement here - it's implementation
> dependent, and that implementation dependence makes it difficult to
> deal with from a generic OS point of view.
>
> What may be right for one CPU may not be correct for another CPU.  In
> other words, we can't change this without risking causing regressions
> for the CPUs that we know work.  For example, if we changed to write-
> allocate mode (aka read-write-allocate), there could be some other a
> CPU out there which decides to implement that as no-allocate.
>
> However, if it's possible to identify your CPU uniquely, then it would
> be possible to change it just for your CPU.  What is the CPU MIDR value?
>
>> I tried change the value of TTB_FLAGS_UP to  the same as TTB_FLAGS_SMP
>> in arch/arm/mm/proc-v7-2level.S.
>
> You need to change PMD_FLAGS as well.  TTB determines the translation
> table base register values, which are the attributes used by the page
> table walker.  PMD determines what's used in the page tables themselves.
>
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-18  8:25   ` Zhao Yibin
@ 2017-05-18  9:33     ` Russell King - ARM Linux
  2017-05-18 10:21       ` Zhao Yibin
  0 siblings, 1 reply; 8+ messages in thread
From: Russell King - ARM Linux @ 2017-05-18  9:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 18, 2017 at 04:25:12PM +0800, Zhao Yibin wrote:
> Hi, Russell,
> 
> I traced the page table of TTBR0,  and the map descriptor of the page
> allocated from share ram,
> TEX[0]-C-B is 0-1-1, LPAE is not enable, the TRE is 1, so TEX[0]-C-B
> is mapped to the 3rd index of PRRR and NMRR.
> PRRR register is 0xFF0A81A8, NMRR register is 0x40E040E0.
> So the memory type is normal, and IR/OR is "Region is Write-Back, no
> Write-Allocate." according to armv7 TRM
> 
> I don't know what read-allocate can be, if cortex-a7 is simliar to
> cortex-a15, then write-back read-allocate means
> "Write-Back Read-Allocate  => Write-Back Read-Write-Allocate",

The ARMv7 ARM gives details about how the PRRR and NMRR are decoded,
giving pseudocode (see B3.19 for ConvertAttrsHints(), and B3.19.9 for
the TEX remap decode pseudocode.)

The PRRR and NMRR settings give write-back cache policy with a read-
allocate _hint_.  The key thing here is that it's a _hint_, it doesn't
mandate what the hardware does.  Different CPUs are free to use the
hints in different ways.

What this means is that while one CPU may interpret a "read-allocate"
hint as meaning that it can allocate cache lines on read accesses,
another CPU may do something different - it may either decide to
augment that with "write-allocate" as well, or it may decide
"no-allocate" (which seems to be your case.)

There is no architected requirement here - it's implementation
dependent, and that implementation dependence makes it difficult to
deal with from a generic OS point of view.

What may be right for one CPU may not be correct for another CPU.  In
other words, we can't change this without risking causing regressions
for the CPUs that we know work.  For example, if we changed to
write-allocate mode (aka read-write-allocate), there could be some other
CPU out there which decides to implement that as no-allocate.

However, if it's possible to identify your CPU uniquely, then it would
be possible to change it just for your CPU.  What is the CPU MIDR value?

> I tried change the value of TTB_FLAGS_UP to  the same as TTB_FLAGS_SMP
> in arch/arm/mm/proc-v7-2level.S.

You need to change PMD_FLAGS as well.  TTB determines the translation
table base register values, which are the attributes used by the page
table walker.  PMD determines what's used in the page tables themselves.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
  2017-05-16  8:29 ` Russell King - ARM Linux
@ 2017-05-18  8:25   ` Zhao Yibin
  2017-05-18  9:33     ` Russell King - ARM Linux
  0 siblings, 1 reply; 8+ messages in thread
From: Zhao Yibin @ 2017-05-18  8:25 UTC (permalink / raw)
  To: linux-arm-kernel

Hi, Russell,

I traced the page table at TTBR0 and the mapping descriptor of the page
allocated from shared RAM.
TEX[0]-C-B is 0-1-1, LPAE is not enabled, and TRE is 1, so TEX[0]-C-B
selects index 3 of PRRR and NMRR.
The PRRR register is 0xFF0A81A8 and the NMRR register is 0x40E040E0.
So the memory type is Normal, and IR/OR is "Region is Write-Back, no
Write-Allocate." according to the ARMv7 ARM.
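
The decode described above can be checked with a short script. It is a
sketch that assumes the short-descriptor format with TEX remap enabled
(TRE = 1) and covers only the TRn, IRn and ORn fields; the function and
table names are illustrative. The index is formed from {TEX[0], C, B}, the
memory type comes from PRRR bits [2n+1:2n], and the inner and outer
cacheability come from NMRR bits [2n+1:2n] and [2n+17:2n+16].

```python
MEM_TYPES = ("Strongly-ordered", "Device", "Normal", "Reserved")
CACHE_ATTRS = (
    "Non-cacheable",
    "Write-Back, Write-Allocate",
    "Write-Through, no Write-Allocate",
    "Write-Back, no Write-Allocate",
)


def remap_index(tex0, c, b):
    """Build the PRRR/NMRR index n from the descriptor bits {TEX[0], C, B}."""
    return (tex0 << 2) | (c << 1) | b


def decode_tex_remap(prrr, nmrr, n):
    """Return (memory type, inner attr, outer attr) for remap index n."""
    mem_type = MEM_TYPES[(prrr >> (2 * n)) & 0x3]         # PRRR.TRn
    inner = CACHE_ATTRS[(nmrr >> (2 * n)) & 0x3]          # NMRR.IRn
    outer = CACHE_ATTRS[(nmrr >> (2 * n + 16)) & 0x3]     # NMRR.ORn
    return mem_type, inner, outer


if __name__ == "__main__":
    n = remap_index(0, 1, 1)
    print(n, decode_tex_remap(0xFF0A81A8, 0x40E040E0, n))
```

With the register values from this message, index 3 decodes to Normal memory
with "Write-Back, no Write-Allocate" for both inner and outer, matching the
reading above. How a given core acts on that encoding is a separate question,
which Russell King's reply in this thread addresses.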

I don't know what read-allocate means here. If the Cortex-A7 is similar to
the Cortex-A15, then write-back read-allocate is treated as
"Write-Back Read-Allocate  => Write-Back Read-Write-Allocate",
according to this page:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438c/BABJDDBC.html
Is my guess right?

I tried changing the value of TTB_FLAGS_UP to match TTB_FLAGS_SMP
in arch/arm/mm/proc-v7-2level.S,
but the kernel fails to boot; it fails in cgroup_init_early(), called by
start_kernel().

Thanks
Bob

2017-05-16 16:29 GMT+08:00 Russell King - ARM Linux <linux@armlinux.org.uk>:
> On Tue, May 16, 2017 at 11:59:30AM +0800, Zhao Yibin wrote:
>> We met some DDR performance issue caused by armv7 cache policy, hope you
>> can help.
>> On a single armv7(Cortex-A7) processor system, the arm linux kernel,
>> without CONFIG_SMP, the cache policy is set to write-back no-allocate,
>
> It should be write-back read-allocate.
>
> --
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.


* PROBLEM: ARM Cache policy on single armv7 processor lead to low DRAM performance
       [not found] <CAOEkoLw3Y5h2P_aFeMKgHYyKZpQvsQsOPOLOuvaYZGOe+rHJWQ@mail.gmail.com>
@ 2017-05-16  8:29 ` Russell King - ARM Linux
  2017-05-18  8:25   ` Zhao Yibin
  0 siblings, 1 reply; 8+ messages in thread
From: Russell King - ARM Linux @ 2017-05-16  8:29 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 16, 2017 at 11:59:30AM +0800, Zhao Yibin wrote:
> We met some DDR performance issue caused by armv7 cache policy, hope you
> can help.
> On a single armv7(Cortex-A7) processor system, the arm linux kernel,
> without CONFIG_SMP, the cache policy is set to write-back no-allocate,

It should be write-back read-allocate.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
