* [PATCH 0/2] OMAPDSS: write-through caching support for omapfb
@ 2012-05-22 19:54 ` Siarhei Siamashka
  0 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-22 19:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-omap, linux-fbdev, linux, tomi.valkeinen, Siarhei Siamashka

This is a very simple few-line patchset which optionally enables
write-through caching for the OMAP DSS framebuffer. The problem with
the current writecombine cacheability attribute is that it only speeds
up writes. Uncached reads are slow, even though the use of NEON
mitigates this problem a bit.

Traditionally, the xf86-video-fbdev DDX uses a shadow framebuffer in
system memory, which holds a copy of the framebuffer data in order to
provide fast read access to it when needed. Framebuffer reads are not
needed very often, but they are still used for scrolling and moving
windows around in the Xorg server, and users perceive their Linux
desktop as rather sluggish when these operations are not fast enough.

On ARM hardware, the framebuffer is typically physically located in
main memory, and the processors still support the write-through
cacheability attribute. According to the ARM ARM, writes done to
write-through cached memory inside the level of cache are visible to
all observers outside the level of cache without the need for explicit
cache maintenance (the same rule as for non-cached memory). So a
write-through cache is a perfect choice when only the CPU is allowed
to modify the data in the framebuffer and everyone else (the screen
refresh DMA) only reads it, assuming that write-through cached memory
performs well and there are no quirks. As framebuffer reads become
fast, the need for a shadow framebuffer disappears.
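To make the read-traffic argument concrete, here is a minimal
user-space sketch of the scroll operation discussed above (names and
toy dimensions are hypothetical, and plain arrays stand in for the
real mmap'd framebuffer). The point is that scrolling has to read the
surviving rows back from the framebuffer, which is exactly the access
pattern that is slow through an uncached or writecombine mapping:

```c
#include <stdint.h>
#include <string.h>

/* Toy dimensions; a real framebuffer would be e.g. 1280x1024. */
enum { FB_WIDTH = 8, FB_HEIGHT = 8 };

/*
 * Scroll the framebuffer contents up by 'rows' lines and clear the
 * exposed area at the bottom.  memmove() has to *read* the surviving
 * (FB_HEIGHT - rows) rows back from the framebuffer: with a
 * writecombine mapping these reads are uncached and slow (hence the
 * shadow framebuffer in DDX drivers), while with a write-through
 * cached mapping they can be served from the cache.
 */
static void fb_scroll_up(uint32_t *fb, int rows)
{
	size_t keep = (size_t)(FB_HEIGHT - rows) * FB_WIDTH;

	memmove(fb, fb + (size_t)rows * FB_WIDTH, keep * sizeof(*fb));
	memset(fb + keep, 0, (size_t)rows * FB_WIDTH * sizeof(*fb));
}
```

With a shadow framebuffer the same memmove() reads from the cached
shadow copy instead, at the cost of writing every pixel twice.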

At least on ARM11 and Cortex-A8 processors, the performance of the
write-through cache is really good. Cortex-A9 is another story,
because all pages marked as Write-Through are supposedly treated as
Non-Cacheable:
    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388h/CBBFDIJD.html
So OMAP4 is out of luck, but OMAP3 based hardware can get a nice
graphics performance boost. And OMAP3 actually needs it a lot more.

PS. The xf86-video-omapfb-0.1.1 driver does not even use a shadow
    framebuffer (ouch!). So its users, if any, should see an immediate
    speedup.
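For reference, a hypothetical usage sketch of the new parameter (on
real hardware with omapfb built in, this is the kernel command-line
argument omapfb.vram_cache=y; the sysfs path is emulated below with a
temporary directory so the snippet can run anywhere):

```shell
# Enable the option via the kernel command line (omapfb.vram_cache=y)
# or at module load time (modprobe omapfb vram_cache=y); the 0444
# permissions on the parameter let it be read back through sysfs.
# A temporary directory stands in for /sys in this sketch.
sysfs=$(mktemp -d)
mkdir -p "$sysfs/module/omapfb/parameters"
echo Y > "$sysfs/module/omapfb/parameters/vram_cache"

# On a real system: cat /sys/module/omapfb/parameters/vram_cache
cat "$sysfs/module/omapfb/parameters/vram_cache"
```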

Siarhei Siamashka (2):
  ARM: pgtable: add pgprot_writethrough() macro
  OMAPDSS: Optionally enable write-through cache for the framebuffer

 Documentation/arm/OMAP/DSS               |   10 ++++++++++
 arch/arm/include/asm/pgtable.h           |    3 +++
 drivers/video/omap2/omapfb/omapfb-main.c |    7 ++++++-
 3 files changed, 19 insertions(+), 1 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 21+ messages in thread


* [PATCH 1/2] ARM: pgtable: add pgprot_writethrough() macro
  2012-05-22 19:54 ` Siarhei Siamashka
  (?)
@ 2012-05-22 19:54   ` Siarhei Siamashka
  -1 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-22 19:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-omap, linux-fbdev, linux, tomi.valkeinen, Siarhei Siamashka

Needed for remapping pages with the write-through cacheable
attribute. May be useful for framebuffers.

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
---
 arch/arm/include/asm/pgtable.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index f66626d..04297fa 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -103,6 +103,9 @@ extern pgprot_t		pgprot_kernel;
 #define pgprot_stronglyordered(prot) \
 	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED)
 
+#define pgprot_writethrough(prot) \
+	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_WRITETHROUGH)
+
 #ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
 #define pgprot_dmacoherent(prot) \
 	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE | L_PTE_XN)
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread


* [PATCH 2/2] OMAPDSS: Optionally enable write-through cache for the framebuffer
  2012-05-22 19:54 ` Siarhei Siamashka
  (?)
@ 2012-05-22 19:54   ` Siarhei Siamashka
  -1 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-22 19:54 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-omap, linux-fbdev, linux, tomi.valkeinen, Siarhei Siamashka

A write-through cached framebuffer eliminates the need for a shadow
framebuffer in the xf86-video-fbdev DDX. At the very least this
reduces the memory footprint, but performance is also the same or
better when moving windows or scrolling on ARM11 and Cortex-A8
hardware.

Benchmark with xf86-video-fbdev on an IGEPv2 board (TI DM3730, 1GHz),
1280x1024 screen resolution and 32bpp desktop color depth:

$ x11perf -scroll500 -copywinwin500 -copypixpix500 \
          -copypixwin500 -copywinpix500

-- omapfb.vram_cache=n, Option "ShadowFB" "true" in xorg.conf
 10000 trep @   3.4583 msec ( 289.0/sec): Scroll 500x500 pixels
  6000 trep @   4.3255 msec ( 231.0/sec): Copy 500x500 from window to window
  8000 trep @   3.2738 msec ( 305.0/sec): Copy 500x500 from pixmap to window
  8000 trep @   3.1707 msec ( 315.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4761 msec ( 288.0/sec): Copy 500x500 from pixmap to pixmap

-- omapfb.vram_cache=n, Option "ShadowFB" "false" in xorg.conf
  5000 trep @   5.2357 msec ( 191.0/sec): Scroll 500x500 pixels
  1200 trep @  21.0346 msec (  47.5/sec): Copy 500x500 from window to window
  8000 trep @   3.1590 msec ( 317.0/sec): Copy 500x500 from pixmap to window
  6000 trep @   4.5062 msec ( 222.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4767 msec ( 288.0/sec): Copy 500x500 from pixmap to pixmap

-- omapfb.vram_cache=y, Option "ShadowFB" "true" in xorg.conf
 10000 trep @   3.4580 msec ( 289.0/sec): Scroll 500x500 pixels
  6000 trep @   4.3424 msec ( 230.0/sec): Copy 500x500 from window to window
  8000 trep @   3.2673 msec ( 306.0/sec): Copy 500x500 from pixmap to window
  8000 trep @   3.1626 msec ( 316.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4733 msec ( 288.0/sec): Copy 500x500 from pixmap to pixmap

-- omapfb.vram_cache=y, Option "ShadowFB" "false" in xorg.conf
 10000 trep @   3.4893 msec ( 287.0/sec): Scroll 500x500 pixels
  8000 trep @   4.0600 msec ( 246.0/sec): Copy 500x500 from window to window
  8000 trep @   3.1565 msec ( 317.0/sec): Copy 500x500 from pixmap to window
  8000 trep @   3.1373 msec ( 319.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4631 msec ( 289.0/sec): Copy 500x500 from pixmap to pixmap

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
---
 Documentation/arm/OMAP/DSS               |   10 ++++++++++
 drivers/video/omap2/omapfb/omapfb-main.c |    7 ++++++-
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/Documentation/arm/OMAP/DSS b/Documentation/arm/OMAP/DSS
index 888ae7b..4e41a15 100644
--- a/Documentation/arm/OMAP/DSS
+++ b/Documentation/arm/OMAP/DSS
@@ -293,6 +293,16 @@ omapfb.rotate=<angle>
 omapfb.mirror=<y|n>
 	- Default mirror for all framebuffers. Only works with DMA rotation.
 
+omapfb.vram_cache=<y|n>
+	- Sets the framebuffer memory to be write-through cached. This may be
+	  useful in the configurations where only CPU is allowed to write to
+	  the framebuffer and eliminate the need for enabling shadow
+	  framebuffer in Xorg DDX drivers such as xf86-video-fbdev and
+	  xf86-video-omapfb. Enabling write-through cache is only useful
+	  for ARM11 and Cortex-A8 processors. Cortex-A9 does not support
+	  write-through cache well, see "Cortex-A9 behavior for Normal Memory
+	  Cacheable memory regions" section in Cortex-A9 TRM for more details.
+
 omapdss.def_disp=<display>
 	- Name of default display, to which all overlays will be connected.
 	  Common examples are "lcd" or "tv".
diff --git a/drivers/video/omap2/omapfb/omapfb-main.c b/drivers/video/omap2/omapfb/omapfb-main.c
index b00db40..a684920 100644
--- a/drivers/video/omap2/omapfb/omapfb-main.c
+++ b/drivers/video/omap2/omapfb/omapfb-main.c
@@ -46,6 +46,7 @@ static char *def_vram;
 static bool def_vrfb;
 static int def_rotate;
 static bool def_mirror;
+static bool def_vram_cache;
 static bool auto_update;
 static unsigned int auto_update_freq;
 module_param(auto_update, bool, 0);
@@ -1123,7 +1124,10 @@ static int omapfb_mmap(struct fb_info *fbi, struct vm_area_struct *vma)
 
 	vma->vm_pgoff = off >> PAGE_SHIFT;
 	vma->vm_flags |= VM_IO | VM_RESERVED;
-	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+	if (def_vram_cache)
+		vma->vm_page_prot = pgprot_writethrough(vma->vm_page_prot);
+	else
+		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
 	vma->vm_ops = &mmap_user_ops;
 	vma->vm_private_data = rg;
 	if (io_remap_pfn_range(vma, vma->vm_start, off >> PAGE_SHIFT,
@@ -2493,6 +2497,7 @@ module_param_named(vram, def_vram, charp, 0);
 module_param_named(rotate, def_rotate, int, 0);
 module_param_named(vrfb, def_vrfb, bool, 0);
 module_param_named(mirror, def_mirror, bool, 0);
+module_param_named(vram_cache, def_vram_cache, bool, 0444);
 
 /* late_initcall to let panel/ctrl drivers loaded first.
  * I guess better option would be a more dynamic approach,
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 21+ messages in thread


* Re: [PATCH 0/2] OMAPDSS: write-through caching support for omapfb
  2012-05-22 19:54 ` Siarhei Siamashka
  (?)
@ 2012-05-22 20:25   ` Siarhei Siamashka
  -1 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-22 20:25 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-omap, linux-fbdev, linux, tomi.valkeinen, Siarhei Siamashka

On Tue, May 22, 2012 at 10:54 PM, Siarhei Siamashka
<siarhei.siamashka@gmail.com> wrote:
> And at least for ARM11 and Cortex-A8 processors, the performance of
> write-through cache is really good. Cortex-A9 is another story, because
> all pages marked as Write-Through are supposedly treated as Non-Cacheable:
>    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388h/CBBFDIJD.html
> So OMAP4 is out of luck.

I don't have a Pandaboard ES, but I still tried an experiment:
changing the following line in the kernel sources to benchmark
different types of caching for the framebuffer on an Origen board
(Exynos 4210):
    https://github.com/torvalds/linux/blob/v3.4/drivers/media/video/videobuf2-memops.c#L158

It was not a totally clean experiment, because a 500x500 16bpp pixel
buffer is much smaller than the 1MiB L2 cache, so the performance
numbers may be a bit odd. Also, I have not checked whether the same
buffer may be mapped with different cacheability attributes anywhere
else (which would be bad). But it was still interesting to see whether
the write-through cache is of any use and whether it could serve as a
replacement for shadowfb.

Origen board, Exynos 4210, Cortex-A9 at 1.2GHz, 1920x1080 screen
resolution, 16bpp desktop color depth (I have not found an obvious
way to change it to 32bpp yet):

$ x11perf -scroll500 -copywinwin500 -copypixpix500 \
          -copypixwin500 -copywinpix500

-- pgprot_noncached + shadowfb
 100000 trep @  0.2708 msec ( 3690.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7307 msec ( 1370.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5471 msec ( 1830.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5822 msec ( 1720.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6584 msec ( 1520.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writecombine + shadowfb
 100000 trep @  0.2612 msec ( 3830.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7058 msec ( 1420.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5262 msec ( 1900.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5797 msec ( 1730.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6554 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writethrough + shadowfb
 100000 trep @  0.2609 msec ( 3830.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7018 msec ( 1420.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5260 msec ( 1900.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5758 msec ( 1740.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6569 msec ( 1520.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_noncached
   3500 trep @  7.5972 msec (  132.0/sec): Scroll 500x500 pixels
   1800 trep @ 14.7146 msec (   68.0/sec): Copy 500x500 from window to window
   6000 trep @  4.6501 msec (  215.0/sec): Copy 500x500 from pixmap to window
   8000 trep @  3.3500 msec (  299.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6546 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writecombine
  10000 trep @  2.9439 msec (  340.0/sec): Scroll 500x500 pixels
   6000 trep @  5.7246 msec (  175.0/sec): Copy 500x500 from window to window
  60000 trep @  0.4213 msec ( 2370.0/sec): Copy 500x500 from pixmap to window
  12000 trep @  2.2423 msec (  446.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6648 msec ( 1500.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writethrough
  40000 trep @  0.7103 msec ( 1410.0/sec): Scroll 500x500 pixels
  20000 trep @  1.3024 msec (  768.0/sec): Copy 500x500 from window to window
  80000 trep @  0.3933 msec ( 2540.0/sec): Copy 500x500 from pixmap to window
  18000 trep @  1.3967 msec (  716.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6548 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

Without shadowfb, "writecombine" performs better than "noncached",
and "writethrough" is clearly the fastest. Still, even "writethrough"
is no match for shadowfb on Cortex-A9 (unlike ARM11 and Cortex-A8).

So is Cortex-A9 a lost cause? Maybe experimenting with page table
entries and tweaking the inner/outer cacheability attributes could
provide something? At first glance, read performance for
write-through cached memory looks rather bad on Cortex-A9. But there
is still some speedup, so it does not seem to be treated as totally
non-cached. And at least the PL310 L2 cache controller has some
support for "Cacheable write-through, allocate on read":
    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/ch02s03s01.html

-- 
Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/2] OMAPDSS: write-through caching support for omapfb
@ 2012-05-22 20:25   ` Siarhei Siamashka
  0 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-22 20:25 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-omap, linux-fbdev, linux, tomi.valkeinen, Siarhei Siamashka

On Tue, May 22, 2012 at 10:54 PM, Siarhei Siamashka
<siarhei.siamashka@gmail.com> wrote:
> And at least for ARM11 and Cortex-A8 processors, the performance of
> write-through cache is really good. Cortex-A9 is another story, because
> all pages marked as Write-Through are supposedly treated as Non-Cacheable:
>    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388h/CBBFDIJD.html
> So OMAP4 is out of luck.

I don't have a Pandaboard ES, but I still tried an experiment: changing
the following line in the kernel sources to benchmark different types of
caching for the framebuffer on an Origen board (Exynos 4210):
    https://github.com/torvalds/linux/blob/v3.4/drivers/media/video/videobuf2-memops.c#L158

It was not a totally clean experiment, because a 500x500 16bpp pixel
buffer is much smaller than the 1MiB L2 cache, so the performance
numbers may be a bit odd. Also, I have not checked whether the same
buffer is mapped with different cacheability attributes anywhere else
(which would be bad). Still, it was interesting to see whether a
write-through cache is of any use and whether it could serve as a
replacement for shadowfb.

Origen board, Exynos 4210, Cortex-A9 1.2GHz, 1920x1080 screen
resolution, 16bpp desktop color depth (I have not found an obvious way
to change it to 32bpp yet):

$ x11perf -scroll500 -copywinwin500 -copypixpix500 \
          -copypixwin500 -copywinpix500

-- pgprot_noncached + shadowfb
 100000 trep @  0.2708 msec ( 3690.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7307 msec ( 1370.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5471 msec ( 1830.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5822 msec ( 1720.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6584 msec ( 1520.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writecombine + shadowfb
 100000 trep @  0.2612 msec ( 3830.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7058 msec ( 1420.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5262 msec ( 1900.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5797 msec ( 1730.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6554 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writethrough + shadowfb
 100000 trep @  0.2609 msec ( 3830.0/sec): Scroll 500x500 pixels
  40000 trep @  0.7018 msec ( 1420.0/sec): Copy 500x500 from window to window
  60000 trep @  0.5260 msec ( 1900.0/sec): Copy 500x500 from pixmap to window
  60000 trep @  0.5758 msec ( 1740.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6569 msec ( 1520.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_noncached
   3500 trep @  7.5972 msec (  132.0/sec): Scroll 500x500 pixels
   1800 trep @ 14.7146 msec (   68.0/sec): Copy 500x500 from window to window
   6000 trep @  4.6501 msec (  215.0/sec): Copy 500x500 from pixmap to window
   8000 trep @  3.3500 msec (  299.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6546 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writecombine
  10000 trep @  2.9439 msec (  340.0/sec): Scroll 500x500 pixels
   6000 trep @  5.7246 msec (  175.0/sec): Copy 500x500 from window to window
  60000 trep @  0.4213 msec ( 2370.0/sec): Copy 500x500 from pixmap to window
  12000 trep @  2.2423 msec (  446.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6648 msec ( 1500.0/sec): Copy 500x500 from pixmap to pixmap

-- pgprot_writethrough
  40000 trep @  0.7103 msec ( 1410.0/sec): Scroll 500x500 pixels
  20000 trep @  1.3024 msec (  768.0/sec): Copy 500x500 from window to window
  80000 trep @  0.3933 msec ( 2540.0/sec): Copy 500x500 from pixmap to window
  18000 trep @  1.3967 msec (  716.0/sec): Copy 500x500 from window to pixmap
  40000 trep @  0.6548 msec ( 1530.0/sec): Copy 500x500 from pixmap to pixmap

Without shadowfb, "writecombine" performs better than "noncached", and
"writethrough" is clearly the fastest. Still, even "writethrough" is no
match for shadowfb on Cortex-A9 (unlike ARM11 and Cortex-A8).

So is Cortex-A9 a lost cause? Maybe experimenting with page table
entries and tweaking the inner/outer cacheability attributes could
provide something. At first glance, read performance for write-through
cached memory looks rather bad on Cortex-A9, but there is still some
speedup, so it does not seem to be treated as totally non-cached. And
at least the PL310 L2 cache controller has some support for "Cacheable
write-through, allocate on read":
    http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246f/ch02s03s01.html

-- 
Best regards,
Siarhei Siamashka


* Re: [PATCH 0/2] OMAPDSS: write-through caching support for omapfb
  2012-05-22 19:54 ` Siarhei Siamashka
@ 2012-05-24  7:43   ` Tomi Valkeinen
  -1 siblings, 0 replies; 21+ messages in thread
From: Tomi Valkeinen @ 2012-05-24  7:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2012-05-22 at 22:54 +0300, Siarhei Siamashka wrote:
> This is a very simple few-liner patchset, which allows to optionally
> enable write-through caching for OMAP DSS framebuffer. The problem with
> the current writecombine cacheability attribute is that it only speeds
> up writes. Uncached reads are slow, even though the use of NEON mitigates
> this problem a bit.
> 
> Traditionally, xf86-video-fbdev DDX is using shadow framebuffer in the
> system memory. Which contains a copy of the framebuffer data for the
> purpose of providing fast read access to it when needed. Framebuffer
> read access is required not so often, but it still gets used for
> scrolling and moving windows around in Xorg server. And the users
> perceive their linux desktop as rather sluggish when these operations
> are not fast enough.
> 
> In the case of ARM hardware, framebuffer is typically physically
> located in the main memory. And the processors still support
> write-through cacheability attribute. According to ARM ARM, the writes
> done to write-through cached memory inside the level of cache are
> visible to all observers outside the level of cache without the need
> of explicit cache maintenance (same rule as for non-cached memory).
> So write-through cache is a perfect choice when only CPU is allowed
> to modify the data in the framebuffer and everyone else (screen
> refresh DMA) is only reading it. That is, assuming that write-through
> cached memory provides good performance and there are no quirks.
> As the framebuffer reads become fast, the need for shadow framebuffer
> disappears.

I ran my own fb perf test on an OMAP3 Overo board (the "perf" test in
https://gitorious.org/linux-omap-dss2/omapfb-tests):

vram_cache=n:

sequential_horiz_singlepixel_read: 25198080 pix, 4955475 us, 5084897 pix/s
sequential_horiz_singlepixel_write: 434634240 pix, 4081146 us, 106498086 pix/s
sequential_vert_singlepixel_read: 20106240 pix, 4970611 us, 4045023 pix/s
sequential_vert_singlepixel_write: 98572800 pix, 4985748 us, 19770915 pix/s
sequential_line_read: 40734720 pix, 4977906 us, 8183103 pix/s
sequential_line_write: 1058580480 pix, 5024628 us, 210678378 pix/s
nonsequential_singlepixel_write: 17625600 pix, 4992828 us, 3530183 pix/s
nonsequential_singlepixel_read: 9661440 pix, 4952973 us, 1950634 pix/s

vram_cache=y:

sequential_horiz_singlepixel_read: 270389760 pix, 4994154 us, 54141253 pix/s
sequential_horiz_singlepixel_write: 473149440 pix, 3932801 us, 120308512 pix/s
sequential_vert_singlepixel_read: 18147840 pix, 4976226 us, 3646908 pix/s
sequential_vert_singlepixel_write: 100661760 pix, 4993164 us, 20159914 pix/s
sequential_line_read: 285143040 pix, 4917267 us, 57988114 pix/s
sequential_line_write: 876710400 pix, 5012146 us, 174917171 pix/s
nonsequential_singlepixel_write: 17625600 pix, 4977967 us, 3540722 pix/s
nonsequential_singlepixel_read: 9661440 pix, 4944885 us, 1953825 pix/s

These also show quite a bit of improvement in some of the read cases.
Interestingly, some of the write cases are also faster.

Reading pixels vertically is slower with vram_cache. I guess this is
because the cache adds some overhead, and since we always miss it, the
caching is just wasted time.

I would also have presumed the difference in sequential_line_write to
be bigger. Write-through is effectively no-cache for writes, right?

If the user of the fb only writes to it and vram_cache=y, does that
mean the cache is filled with pixel data that is never read, thus
lowering the performance of all other programs?

I have to say I don't know much about CPU caches, but the read speed
improvements are very big, so I think this is definitely an interesting
patch. If you get the first patch accepted, I see no problem with
adding this to omapfb as an optional feature.

However, "vram_cache" is not a very good name for the option.
"vram_writethrough", or something?

Did you test this with VRFB (omap3) or TILER (omap4)? I wonder how those
are affected.

 Tomi




* Re: [PATCH 0/2] OMAPDSS: write-through caching support for omapfb
  2012-05-24  7:43   ` Tomi Valkeinen
@ 2012-05-25  9:00     ` Siarhei Siamashka
  -1 siblings, 0 replies; 21+ messages in thread
From: Siarhei Siamashka @ 2012-05-25  9:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, May 24, 2012 at 10:43 AM, Tomi Valkeinen <tomi.valkeinen@ti.com> wrote:
> On Tue, 2012-05-22 at 22:54 +0300, Siarhei Siamashka wrote:
> I ran my own fb perf test on omap3 overo board ("perf" test in
> https://gitorious.org/linux-omap-dss2/omapfb-tests) :
>
> vram_cache=n:
>
> sequential_horiz_singlepixel_read: 25198080 pix, 4955475 us, 5084897 pix/s
> sequential_horiz_singlepixel_write: 434634240 pix, 4081146 us, 106498086 pix/s
> sequential_vert_singlepixel_read: 20106240 pix, 4970611 us, 4045023 pix/s
> sequential_vert_singlepixel_write: 98572800 pix, 4985748 us, 19770915 pix/s
> sequential_line_read: 40734720 pix, 4977906 us, 8183103 pix/s
> sequential_line_write: 1058580480 pix, 5024628 us, 210678378 pix/s
> nonsequential_singlepixel_write: 17625600 pix, 4992828 us, 3530183 pix/s
> nonsequential_singlepixel_read: 9661440 pix, 4952973 us, 1950634 pix/s
>
> vram_cache=y:
>
> sequential_horiz_singlepixel_read: 270389760 pix, 4994154 us, 54141253 pix/s
> sequential_horiz_singlepixel_write: 473149440 pix, 3932801 us, 120308512 pix/s
> sequential_vert_singlepixel_read: 18147840 pix, 4976226 us, 3646908 pix/s
> sequential_vert_singlepixel_write: 100661760 pix, 4993164 us, 20159914 pix/s
> sequential_line_read: 285143040 pix, 4917267 us, 57988114 pix/s
> sequential_line_write: 876710400 pix, 5012146 us, 174917171 pix/s
> nonsequential_singlepixel_write: 17625600 pix, 4977967 us, 3540722 pix/s
> nonsequential_singlepixel_read: 9661440 pix, 4944885 us, 1953825 pix/s
>
> These also show quite a bit of improvement in some read cases.
> Interestingly some of the write cases are also faster.
>
> Reading pixels vertically is slower with vram_cache. I guess this is
> because the cache causes some overhead, and we always miss the cache so
> the caching is just wasted time.

On the positive side, nobody normally accesses memory this way. It is a
well-known performance anti-pattern.

> I would've also presumed the difference in sequential_line_write would
> be bigger. write-through is effectively no-cache for writes, right?

A write-through cache still uses the write-combining buffer for memory
writes, so I would actually expect the performance to be the same.

> If the user of the fb just writes to the fb and vram_cache=y, it means
> that the cache is filled with pixel data that is never used, thus
> lowering the performance of all other programs?

This is true only for write-allocate. We want the framebuffer to be
cacheable as write-through, allocate on read, no allocate on write.

Sure, when we are reading from the cached framebuffer, some useful
data may be evicted from cache. But if we are not caching the
framebuffer, any readers suffer from a huge performance penalty.
That's the reason why the shadow framebuffer (a poor man's software
workaround) is implemented in the Xorg server. And if we use a shadow
framebuffer (in normal write-back cached memory), we already have the
same or worse cache eviction problems compared to a cached
framebuffer. I could not find any use case or benchmark where the
shadow framebuffer would perform better than a write-through cached
framebuffer on OMAP3 hardware.

Maybe a bit more detail about the shadow framebuffer would be useful.
It sits just one level above the framebuffer in the graphics stack of a
linux desktop, and many performance issues happen exactly at the
boundary between pieces of software that are developed independently.
Knowing how the framebuffer is used by real-world applications (and I
assume the X server is one of them) may provide some insight into how
to improve the framebuffer on the kernel side.

Here the shadow framebuffer is initialized:
    http://cgit.freedesktop.org/xorg/xserver/tree/miext/shadow/shadow.c?id=xorg-server-1.12.1#n136
It also enables the damage extension to track all drawing operations
performed on the shadow buffer, in order to copy the updated areas to
the real framebuffer from time to time. The update itself happens here:
    http://cgit.freedesktop.org/xorg/xserver/tree/miext/shadow/shpacked.c?id=xorg-server-1.12.1#n43
With the damage extension active, we go through damageCopyArea:
    http://cgit.freedesktop.org/xorg/xserver/tree/miext/damage/damage.c?id=xorg-server-1.12.1#n801
before reaching fbCopyArea:
    http://cgit.freedesktop.org/xorg/xserver/tree/fb/fbcopy.c?id=xorg-server-1.12.1#n265

While running x11perf test and doing copying/scrolling tests, the
function calls inside of X server look more or less like this:
    62. [00:29:27.779] fbCopyArea()
    63. [00:29:27.779] DamageReportDamage()
    64. [00:29:27.783] fbCopyArea()
    65. [00:29:27.783] DamageReportDamage()
    66. [00:29:27.787] fbCopyArea()
    67. [00:29:27.792] shadowUpdatePacked()
    68. [00:29:27.792] DamageReportDamage()
    69. [00:29:27.795] fbCopyArea()
    70. [00:29:27.795] DamageReportDamage()
    71. [00:29:27.799] fbCopyArea()
    72. [00:29:27.800] DamageReportDamage()
    73. [00:29:27.803] fbCopyArea()
    74. [00:29:27.803] DamageReportDamage()
    75. [00:29:27.807] fbCopyArea()
    76. [00:29:27.807] DamageReportDamage()
    77. [00:29:27.811] fbCopyArea()
    78. [00:29:27.811] DamageReportDamage()
    79. [00:29:27.815] fbCopyArea()
    80. [00:29:27.820] shadowUpdatePacked()

As can be seen in the log above, shadowUpdatePacked() is called much
less frequently than fbCopyArea(), which means that the shadow
framebuffer is cheating a bit: it accumulates damage and updates the
real framebuffer much less often.

The write-through cached framebuffer beats the shadow framebuffer in every way:
- It needs less RAM
- No need for damage tracking overhead (important for small drawing operations)
- No screen updates are skipped, which means smoother animation

> I have to say I don't know much of the cpu caches, but the read speed
> improvements are very big, so I think this is definitely interesting
> patch.

Yes, it definitely provides a significant performance improvement for
software rendering in the Xorg server. But there is unfortunately no
free lunch. Using a write-through cache means that if anything other
than the CPU writes to the framebuffer, the CPU cache must be
invalidated for that area. This means it makes sense to review how the
SGX integration is done and fix it if needed. I tried some tests with
X11WSEGL (a binary blob provided in the GFX SDK for X11 integration)
and it did not seem to exhibit any obvious screen corruption issues,
but without the sources it's hard to say for sure. I would assume the
ideal setup for OMAP3 would be to use the GFX plane exclusively for
CPU-rendered 2D graphics, and render SGX 3D graphics to one of the VID
planes. In that case DISPC can do the compositing, and CPU cache
invalidate operations become unnecessary.

There are also the DSP, ISP and maybe some other hardware blocks, but
these can be handled on a case-by-case basis. My primary interest is a
little personal hobby project: getting a linux desktop running with
acceptable performance.

> So if you get the first patch accepted I see no problem with
> adding this to omapfb as an optional feature.

Yes, a review from ARM memory management subsystem experts is definitely needed.

> However, "vram_cache" is not a very good name for the option.
> "vram_writethrough", or something?

Still, having "cache" in the name would be useful, just to imply that
there might be coherency issues to consider. By the way, vesafb
actually uses numeric codes for the different types of caching in its
mtrr:n option:
    https://github.com/torvalds/linux/blob/v3.4/Documentation/fb/vesafb.txt#L147

Write-through caching is bad on Cortex-A9, as promised by the TRM and
confirmed by tests. Still, for Cortex-A9 it *might* be interesting to
experiment with enabling write-back caching for the framebuffer, but
instead of using a shadow framebuffer just do CPU cache flushes based
on the same damage tracking borrowed from the shadow framebuffer code.
This might even be somehow related to the "manual" update mode :)
Except that we can't change the framebuffer caching attributes at runtime.

> Did you test this with VRFB (omap3) or TILER (omap4)? I wonder how those
> are affected.

That's a good point. I tried VRFB on a 1 GHz OMAP3 and got the
following results with x11perf (shadow framebuffer disabled in
xf86-video-fbdev):

------------------ rotate 90, write-through cached
  3500 trep @   8.0242 msec ( 125.0/sec): Scroll 500x500 pixels
  4000 trep @   8.7027 msec ( 115.0/sec): Copy 500x500 from window to window
  6000 trep @   5.4885 msec ( 182.0/sec): Copy 500x500 from pixmap to window
  6000 trep @   4.9806 msec ( 201.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4231 msec ( 292.0/sec): Copy 500x500 from pixmap to pixmap

------------------ rotate 90,  non-cached writecombining
  3000 trep @   8.9732 msec ( 111.0/sec): Scroll 500x500 pixels
  1000 trep @  26.3218 msec (  38.0/sec): Copy 500x500 from window to window
  6000 trep @   5.5002 msec ( 182.0/sec): Copy 500x500 from pixmap to window
  6000 trep @   6.2368 msec ( 160.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4219 msec ( 292.0/sec): Copy 500x500 from pixmap to pixmap

------------------ rotate 180, write-through cached
 10000 trep @   3.5219 msec ( 284.0/sec): Scroll 500x500 pixels
  6000 trep @   4.8829 msec ( 205.0/sec): Copy 500x500 from window to window
  8000 trep @   3.4772 msec ( 288.0/sec): Copy 500x500 from pixmap to window
  8000 trep @   3.2554 msec ( 307.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4196 msec ( 292.0/sec): Copy 500x500 from pixmap to pixmap

------------------ rotate 180, non-cached writecombining
 10000 trep @   4.5777 msec ( 218.0/sec): Scroll 500x500 pixels
  1000 trep @  24.9100 msec (  40.1/sec): Copy 500x500 from window to window
  8000 trep @   3.4763 msec ( 288.0/sec): Copy 500x500 from pixmap to window
  6000 trep @   4.8676 msec ( 205.0/sec): Copy 500x500 from window to pixmap
  8000 trep @   3.4205 msec ( 292.0/sec): Copy 500x500 from pixmap to pixmap

The 90-degree rotation significantly reduces performance. The shadow
framebuffer is "faster" because it skips some work, but write-through
caching is at least no worse than the default writecombine.

Regarding OMAP4: I only have an old pre-production Pandaboard EA1,
which runs its memory at half speed, so it is useless for any
benchmarks.

-- 
Best regards,
Siarhei Siamashka

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] ARM: pgtable: add pgprot_writethrough() macro
  2012-05-22 19:54   ` Siarhei Siamashka
  (?)
@ 2012-08-08 11:07     ` Grazvydas Ignotas
  -1 siblings, 0 replies; 21+ messages in thread
From: Grazvydas Ignotas @ 2012-08-08 11:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, May 22, 2012 at 10:54 PM, Siarhei Siamashka
<siarhei.siamashka@gmail.com> wrote:
> Needed for remapping pages with write-through cacheable
> attribute. May be useful for framebuffers.

With this series applied, some of my framebuffer programs get over 70%
performance improvement. Could we get this patch in? The OMAP DSS
maintainer has agreed to take the second patch if this one can be merged.

> Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
> ---
>  arch/arm/include/asm/pgtable.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index f66626d..04297fa 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -103,6 +103,9 @@ extern pgprot_t             pgprot_kernel;
>  #define pgprot_stronglyordered(prot) \
>         __pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED)
>
> +#define pgprot_writethrough(prot) \
> +       __pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_WRITETHROUGH)
> +
>  #ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
>  #define pgprot_dmacoherent(prot) \
>         __pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE | L_PTE_XN)
> --
> 1.7.3.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-omap" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Gražvydas

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-08-08 11:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-22 19:54 [PATCH 0/2] OMAPDSS: write-through caching support for omapfb Siarhei Siamashka
2012-05-22 19:54 ` Siarhei Siamashka
2012-05-22 19:54 ` Siarhei Siamashka
2012-05-22 19:54 ` [PATCH 1/2] ARM: pgtable: add pgprot_writethrough() macro Siarhei Siamashka
2012-05-22 19:54   ` Siarhei Siamashka
2012-05-22 19:54   ` Siarhei Siamashka
2012-08-08 11:07   ` Grazvydas Ignotas
2012-08-08 11:07     ` Grazvydas Ignotas
2012-08-08 11:07     ` Grazvydas Ignotas
2012-05-22 19:54 ` [PATCH 2/2] OMAPDSS: Optionally enable write-through cache for the framebuffer Siarhei Siamashka
2012-05-22 19:54   ` Siarhei Siamashka
2012-05-22 19:54   ` Siarhei Siamashka
2012-05-22 20:25 ` [PATCH 0/2] OMAPDSS: write-through caching support for omapfb Siarhei Siamashka
2012-05-22 20:25   ` Siarhei Siamashka
2012-05-22 20:25   ` Siarhei Siamashka
2012-05-24  7:43 ` Tomi Valkeinen
2012-05-24  7:43   ` Tomi Valkeinen
2012-05-24  7:43   ` Tomi Valkeinen
2012-05-25  9:00   ` Siarhei Siamashka
2012-05-25  9:00     ` Siarhei Siamashka
2012-05-25  9:00     ` Siarhei Siamashka
