linux-arm-kernel.lists.infradead.org archive mirror
* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
@ 2010-09-06 10:02 Wolfgang Wegner
  2010-09-06 13:46 ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
  2010-09-06 14:03 ` Russell King - ARM Linux
  0 siblings, 2 replies; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-06 10:02 UTC (permalink / raw)
  To: linux-arm-kernel

Hi list,

I am trying to improve performance of a very basic framebuffer
device connected to a Marvell Kirkwood MV88F6281 via a PCIe->
PCI bridge (88SB2211). The kernel I am using is 2.6.32.

Mapping the PCI memory space via mmap() resulted in a
disappointing ~6.5 MBytes/second. I tried to modify the page
protection to pgprot_writecombine or pgprot_cached, but while
this did reproducibly change performance, it was only in
the sub-percent range. I am not sure if I understand
correctly how other framebuffers handle this, but it seems
the "raw" mmapped write performance is either not cared about
too much or simply not that bad with most x86 chipsets?
However, the idea left over after some trying and looking
around is to use the DMA engine to speed up the write() (and
also read(), but this is less important) system calls
instead of using mmap().

Looking around for example code on how to set up the DMA engine
to perform transfers from user buffers, I found this Kconfig entry
that seems to describe exactly the feature I am looking for:
http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/linux-2.6.15/arch/arm/mach-mv88fxx81/LSP/Kconfig
(config MV_DMA_COPYUSER
	bool "Support DMA copy_to_user() and copy_from_user"
	depends on (ARCH_MV88f5181) && EXPERIMENTAL)

However, I could not find any patch or similar code showing how
this is implemented. So here are my questions:

- Is this feature available as an unofficial patch somewhere?
- Is the idea of directly setting up a transfer from user pages
  to PCI memory space possible at all?
- Why am I the only one who wants such a thing? ;-)

In case I end up coding this myself, I was thinking about something
like this for write():
- get list of page[s], first page offset, last page transfer size
  from user buffer + size
- set up DMA engine to transfer list of [partial] pages
- when done, return from write

Sounds easy, but I am still puzzled by all the different types
of memory involved here, and - what worries me more - I would
expect there to be many devices/drivers using such a thing, but
I have not found them yet.
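
For reference, those three steps might translate into a rough,
untested kernel sketch like this (the mydev_* names and the
mydev_dma_dev pointer are invented for illustration; only the
kernel calls are the real 2.6.32-era APIs, and error handling is
kept to a minimum):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

extern struct device *mydev_dma_dev; /* hypothetical */

static ssize_t mydev_write(struct file *file, const char __user *buf,
                           size_t count, loff_t *ppos)
{
  unsigned long uaddr = (unsigned long)buf;
  unsigned long offset = uaddr & ~PAGE_MASK;
  int nr_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
  size_t total = count;
  struct page **pages;
  struct scatterlist *sg;
  int i, got = 0, mapped;
  ssize_t ret = -ENOMEM;

  pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
  sg = kcalloc(nr_pages, sizeof(*sg), GFP_KERNEL);
  if (!pages || !sg)
    goto out;

  /* step 1: look up and pin the user pages */
  down_read(&current->mm->mmap_sem);
  got = get_user_pages(current, current->mm, uaddr & PAGE_MASK,
                       nr_pages, 0 /* pages are only read */, 0,
                       pages, NULL);
  up_read(&current->mm->mmap_sem);
  ret = -EFAULT;
  if (got != nr_pages)
    goto out;

  /* step 2: build the [partial] page list - the first page may
   * start at an offset, the last one may be short */
  sg_init_table(sg, nr_pages);
  for (i = 0; i < nr_pages; i++) {
    unsigned int off = i ? 0 : offset;
    unsigned int len = min_t(size_t, count, PAGE_SIZE - off);

    sg_set_page(&sg[i], pages[i], len, off);
    count -= len;
  }

  /* step 3: map for DMA and hand over to the engine */
  mapped = dma_map_sg(mydev_dma_dev, sg, nr_pages, DMA_TO_DEVICE);
  /* ...program the DMA engine with the 'mapped' entries here and
   * wait for its completion interrupt before returning... */
  dma_unmap_sg(mydev_dma_dev, sg, nr_pages, DMA_TO_DEVICE);
  ret = total;
out:
  for (i = 0; i < got; i++)
    page_cache_release(pages[i]);
  kfree(sg);
  kfree(pages);
  return ret;
}

Everything up to dma_map_sg() is generic; only the step that
actually programs the engine with the mapped list would be
device-specific.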

Any hints are greatly appreciated!

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 10:02 Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? Wolfgang Wegner
@ 2010-09-06 13:46 ` saeed bishara
  2010-09-06 13:58   ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? Wolfgang Wegner
  2010-09-06 14:03 ` Russell King - ARM Linux
  1 sibling, 1 reply; 16+ messages in thread
From: saeed bishara @ 2010-09-06 13:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 6, 2010 at 1:02 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> Hi list,
>
> I am trying to improve performance of a very basic framebuffer
> device connected to a Marvell Kirkwood MV88F6281 via a PCIe->
> PCI bridge (88SB2211). The kernel I am using is 2.6.32.
>
> Mapping the PCI memory space via mmap() resulted in a
> disappointing ~6.5 MBytes/second. I tried to modify the page
> protection to pgprot_writecombine or pgprot_cached, but while
> this did reproducibly change performance, it was only in
> the sub-percent range.
Weird; marking those pages as bufferable (non-cacheable) should boost
the throughput.
You may also try setting the U-Boot variable pcieTune to yes; make sure
to save the environment variables, then reboot the system.
> I am not sure if I understand
> correctly how other framebuffers handle this, but it seems
> the "raw" mmapped write performance is either not cared about
> too much or simply not that bad with most x86 chipsets?
> However, the idea left over after some trying and looking
> around is to use the DMA engine to speed up the write() (and
> also read(), but this is less important) system calls
> instead of using mmap().
>
> Looking around for example code on how to set up the DMA engine
> to perform transfers from user buffers, I found this Kconfig entry
> that seems to describe exactly the feature I am looking for:
> http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/linux-2.6.15/arch/arm/mach-mv88fxx81/LSP/Kconfig
> (config MV_DMA_COPYUSER
> 	bool "Support DMA copy_to_user() and copy_from_user"
> 	depends on (ARCH_MV88f5181) && EXPERIMENTAL)
>
> However, I could not find any patch or similar code showing how
> this is implemented. So here are my questions:
>
> - Is this feature available as an unofficial patch somewhere?
Yes, but the mainline kernel has drivers/dma/mv_xor, which implements
the DMA engine interface; you may use that driver for the DMA offloading.
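
For illustration only (this is not Marvell's original patch): with
the DMA engine framework you would request a memcpy-capable channel,
which on Kirkwood is backed by mv_xor, and feed it descriptors. The
mydev_* names below are invented, the calls are the 2.6.32-era
dmaengine API, and whether the engine can target the PCI aperture
at all is exactly the open question:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

static struct dma_chan *mydev_chan; /* hypothetical */

static int mydev_dma_init(void)
{
  dma_cap_mask_t mask;

  dma_cap_zero(mask);
  dma_cap_set(DMA_MEMCPY, mask);
  mydev_chan = dma_request_channel(mask, NULL, NULL);
  return mydev_chan ? 0 : -ENODEV;
}

/* offload one copy between two already-DMA-mapped addresses */
static int mydev_dma_copy(dma_addr_t dst, dma_addr_t src, size_t len)
{
  struct dma_device *dev = mydev_chan->device;
  struct dma_async_tx_descriptor *tx;
  dma_cookie_t cookie;

  tx = dev->device_prep_dma_memcpy(mydev_chan, dst, src, len, 0);
  if (!tx)
    return -ENOMEM;
  cookie = tx->tx_submit(tx);
  dma_async_issue_pending(mydev_chan);
  /* polled for simplicity; a real driver would use a callback */
  while (dma_async_is_tx_complete(mydev_chan, cookie, NULL, NULL)
         != DMA_SUCCESS)
    cpu_relax();
  return 0;
}
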
> - Is the idea of directly setting up a transfer from user pages
>   to PCI memory space possible at all?
You need to do some hacking for that, but I'm almost sure it
won't help; using the CPU should be enough.
> - Why am I the only one who wants such a thing? ;-)
>
> In case I end up coding this myself, I was thinking about something
> like this for write():
> - get list of page[s], first page offset, last page transfer size
>   from user buffer + size
> - set up DMA engine to transfer list of [partial] pages
> - when done, return from write
>
> Sounds easy, but I am still puzzled by all the different types
> of memory involved here, and - what worries me more - I would
> expect there to be many devices/drivers using such a thing, but
> I have not found them yet.
>
> Any hints are greatly appreciated!
>
> Regards,
> Wolfgang
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 13:46 ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
@ 2010-09-06 13:58   ` Wolfgang Wegner
  0 siblings, 0 replies; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-06 13:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 06, 2010 at 04:46:07PM +0300, saeed bishara wrote:
> On Mon, Sep 6, 2010 at 1:02 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> >
> > Mapping the PCI memory space via mmap() resulted in a
> > disappointing ~6.5 MBytes/second. I tried to modify the page
> > protection to pgprot_writecombine or pgprot_cached, but while
> > this did reproducibly change performance, it was only in
> > the sub-percent range.
> Weird; marking those pages as bufferable (non-cacheable) should boost
> the throughput.

Maybe I have some misunderstanding here. My idea was that
I have to enable some sort of caching to get the writes combined
into a burst, because otherwise every single write causes a
separate PCI transaction.
In my naive understanding, writecombine was something like a
write-only, read-transparent cache. Is there a thorough explanation
of these attributes somewhere?
Anyway, none of pgprot_noncached, pgprot_cached or
pgprot_writecombine made a significant difference.

> You may also try setting the U-Boot variable pcieTune to yes; make sure
> to save the environment variables, then reboot the system.

I set "setenv pcieTune yes" and saved the environment, but it
did not change anything. Who would be responsible for caring
about this variable, U-Boot itself or the kernel? I am asking
because the variable was not present at all in my environment
until now.

[...]
> > - Is this feature available as an unofficial patch somewhere?
> Yes, but the mainline kernel has drivers/dma/mv_xor, which implements
> the DMA engine interface; you may use that driver for the DMA offloading.

Yes, I found this interface, but I was hoping there would be some
more framework that already does the copy_{to,from}_user handling
for me, so I would not have to fiddle around with the user pages
myself. But while implementing it I saw it is probably too
device-dependent for such a generic facility to exist.

> > - Is the idea of directly setting up a transfer from user pages
> >   to PCI memory space possible at all?
> You need to do some hacking for that, but I'm almost sure it
> won't help; using the CPU should be enough.

Sounds bad. :-(

I will see if I can get my hacking to work, just to hopefully
prove you wrong here. ;-) Well, obviously the point is not to
prove somebody else wrong but to somehow get a reasonable speed...

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 10:02 Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? Wolfgang Wegner
  2010-09-06 13:46 ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
@ 2010-09-06 14:03 ` Russell King - ARM Linux
  2010-09-06 14:11   ` Wolfgang Wegner
  2010-09-06 14:14   ` Wolfgang Wegner
  1 sibling, 2 replies; 16+ messages in thread
From: Russell King - ARM Linux @ 2010-09-06 14:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
> Mapping the PCI memory space via mmap() resulted in a
> disappointing ~6.5 MBytes/second. I tried to modify the page
> protection to pgprot_writecombine or pgprot_cached, but while
> this did reproducibly change performance, it was only in
> the sub-percent range. I am not sure if I understand
> correctly how other framebuffers handle this, but it seems
> the "raw" mmapped write performance is either not cared about
> too much or simply not that bad with most x86 chipsets?
> However, the idea left over after some trying and looking
> around is to use the DMA engine to speed up the write() (and
> also read(), but this is less important) system calls
> instead of using mmap().

Framebuffer applications such as Xorg/Qt do not use read/write calls
to access their buffers because that would be painfully slow.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 14:03 ` Russell King - ARM Linux
@ 2010-09-06 14:11   ` Wolfgang Wegner
  2010-09-06 14:14   ` Wolfgang Wegner
  1 sibling, 0 replies; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-06 14:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
> > Mapping the PCI memory space via mmap() resulted in a
> > disappointing ~6.5 MBytes/second. I tried to modify the page
> > protection to pgprot_writecombine or pgprot_cached, but while
> > this did reproducibly change performance, it was only in
> > the sub-percent range. I am not sure if I understand
> > correctly how other framebuffers handle this, but it seems
> > the "raw" mmapped write performance is either not cared about
> > too much or simply not that bad with most x86 chipsets?
> > However, the idea left over after some trying and looking
> > around is to use the DMA engine to speed up the write() (and
> > also read(), but this is less important) system calls
> > instead of using mmap().
> 
> Framebuffer applications such as Xorg/Qt do not use read/write calls
> to access their buffers because that would be painfully slow.

That was my impression, too. However, I have not figured out
how they actually speed up write access for mmapped
operation - although in many places the speed/latency of
PCI transactions is complained about.
Some really simple frame buffers have a shadow buffer in system
RAM to avoid read access for framebuffer-internal operations
like bitblt etc., but this does not help for the mmapped case.
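
For illustration, that shadow-buffer technique boils down to
something like this (a sketch only - the myfb names are invented,
not taken from any real driver): drawing touches only the RAM copy,
and just the dirty rectangle is pushed out over PCI.

#include <linux/io.h>

struct myfb { /* hypothetical */
  u32 *shadow;          /* system RAM copy of the frame */
  void __iomem *mmio;   /* slow PCI aperture */
  unsigned int stride;  /* pixels per line */
};

static void myfb_fillrect(struct myfb *fb, int x, int y, int w, int h,
                          u32 color)
{
  int row, col;

  /* draw into fast system RAM - never read the PCI aperture */
  for (row = y; row < y + h; row++)
    for (col = x; col < x + w; col++)
      fb->shadow[row * fb->stride + col] = color;

  /* push only the touched rectangle out to the device */
  for (row = y; row < y + h; row++)
    memcpy_toio(fb->mmio + (row * fb->stride + x) * 4,
                &fb->shadow[row * fb->stride + x], w * 4);
}

A client that mmap()s the device directly bypasses this path
entirely, which is why it does not help in my case.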

Finding my way through the framebuffer jungle is not the
easiest thing, so maybe I am missing something really
obvious here - or is it simply a configuration problem on
my board, and the write performance when using mmap is
good when configured properly?

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 14:03 ` Russell King - ARM Linux
  2010-09-06 14:11   ` Wolfgang Wegner
@ 2010-09-06 14:14   ` Wolfgang Wegner
  2010-09-07  7:58     ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
  1 sibling, 1 reply; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-06 14:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
> > Mapping the PCI memory space via mmap() resulted in a
> > disappointing ~6.5 MBytes/second. I tried to modify the page
> > protection to pgprot_writecombine or pgprot_cached, but while
> > this did reproducibly change performance, it was only in
> > the sub-percent range. I am not sure if I understand
> > correctly how other framebuffers handle this, but it seems
> > the "raw" mmapped write performance is either not cared about
> > too much or simply not that bad with most x86 chipsets?
> > However, the idea left over after some trying and looking
> > around is to use the DMA engine to speed up the write() (and
> > also read(), but this is less important) system calls
> > instead of using mmap().
> 
> Framebuffer applications such as Xorg/Qt do not use read/write calls
> to access their buffers because that would be painfully slow.

BTW, the throughput I get with a "dd if=bitmap of=/dev/fb0 bs=512"
is the same as I get from my test application writing longwords
sequentially to the mmapped frame buffer.

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-06 14:14   ` Wolfgang Wegner
@ 2010-09-07  7:58     ` saeed bishara
  2010-09-07  9:52       ` saeed bishara
  0 siblings, 1 reply; 16+ messages in thread
From: saeed bishara @ 2010-09-07  7:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 6, 2010 at 5:14 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
>> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
>> > Mapping the PCI memory space via mmap() resulted in a
>> > disappointing ~6.5 MBytes/second. I tried to modify the page
>> > protection to pgprot_writecombine or pgprot_cached, but while
>> > this did reproducibly change performance, it was only in
>> > the sub-percent range. I am not sure if I understand
>> > correctly how other framebuffers handle this, but it seems
>> > the "raw" mmapped write performance is either not cared about
>> > too much or simply not that bad with most x86 chipsets?
>> > However, the idea left over after some trying and looking
>> > around is to use the DMA engine to speed up the write() (and
>> > also read(), but this is less important) system calls
>> > instead of using mmap().
>>
>> Framebuffer applications such as Xorg/Qt do not use read/write calls
>> to access their buffers because that would be painfully slow.
>
> BTW, the throughput I get with a "dd if=bitmap of=/dev/fb0 bs=512"
> is the same as I get from my test application writing longwords
> sequentially to the mmapped frame buffer.
I'm not sure writecombine is enabled properly - can you test that on DRAM?
You can do that by reserving some memory (mem=<dram size - 8M>), then
test the throughput with and without writecombine.
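
A minimal sketch of that experiment (the physical address and all
names are assumptions, not taken from a real driver): boot with the
reduced mem= line, then expose the now-unused top of DRAM through a
trivial mmap whose protection is selected by a module parameter:

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>

static int wc_mode; /* 0 = cached, 1 = writecombine, 2 = uncached */
module_param(wc_mode, int, 0);

#define RESERVED_PHYS 0x07800000UL /* assumed: top 8MB of 128MB */
#define RESERVED_SIZE 0x00800000UL

static int testmem_mmap(struct file *file, struct vm_area_struct *vma)
{
  size_t size = vma->vm_end - vma->vm_start;

  if (size > RESERVED_SIZE)
    return -EINVAL;

  if (wc_mode == 1)
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
  else if (wc_mode == 2)
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
  /* wc_mode == 0: keep the default (cacheable) protection */

  return remap_pfn_range(vma, vma->vm_start,
                         RESERVED_PHYS >> PAGE_SHIFT,
                         size, vma->vm_page_prot);
}

Timing the same fill loop against this mapping, once per mode,
should show whether writecombine takes effect on DRAM at all.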

saeed

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-07  7:58     ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
@ 2010-09-07  9:52       ` saeed bishara
  2010-09-07 16:11         ` Wolfgang Wegner
  2010-09-07 18:38         ` Nicolas Pitre
  0 siblings, 2 replies; 16+ messages in thread
From: saeed bishara @ 2010-09-07  9:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 7, 2010 at 10:58 AM, saeed bishara <saeed.bishara@gmail.com> wrote:
> On Mon, Sep 6, 2010 at 5:14 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
>> On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
>>> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
>>> > Mapping the PCI memory space via mmap() resulted in a
>>> > disappointing ~6.5 MBytes/second. I tried to modify the page
>>> > protection to pgprot_writecombine or pgprot_cached, but while
>>> > this did reproducibly change performance, it was only in
>>> > the sub-percent range. I am not sure if I understand
>>> > correctly how other framebuffers handle this, but it seems
>>> > the "raw" mmapped write performance is either not cared about
>>> > too much or simply not that bad with most x86 chipsets?
>>> > However, the idea left over after some trying and looking
>>> > around is to use the DMA engine to speed up the write() (and
>>> > also read(), but this is less important) system calls
>>> > instead of using mmap().
>>>
>>> Framebuffer applications such as Xorg/Qt do not use read/write calls
>>> to access their buffers because that would be painfully slow.
>>
>> BTW, the throughput I get with a "dd if=bitmap of=/dev/fb0 bs=512"
>> is the same as I get from my test application writing longwords
>> sequentially to the mmapped frame buffer.
> I'm not sure writecombine is enabled properly - can you test that on DRAM?
> You can do that by reserving some memory (mem=<dram size - 8M>), then
> test the throughput with and without writecombine.
>
Also, in order to send bursts, make sure that the stm instruction is
used, preferably with 8 registers and the address aligned to 8*4 bytes.
saeed
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-07  9:52       ` saeed bishara
@ 2010-09-07 16:11         ` Wolfgang Wegner
  2010-09-07 19:14           ` Nicolas Pitre
  2010-09-07 18:38         ` Nicolas Pitre
  1 sibling, 1 reply; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-07 16:11 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Saeed,

On Tue, Sep 07, 2010 at 12:52:39PM +0300, saeed bishara wrote:
> On Tue, Sep 7, 2010 at 10:58 AM, saeed bishara <saeed.bishara@gmail.com> wrote:
> > On Mon, Sep 6, 2010 at 5:14 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> >> On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
> >>> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
> >>> > Mapping the PCI memory space via mmap() resulted in a
> >>> > disappointing ~6.5 MBytes/second. I tried to modify the page
> >>> > protection to pgprot_writecombine or pgprot_cached, but while
> >>> > this did reproducibly change performance, it was only in
> >>> > the sub-percent range. I am not sure if I understand
> >>> > correctly how other framebuffers handle this, but it seems
> >>> > the "raw" mmapped write performance is either not cared about
> >>> > too much or simply not that bad with most x86 chipsets?
> >>> > However, the idea left over after some trying and looking
> >>> > around is to use the DMA engine to speed up the write() (and
> >>> > also read(), but this is less important) system calls
> >>> > instead of using mmap().
> >>>
> >>> Framebuffer applications such as Xorg/Qt do not use read/write calls
> >>> to access their buffers because that would be painfully slow.
> >>
> >> BTW, the throughput I get with a "dd if=bitmap of=/dev/fb0 bs=512"
> >> is the same as I get from my test application writing longwords
> >> sequentially to the mmapped frame buffer.
> > I'm not sure writecombine is enabled properly - can you test that on DRAM?
> > You can do that by reserving some memory (mem=<dram size - 8M>), then
> > test the throughput with and without writecombine.
> >
> Also, in order to send bursts, make sure that the stm instruction is
> used, preferably with 8 registers and the address aligned to 8*4 bytes.

Thanks for your hints and patience!

I am not sure if I did things correctly. I set up an 8MB mapping
from system RAM as you proposed, and am getting a 240 MBytes/second
write data rate with a simple test application that fills the whole
region (mapped through my driver) with 0x12345678.

In the driver, I used the following combinations for ioremap and
pgprot modification during remap_pfn_range:
ioremap_wc + pgprot_writecombine(vma->vm_page_prot)
ioremap_cached + <no modification of vma->vm_page_prot>
ioremap_nocache + pgprot_noncached(vma->vm_page_prot)

In contrast to my previous test, I could not see even a minimal
reproducible difference between any of the variants; the absolute
values reported by "time" varied only between 0.033s and 0.035s
(around 1.299s for writing to my PCI device's memory in the same
test).

However, I am not getting the writes to use the stm instruction,
so maybe this is the real limitation.
Here is my very basic test program:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define MEMSIZE 0x800000

int main() {
  int fbfd;
  unsigned long *fbp, *sfbp;
  unsigned long i;
  unsigned long fill_val = 0x12345678;

  fbfd = open("/dev/fb0", O_RDWR);
  if (fbfd < 0) { /* open() returns -1 on failure */
    printf("Error: cannot open framebuffer device.\n");
    exit(1);
  }
  fbp = (unsigned long *)mmap(0, MEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                              fbfd, 0);
  sfbp = fbp;
  if (fbp == MAP_FAILED) { /* mmap() returns MAP_FAILED on error */
    printf("Error: cannot mmap framebuffer\n");
    exit(1);
  }
#if 0
  for (i = 0; i < (MEMSIZE / 4); i += 8) {
    *(fbp + i) = fill_val;
    *(fbp + i + 1) = fill_val;
    *(fbp + i + 2) = fill_val;
    *(fbp + i + 3) = fill_val;
    *(fbp + i + 4) = fill_val;
    *(fbp + i + 5) = fill_val;
    *(fbp + i + 6) = fill_val;
    *(fbp + i + 7) = fill_val;
  }
#else
  for (i = MEMSIZE/32; i; i--) {
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
    *(fbp++) = fill_val;
  }
#endif
  munmap(sfbp, MEMSIZE);
  close(fbfd);
  return 0;
}

Neither of the cases results in stm being used; all of them use
str.
(I am not so deep into assembler, let alone ARM assembler,
so please bear with my ignorance about what stm is or does
for now...)

I compiled with CodeSourcery arm-2010q1 toolchain with these
settings:
arm-none-linux-gnueabi-gcc -O3  -DARM -std=gnu99 -fgnu89-inline -Wall -Wno-format -pedantic

Could you provide any hint as to what I could do to get the
compiler to use stm?

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-07  9:52       ` saeed bishara
  2010-09-07 16:11         ` Wolfgang Wegner
@ 2010-09-07 18:38         ` Nicolas Pitre
  1 sibling, 0 replies; 16+ messages in thread
From: Nicolas Pitre @ 2010-09-07 18:38 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 7 Sep 2010, saeed bishara wrote:

> On Tue, Sep 7, 2010 at 10:58 AM, saeed bishara <saeed.bishara@gmail.com> wrote:
> > On Mon, Sep 6, 2010 at 5:14 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> >> On Mon, Sep 06, 2010 at 03:03:47PM +0100, Russell King - ARM Linux wrote:
> >>> On Mon, Sep 06, 2010 at 12:02:44PM +0200, Wolfgang Wegner wrote:
> >>> > Mapping the PCI memory space via mmap() resulted in a
> >>> > disappointing ~6.5 MBytes/second. I tried to modify the page
> >>> > protection to pgprot_writecombine or pgprot_cached, but while
> >>> > this did reproducibly change performance, it was only in
> >>> > the sub-percent range. I am not sure if I understand
> >>> > correctly how other framebuffers handle this, but it seems
> >>> > the "raw" mmapped write performance is either not cared about
> >>> > too much or simply not that bad with most x86 chipsets?
> >>> > However, the idea left over after some trying and looking
> >>> > around is to use the DMA engine to speed up the write() (and
> >>> > also read(), but this is less important) system calls
> >>> > instead of using mmap().
> >>>
> >>> Framebuffer applications such as Xorg/Qt do not use read/write calls
> >>> to access their buffers because that would be painfully slow.
> >>
> >> BTW, the throughput I get with a "dd if=bitmap of=/dev/fb0 bs=512"
> >> is the same as I get from my test application writing longwords
> >> sequentially to the mmapped frame buffer.
> > I'm not sure writecombine is enabled properly - can you test that on DRAM?
> > You can do that by reserving some memory (mem=<dram size - 8M>), then
> > test the throughput with and without writecombine.
> >
> Also, in order to send bursts, make sure that the stm instruction is
> used, preferably with 8 registers and the address aligned to 8*4 bytes.

The copy_from_user code (through the write() system call) already takes 
care of this with large enough buffers.  Of course with mmap()'d memory 
you are responsible for optimizing this yourself.


Nicolas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-07 16:11         ` Wolfgang Wegner
@ 2010-09-07 19:14           ` Nicolas Pitre
  2010-09-08  8:35             ` Wolfgang Wegner
  0 siblings, 1 reply; 16+ messages in thread
From: Nicolas Pitre @ 2010-09-07 19:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 7 Sep 2010, Wolfgang Wegner wrote:

> However, I am not getting the writes to use the stm instruction,
> so maybe this is the real limitation.
> Here is my very basic test program:
> 
> #define MEMSIZE 0x800000
> 
> int main() {
>   int fbfd;
>   unsigned long *fbp, *sfbp;
>   unsigned long i;
>   unsigned long fill_val = 0x12345678;
> 
>   fbfd = open("/dev/fb0", O_RDWR);
>   if (fbfd < 0) { /* open() returns -1 on failure */
>     printf("Error: cannot open framebuffer device.\n");
>     exit(1);
>   }
>   fbp = (unsigned long *)mmap(0, MEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
>                               fbfd, 0);
>   sfbp = fbp;
>   if (fbp == MAP_FAILED) { /* mmap() returns MAP_FAILED on error */
>     printf("Error: cannot mmap framebuffer\n");
>     exit(1);
>   }
> #if 0
>   for (i = 0; i < (MEMSIZE / 4); i += 8) {
>     *(fbp + i) = fill_val;
>     *(fbp + i + 1) = fill_val;
>     *(fbp + i + 2) = fill_val;
>     *(fbp + i + 3) = fill_val;
>     *(fbp + i + 4) = fill_val;
>     *(fbp + i + 5) = fill_val;
>     *(fbp + i + 6) = fill_val;
>     *(fbp + i + 7) = fill_val;
>   }
> #else
>   for (i = MEMSIZE/32; i; i--) {
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>     *(fbp++) = fill_val;
>   }
> #endif
>   munmap(sfbp, MEMSIZE);
>   close(fbfd);
>   return 0;
> }
> 
> Neither of the cases results in stm being used; all of them use
> str.
> (I am not so deep into assembler, let alone ARM assembler,
> so please bear with my ignorance about what stm is or does
> for now...)

The STM instruction means store-multiple, i.e. it takes a set of
registers and writes them to memory in one go.  You could try using
memset(), which should be optimized to use STM in that case:

	memset(fbp, fill_val, MEMSIZE);

(although memset() works with chars, so only the least significant
byte of fill_val will be stored.)

Otherwise you could open-code this test like this:


	register long __r0 asm("r0") = fill_val;
	register long __r1 asm("r1") = fill_val;
	register long __r2 asm("r2") = fill_val;
	register long __r3 asm("r3") = fill_val;
	register long __r4 asm("r4") = fill_val;
	register long __r5 asm("r5") = fill_val;
	register long __r6 asm("r6") = fill_val;
	register long __r7 asm("r7") = fill_val;
	for (i = 0; i < MEMSIZE/4; i += 8) {
		asm volatile(
			"stmia %0!, {r0 - r7}"
			: "+r" (fbp)
			: "r" (__r0), "r" (__r1), "r" (__r2), "r" (__r3),
			  "r" (__r4), "r" (__r5), "r" (__r6), "r" (__r7));
	}


Nicolas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-07 19:14           ` Nicolas Pitre
@ 2010-09-08  8:35             ` Wolfgang Wegner
  2010-09-09 16:21               ` Wolfgang Wegner
  0 siblings, 1 reply; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-08  8:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Sep 07, 2010 at 03:14:08PM -0400, Nicolas Pitre wrote:
> The STM instruction means store-multiple i.e. it takes a set of 
> registers and write them to memory in one go.  You could try using 
> memset() which should be optimized to use STM in that case:
[...]

Thank you for the explanation and code example!

Using write() (dd if=/dev/zero of=/dev/fb0 bs=1024 count=8192), the
throughput was slightly lower in either case, but this may just as
well be some other overhead (0.035s->0.052s for RAM, 1.299s->1.310s
for the PCI frame buffer).

Using your assembler code, I get almost double throughput (0.035s->
0.018s, meaning around 466 MBytes/s) for RAM and a system lockup
for my PCI device. Hmm...

I will now set up some eval boards to see if I get an "off-the-shelf"
framebuffer with a stock PCI graphics card up and running for a
comparison.

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-08  8:35             ` Wolfgang Wegner
@ 2010-09-09 16:21               ` Wolfgang Wegner
  2010-09-13 17:10                 ` Leon Woestenberg
  0 siblings, 1 reply; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-09 16:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 08, 2010 at 10:35:58AM +0200, Wolfgang Wegner wrote:
> 
> Using your assembler code, I get almost double throughput (0.035s->
> 0.018s, meaning around 466 MBytes/s) for RAM and a system lockup
> for my PCI device. Hmm...
> 
> I will now set up some eval boards to see if I get an "off-the-shelf"
> framebuffer with a stock PCI graphics card up and running for a
> comparison.

The only memory-mapped PCI device I managed to get running behind
the PCIe->PCI bridge on the eval board was the FPGA evaluation
board, together with the manufacturer-supplied evaluation code.
(The PCI graphics cards were either too old (5V only) or ATI-based,
whose driver seems to have been "improved" to the point of failing
without a BIOS. *sigh*)

With the FPGA evaluation board I get:
- around 38 MBytes/second with Nicolas' inline assembly code
- around 6 MBytes/second with any other C code (mmapped) as
  well as with write() via dd

This holds regardless of using ioremap_{wc,nocache,cached} and
pgprot_writecombine/pgprot_noncached.

So the main problem seems to be either our board implementation
of the PCIe->PCI bridge or the FPGA. However, I am still wondering
how a framebuffer-based application can attain reasonable performance,
as (to my understanding) using such throughput-optimized assembly
code will not be possible in most cases.

On a side note: can anybody give a hint on how to enable
ASYNC_CORE/ASYNC_MEMCPY? I see the options in crypto/async_tx/Kconfig
but cannot find them via menuconfig. I would still like to try
using the DMA engine for transferring complete frames...

Regards,
Wolfgang

PS: another PCI device I tried via the PCIe->PCI bridge was
    an Intel 82574L GBit NIC, which was able to reach >600MBit/s
    throughput when tested with netio or netperf.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-09 16:21               ` Wolfgang Wegner
@ 2010-09-13 17:10                 ` Leon Woestenberg
  2010-09-14  7:03                   ` Wolfgang Wegner
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Woestenberg @ 2010-09-13 17:10 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Wolfgang,

On Thu, Sep 9, 2010 at 6:21 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> On Wed, Sep 08, 2010 at 10:35:58AM +0200, Wolfgang Wegner wrote:
>>
> With the FPGA evaluation board I get:
> - around 38 MBytes/second with Nicolas' inline assembly code
> - around 6 MBytes/second with any other C code (mmapped) as
>   well as with write() via dd
>
> So the main problem seems to be either our board implementation
> of the PCIe->PCI bridge or the FPGA. However, I am still wondering
> how a framebuffer-based application can attain reasonable performance,
>
Having implemented a framebuffer demo on an FPGA recently using PCI
Express, I think the main performance gain is made by having the DMA
done by the endpoint (FPGA) rather than by the CPU.

> PS: another PCI device I tried via the PCIe->PCI bridge was
>     an Intel 82574L GBit NIC, which was able to reach >600MBit/s
>     throughput when tested with netio or netperf.
>
Such devices (typically) use endpoint-initiated DMA, i.e. they do not
involve much overhead/latency on the CPU/host/root-complex side.

Regards,

Leon.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-13 17:10                 ` Leon Woestenberg
@ 2010-09-14  7:03                   ` Wolfgang Wegner
  2010-09-15 23:39                     ` Leon Woestenberg
  0 siblings, 1 reply; 16+ messages in thread
From: Wolfgang Wegner @ 2010-09-14  7:03 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Leon,

On Mon, Sep 13, 2010 at 07:10:59PM +0200, Leon Woestenberg wrote:
> Hello Wolfgang,
> 
> On Thu, Sep 9, 2010 at 6:21 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> > On Wed, Sep 08, 2010 at 10:35:58AM +0200, Wolfgang Wegner wrote:
> >>
> > With the FPGA evaluation board I get:
> > - around 38 MBytes/second with Nicolas' inline assembly code
> > - around 6 MBytes/second with any other C code (mmapped) as
> >   well as with write() via dd
> >
> > So the main problem seems to be either our board implementation
> > of the PCIe->PCI bridge or the FPGA. However, I am still wondering
> > how a framebuffer-based application can attain reasonable performance,
> >
> Having implemented a framebuffer demo on an FPGA recently using PCI
> Express, I think the main performance gain is made by having the DMA
> done by the endpoint (FPGA) rather than by the CPU.

This is what I read everywhere; however, I do not see how this
can improve anything when using an mmap()ed frame buffer for
pixel-oriented operations...
This is why I thought about reverting to write() and simply
transferring complete frames, which would be sufficient for about
90% of my application scenarios - and for the other 10% I could
live with the lower performance.

Regards,
Wolfgang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user?
  2010-09-14  7:03                   ` Wolfgang Wegner
@ 2010-09-15 23:39                     ` Leon Woestenberg
  0 siblings, 0 replies; 16+ messages in thread
From: Leon Woestenberg @ 2010-09-15 23:39 UTC (permalink / raw)
  To: linux-arm-kernel

Hello Wolfgang,

On Tue, Sep 14, 2010 at 9:03 AM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
> On Mon, Sep 13, 2010 at 07:10:59PM +0200, Leon Woestenberg wrote:
>> Hello Wolfgang,
>>
>> On Thu, Sep 9, 2010 at 6:21 PM, Wolfgang Wegner <ww-ml@gmx.de> wrote:
>> > On Wed, Sep 08, 2010 at 10:35:58AM +0200, Wolfgang Wegner wrote:
>> >>
>> > With the FPGA evaluation board I get:
>> > - around 38 MBytes/second with Nicolas' inline assembly code
>> > - around 6 MBytes/second with any other C code (mmapped) as
>> >   well as with write() via dd
>> >
>> > So the main problem seems to be either our board implementation
>> > of the PCIe->PCI bridge or the FPGA. However, I am still wondering
>> > how a framebuffer-based application can attain reasonable performance,
>> >
>> Having implemented a framebuffer demo on an FPGA recently using PCI
>> Express, I think the main performance gain is made by having the DMA
>> done by the endpoint (FPGA) rather than by the CPU.
>
> This is what I read everywhere; however, I do not see how this
> can improve anything when using an mmap()ed frame buffer for
> pixel-oriented operations...
>
I haven't seen a SoC yet that can reach full PCIe bandwidth using
its DMA controller to push data out over the PCIe bus. Most of them
do not support the same type of large read requests that an endpoint
may perform, typically 512 or even 4096 bytes per read request over
PCI Express. Couple that with an efficient endpoint SGDMA controller
and you reach full PCIe bandwidth, with no cycles wasted.

mmap() backed by a good SoC DMA controller may come close, but the max
payload is usually less (128 bytes).

Also, most SoC DMA controllers are too limited to set up the kind of
DMA you want.

> This is why I thought about reverting to write() and simply
> transferring complete frames, which would be sufficient for about
> 90% of my application scenarios - and for the other 10% I could
> live with the lower performance.
>
Anything that works. How much burden do you want to put on the CPU, though?

Regards,
-- 
Leon

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-09-15 23:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-06 10:02 Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? Wolfgang Wegner
2010-09-06 13:46 ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
2010-09-06 13:58   ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? Wolfgang Wegner
2010-09-06 14:03 ` Russell King - ARM Linux
2010-09-06 14:11   ` Wolfgang Wegner
2010-09-06 14:14   ` Wolfgang Wegner
2010-09-07  7:58     ` Kirkwood PCI(e) write performance and DMA engine support for copy_{to,from}_user? saeed bishara
2010-09-07  9:52       ` saeed bishara
2010-09-07 16:11         ` Wolfgang Wegner
2010-09-07 19:14           ` Nicolas Pitre
2010-09-08  8:35             ` Wolfgang Wegner
2010-09-09 16:21               ` Wolfgang Wegner
2010-09-13 17:10                 ` Leon Woestenberg
2010-09-14  7:03                   ` Wolfgang Wegner
2010-09-15 23:39                     ` Leon Woestenberg
2010-09-07 18:38         ` Nicolas Pitre
