* simple framebuffer slower by factor of 20, on socfpga (arm) platform
@ 2015-04-07 12:12 Pavel Machek
  2015-04-07 12:19 ` Geert Uytterhoeven
  0 siblings, 1 reply; 18+ messages in thread
From: Pavel Machek @ 2015-04-07 12:12 UTC (permalink / raw)
  To: Marek Vasut, kernel list, dinh.linux
  Cc: plagnioj, tomi.valkeinen, grant.likely, robh+dt, jg1.han,
	robdclark, geert, linux-fbdev, devicetree

Hi!

I have an socfpga board, which has a simple framebuffer implemented
in the FPGA. On 3.15, framebuffer is fast:

root@wagabuibui:~# time cat /dev/fb0 > /dev/null
real    0m 0.00s
user    0m 0.00s
sys     0m 0.00s

On 3.18, this takes 220msec. Similar slowdown exists for
writes. Simple framebuffer did not change at all between 3.15 and
3.18; resource flags of the framebuffer are still the same (0x200).

If I enable caching on 3.18, it speeds up a bit, to 70msec or
so... Which means the problem is not only in caching.

Any ideas?

Best regards,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-07 12:12 simple framebuffer slower by factor of 20, on socfpga (arm) platform Pavel Machek @ 2015-04-07 12:19 ` Geert Uytterhoeven 2015-04-07 14:24 ` Marek Vasut 2015-04-09 11:06 ` Pavel Machek 0 siblings, 2 replies; 18+ messages in thread From: Geert Uytterhoeven @ 2015-04-07 12:19 UTC (permalink / raw) To: Pavel Machek Cc: Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Tomi Valkeinen, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree Hi Pavel, On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote: > I have an socfpga board, which uses has simple framebuffer implemented > in the FPGA. On 3.15, framebuffer is fast: > > root@wagabuibui:~# time cat /dev/fb0 > /dev/null > real 0m 0.00s > user 0m 0.00s > sys 0m 0.00s > > on 3.18, this takes 220msec. Similar slowdown exists for > writes. Simple framebuffer did not change at all between 3.15 and > 3.18; resource flags of the framebuffer are still same (0x200). > > If I enable caching on 3.18, it speeds up a bit, to 70msec or > so... Which means problem is not only in caching. > > Any ideas? My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer copying optimisations"), but this was introduced only in v4.0-rc1. Just in case you encounter another performance regression after upgrading to a more modern kernel ;-) Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-07 12:19 ` Geert Uytterhoeven @ 2015-04-07 14:24 ` Marek Vasut 2015-04-09 11:06 ` Pavel Machek 1 sibling, 0 replies; 18+ messages in thread From: Marek Vasut @ 2015-04-07 14:24 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Pavel Machek, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Tomi Valkeinen, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree On Tuesday, April 07, 2015 at 02:19:33 PM, Geert Uytterhoeven wrote: > Hi Pavel, > > On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote: > > I have an socfpga board, which uses has simple framebuffer implemented > > in the FPGA. On 3.15, framebuffer is fast: > > > > root@wagabuibui:~# time cat /dev/fb0 > /dev/null > > real 0m 0.00s > > user 0m 0.00s > > sys 0m 0.00s > > > > on 3.18, this takes 220msec. Similar slowdown exists for > > writes. Simple framebuffer did not change at all between 3.15 and > > 3.18; resource flags of the framebuffer are still same (0x200). > > > > If I enable caching on 3.18, it speeds up a bit, to 70msec or > > so... Which means problem is not only in caching. > > > > Any ideas? > > My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer > copying optimisations"), but this was introduced only in v4.0-rc1. > > Just in case you encounter another performance regression after upgrading > to a more modern kernel ;-) Why don't you use the Altera VIP FB on SoCFPGA ? Best regards, Marek Vasut ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform
  2015-04-07 12:19 ` Geert Uytterhoeven
  2015-04-07 14:24   ` Marek Vasut
@ 2015-04-09 11:06   ` Pavel Machek
  2015-04-09 11:21     ` Tomi Valkeinen
  1 sibling, 1 reply; 18+ messages in thread
From: Pavel Machek @ 2015-04-09 11:06 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Marek Vasut, kernel list, Dinh Nguyen,
	Jean-Christophe PLAGNIOL-VILLARD, Tomi Valkeinen, Grant Likely,
	Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list,
	devicetree, archit, shc_work, linux, hsweeten

On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
> Hi Pavel,
>
> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote:
> > I have an socfpga board, which uses has simple framebuffer implemented
> > in the FPGA. On 3.15, framebuffer is fast:
> >
> > root@wagabuibui:~# time cat /dev/fb0 > /dev/null
> > real    0m 0.00s
> > user    0m 0.00s
> > sys     0m 0.00s
> >
> > on 3.18, this takes 220msec. Similar slowdown exists for
> > writes. Simple framebuffer did not change at all between 3.15 and
> > 3.18; resource flags of the framebuffer are still same (0x200).
> >
> > If I enable caching on 3.18, it speeds up a bit, to 70msec or
> > so... Which means problem is not only in caching.
> >
> > Any ideas?
>
> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
> copying optimisations"), but this was introduced only in v4.0-rc1.
>
> Just in case you encounter another performance regression after upgrading
> to a more modern kernel ;-)

:-). I did a git bisect, and it pointed to the commit below. And
reverting it indeed fixes the problem in 3.18. Problem is still there
in 4.0.

Archit, do you know what is going on there? Should the revert be filed
for 4.0?

								Pavel

commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
Author: Archit Taneja <archit@ti.com>
Date:   Fri Nov 16 14:46:04 2012 +0530

    fbdev: arm has __raw I/O accessors, use them in fb.h

    This removes the sparse warnings on arm platforms:

    warning: cast removes address space of expression

    Signed-off-by: Archit Taneja <archit@ti.com>
    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
    Cc: Alexander Shiyan <shc_work@mail.ru>
    Cc: Russell King <linux@arm.linux.org.uk>

> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform
  2015-04-09 11:06   ` Pavel Machek
@ 2015-04-09 11:21     ` Tomi Valkeinen
  2015-04-09 11:34       ` Tomi Valkeinen
  0 siblings, 1 reply; 18+ messages in thread
From: Tomi Valkeinen @ 2015-04-09 11:21 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen,
	Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring,
	Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree,
	shc_work, linux, hsweeten, Archit Taneja

[-- Attachment #1: Type: text/plain, Size: 2259 bytes --]

On 09/04/15 14:06, Pavel Machek wrote:
> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
>> Hi Pavel,
>>
>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote:
>>> I have an socfpga board, which uses has simple framebuffer implemented
>>> in the FPGA. On 3.15, framebuffer is fast:
>>>
>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
>>> real    0m 0.00s
>>> user    0m 0.00s
>>> sys     0m 0.00s
>>>
>>> on 3.18, this takes 220msec. Similar slowdown exists for
>>> writes. Simple framebuffer did not change at all between 3.15 and
>>> 3.18; resource flags of the framebuffer are still same (0x200).
>>>
>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or
>>> so... Which means problem is not only in caching.
>>>
>>> Any ideas?
>>
>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
>> copying optimisations"), but this was introduced only in v4.0-rc1.
>>
>> Just in case you encounter another performance regression after upgrading
>> to a more modern kernel ;-)
>
> :-). I did a git bisect, and it pointed to this. And reverting it
> indeed fixes the problem in 3.18. Problem is still there in 4.0.

Interesting. I can reproduce this also on TI's AM437x board, on 3.14 kernel.

Without the patch:

# time cat /dev/fb0 > /dev/null
real    0m 0.01s
user    0m 0.00s
sys     0m 0.01s

With the patch:

# time cat /dev/fb0 > /dev/null
real    0m 0.19s
user    0m 0.01s
sys     0m 0.17s

> Archit do you know what is going on there? Should the revert be filled
> for 4.0?

(Cc'ing Archit's new email)

>
> 								Pavel
>
> commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
> Author: Archit Taneja <archit@ti.com>
> Date:   Fri Nov 16 14:46:04 2012 +0530
>
>     fbdev: arm has __raw I/O accessors, use them in fb.h
>
>     This removes the sparse warnings on arm platforms:
>
>     warning: cast removes address space of expression
>
>     Signed-off-by: Archit Taneja <archit@ti.com>
>     Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
>     Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
>     Cc: Alexander Shiyan <shc_work@mail.ru>
>     Cc: Russell King <linux@arm.linux.org.uk>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-09 11:21 ` Tomi Valkeinen @ 2015-04-09 11:34 ` Tomi Valkeinen 2015-04-09 19:51 ` Arnd Bergmann 2015-04-10 7:05 ` Archit Taneja 0 siblings, 2 replies; 18+ messages in thread From: Tomi Valkeinen @ 2015-04-09 11:34 UTC (permalink / raw) To: Pavel Machek Cc: Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten, Archit Taneja [-- Attachment #1: Type: text/plain, Size: 1663 bytes --] On 09/04/15 14:21, Tomi Valkeinen wrote: > On 09/04/15 14:06, Pavel Machek wrote: >> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote: >>> Hi Pavel, >>> >>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote: >>>> I have an socfpga board, which uses has simple framebuffer implemented >>>> in the FPGA. On 3.15, framebuffer is fast: >>>> >>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null >>>> real 0m 0.00s >>>> user 0m 0.00s >>>> sys 0m 0.00s >>>> >>>> on 3.18, this takes 220msec. Similar slowdown exists for >>>> writes. Simple framebuffer did not change at all between 3.15 and >>>> 3.18; resource flags of the framebuffer are still same (0x200). >>>> >>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or >>>> so... Which means problem is not only in caching. >>>> >>>> Any ideas? >>> >>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer >>> copying optimisations"), but this was introduced only in v4.0-rc1. >>> >>> Just in case you encounter another performance regression after upgrading >>> to a more modern kernel ;-) >> >> :-). I did a git bisect, and it pointed to this. And reverting it >> indeed fixes the problem in 3.18. Problem is still there in 4.0. The difference is probably caused by memcpy() vs memcpy_fromio(). The comment above memcpy_fromio() says "This needs to be optimized". 
I think generally speaking memcpy_fromio() is correct for a framebuffer. That said, if the fb is in RAM, and is only written by the CPU, I think a normal memcpy() for fb_memcpy_fromfb() should be fine... Tomi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-09 11:34 ` Tomi Valkeinen @ 2015-04-09 19:51 ` Arnd Bergmann 2015-04-24 13:31 ` Pavel Machek 2015-04-10 7:05 ` Archit Taneja 1 sibling, 1 reply; 18+ messages in thread From: Arnd Bergmann @ 2015-04-09 19:51 UTC (permalink / raw) To: Tomi Valkeinen Cc: Pavel Machek, Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten, Archit Taneja On Thursday 09 April 2015 14:34:26 Tomi Valkeinen wrote: > On 09/04/15 14:21, Tomi Valkeinen wrote: > > On 09/04/15 14:06, Pavel Machek wrote: > >> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote: > >>> Hi Pavel, > >>> > >>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote: > >>>> I have an socfpga board, which uses has simple framebuffer implemented > >>>> in the FPGA. On 3.15, framebuffer is fast: > >>>> > >>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null > >>>> real 0m 0.00s > >>>> user 0m 0.00s > >>>> sys 0m 0.00s > >>>> > >>>> on 3.18, this takes 220msec. Similar slowdown exists for > >>>> writes. Simple framebuffer did not change at all between 3.15 and > >>>> 3.18; resource flags of the framebuffer are still same (0x200). > >>>> > >>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or > >>>> so... Which means problem is not only in caching. > >>>> > >>>> Any ideas? > >>> > >>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer > >>> copying optimisations"), but this was introduced only in v4.0-rc1. > >>> > >>> Just in case you encounter another performance regression after upgrading > >>> to a more modern kernel > >> > >> :-). I did a git bisect, and it pointed to this. And reverting it > >> indeed fixes the problem in 3.18. Problem is still there in 4.0. > > The difference is probably caused by memcpy() vs memcpy_fromio(). 
The > comment above memcpy_fromio() says "This needs to be optimized". I think > generally speaking memcpy_fromio() is correct for a framebuffer. > > That said, if the fb is in RAM, and is only written by the CPU, I think > a normal memcpy() for fb_memcpy_fromfb() should be fine... Could memcpy() cause alignment traps here if the fb pointer is unaligned and uncached? Arnd ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-09 19:51 ` Arnd Bergmann @ 2015-04-24 13:31 ` Pavel Machek 0 siblings, 0 replies; 18+ messages in thread From: Pavel Machek @ 2015-04-24 13:31 UTC (permalink / raw) To: Arnd Bergmann Cc: Tomi Valkeinen, Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten, Archit Taneja > > The difference is probably caused by memcpy() vs memcpy_fromio(). The > > comment above memcpy_fromio() says "This needs to be optimized". I think > > generally speaking memcpy_fromio() is correct for a framebuffer. > > > > That said, if the fb is in RAM, and is only written by the CPU, I think > > a normal memcpy() for fb_memcpy_fromfb() should be fine... > > Could memcpy() cause alignment traps here if the fb pointer is unaligned > and uncached? Original commit did not comment on any failure, so I expect that is not a problem here... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-09 11:34 ` Tomi Valkeinen 2015-04-09 19:51 ` Arnd Bergmann @ 2015-04-10 7:05 ` Archit Taneja 2015-04-24 13:29 ` Pavel Machek 1 sibling, 1 reply; 18+ messages in thread From: Archit Taneja @ 2015-04-10 7:05 UTC (permalink / raw) To: Tomi Valkeinen, Pavel Machek Cc: Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten On 04/09/2015 05:04 PM, Tomi Valkeinen wrote: > On 09/04/15 14:21, Tomi Valkeinen wrote: >> On 09/04/15 14:06, Pavel Machek wrote: >>> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote: >>>> Hi Pavel, >>>> >>>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <pavel@ucw.cz> wrote: >>>>> I have an socfpga board, which uses has simple framebuffer implemented >>>>> in the FPGA. On 3.15, framebuffer is fast: >>>>> >>>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null >>>>> real 0m 0.00s >>>>> user 0m 0.00s >>>>> sys 0m 0.00s >>>>> >>>>> on 3.18, this takes 220msec. Similar slowdown exists for >>>>> writes. Simple framebuffer did not change at all between 3.15 and >>>>> 3.18; resource flags of the framebuffer are still same (0x200). >>>>> >>>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or >>>>> so... Which means problem is not only in caching. >>>>> >>>>> Any ideas? >>>> >>>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer >>>> copying optimisations"), but this was introduced only in v4.0-rc1. >>>> >>>> Just in case you encounter another performance regression after upgrading >>>> to a more modern kernel ;-) >>> >>> :-). I did a git bisect, and it pointed to this. And reverting it >>> indeed fixes the problem in 3.18. Problem is still there in 4.0. > > The difference is probably caused by memcpy() vs memcpy_fromio(). 
The > comment above memcpy_fromio() says "This needs to be optimized". I think > generally speaking memcpy_fromio() is correct for a framebuffer. > > That said, if the fb is in RAM, and is only written by the CPU, I think > a normal memcpy() for fb_memcpy_fromfb() should be fine... I didn't test for performance regressions when I posted this patch. A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used all the time, even when the source and destination addresses are aligned for larger reads to be possible. Other archs seem to use readl() or readq() when they can. Maybe that makes memcpy_fromio slower than the implementation of memcpy on arm? Thanks, Archit -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project ^ permalink raw reply [flat|nested] 18+ messages in thread
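The access-size point Archit raises can be illustrated with a small userspace sketch in C. This is not the kernel code: sim_readb(), sim_readl(), io_reads, copy_bytewise() and copy_wordwise() are hypothetical names made up for illustration, and a plain counter stands in for bus accesses. It only shows that, for an aligned source, a word-wise loop issues roughly a quarter of the reads of a byte-wise loop while producing the same bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical counter standing in for "bus accesses"; not a kernel facility. */
static unsigned long io_reads;

/* Stand-in for an 8-bit I/O read. */
static uint8_t sim_readb(const volatile void *addr)
{
	io_reads++;
	return *(const volatile uint8_t *)addr;
}

/* Stand-in for a 32-bit I/O read; addr must be 4-byte aligned. */
static uint32_t sim_readl(const volatile void *addr)
{
	io_reads++;
	return *(const volatile uint32_t *)addr;
}

/* One read per byte, the shape described for ARM's _memcpy_fromio(). */
static void copy_bytewise(void *to, const volatile void *from, size_t count)
{
	uint8_t *t = to;
	const volatile uint8_t *f = from;

	while (count--)
		*t++ = sim_readb(f++);
}

/* 32-bit reads where the source is aligned, byte reads for head and tail. */
static void copy_wordwise(void *to, const volatile void *from, size_t count)
{
	uint8_t *t = to;
	const volatile uint8_t *f = from;

	/* Head: byte reads until the source pointer is 4-byte aligned. */
	while (count && ((uintptr_t)f & 3)) {
		*t++ = sim_readb(f++);
		count--;
	}

	/* Body: one 32-bit read per four bytes. */
	while (count >= 4) {
		uint32_t v = sim_readl(f);

		memcpy(t, &v, sizeof(v));	/* destination may be unaligned */
		t += 4;
		f += 4;
		count -= 4;
	}

	/* Tail: remaining bytes, one read each. */
	while (count--)
		*t++ = sim_readb(f++);
}
```

Copying 256 bytes from a 4-byte aligned buffer this way costs 256 simulated reads byte-wise and 64 word-wise. On real hardware the gain depends on the bus and the mapping attributes, so this is only a model of the access count, not a measurement.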
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-10 7:05 ` Archit Taneja @ 2015-04-24 13:29 ` Pavel Machek 2015-04-24 13:40 ` Tomi Valkeinen 0 siblings, 1 reply; 18+ messages in thread From: Pavel Machek @ 2015-04-24 13:29 UTC (permalink / raw) To: Archit Taneja Cc: Tomi Valkeinen, Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten Hi! On Fri 2015-04-10 12:35:52, Archit Taneja wrote: > >That said, if the fb is in RAM, and is only written by the CPU, I think > >a normal memcpy() for fb_memcpy_fromfb() should be fine... > > I didn't test for performance regressions when I posted this patch. > > A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used > all the time, even when the source and destination addresses are aligned for > larger reads to be possible. Other archs seem to use readl() or readq() when > they can. Maybe that makes memcpy_fromio slower than the implementation of > memcpy on arm? Ok, can you prepare a patch for me to try? Or should we just revert the original commit? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-24 13:29 ` Pavel Machek @ 2015-04-24 13:40 ` Tomi Valkeinen 2015-04-24 13:46 ` Geert Uytterhoeven 0 siblings, 1 reply; 18+ messages in thread From: Tomi Valkeinen @ 2015-04-24 13:40 UTC (permalink / raw) To: Pavel Machek, Archit Taneja Cc: Geert Uytterhoeven, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, shc_work, linux, hsweeten [-- Attachment #1: Type: text/plain, Size: 932 bytes --] On 24/04/15 16:29, Pavel Machek wrote: > Hi! > > On Fri 2015-04-10 12:35:52, Archit Taneja wrote: >>> That said, if the fb is in RAM, and is only written by the CPU, I think >>> a normal memcpy() for fb_memcpy_fromfb() should be fine... >> >> I didn't test for performance regressions when I posted this patch. >> >> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used >> all the time, even when the source and destination addresses are aligned for >> larger reads to be possible. Other archs seem to use readl() or readq() when >> they can. Maybe that makes memcpy_fromio slower than the implementation of >> memcpy on arm? > > Ok, can you prepare a patch for me to try? Or should we just revert > the original commit? The old way worked fine, afaik, so maybe we can revert. But still, isn't it more correct to use memcpy_fromio? It's (possibly) io memory we have here. Tomi [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-24 13:40 ` Tomi Valkeinen @ 2015-04-24 13:46 ` Geert Uytterhoeven 2015-04-26 19:31 ` Pavel Machek 2015-04-28 13:48 ` Russell King - ARM Linux 0 siblings, 2 replies; 18+ messages in thread From: Geert Uytterhoeven @ 2015-04-24 13:46 UTC (permalink / raw) To: Tomi Valkeinen Cc: Pavel Machek, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, Russell King, H Hartley Sweeten On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <tomi.valkeinen@ti.com> wrote: > On 24/04/15 16:29, Pavel Machek wrote: >> On Fri 2015-04-10 12:35:52, Archit Taneja wrote: >>>> That said, if the fb is in RAM, and is only written by the CPU, I think >>>> a normal memcpy() for fb_memcpy_fromfb() should be fine... >>> >>> I didn't test for performance regressions when I posted this patch. >>> >>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used >>> all the time, even when the source and destination addresses are aligned for >>> larger reads to be possible. Other archs seem to use readl() or readq() when >>> they can. Maybe that makes memcpy_fromio slower than the implementation of >>> memcpy on arm? >> >> Ok, can you prepare a patch for me to try? Or should we just revert >> the original commit? > > The old way worked fine, afaik, so maybe we can revert. But still, isn't > it more correct to use memcpy_fromio? It's (possibly) io memory we have > here. Yes it is. So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io(). That will benefit other drivers on ARM, too. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. 
-- Linus Torvalds ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform
  2015-04-24 13:46 ` Geert Uytterhoeven
@ 2015-04-26 19:31   ` Pavel Machek
  2015-04-28 13:48   ` Russell King - ARM Linux
  1 sibling, 0 replies; 18+ messages in thread
From: Pavel Machek @ 2015-04-26 19:31 UTC (permalink / raw)
  To: Geert Uytterhoeven, Russell King
  Cc: Tomi Valkeinen, Archit Taneja, Marek Vasut, kernel list,
	Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely,
	Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list,
	devicetree, Alexander Shiyan, Russell King, H Hartley Sweeten

On Fri 2015-04-24 15:46:56, Geert Uytterhoeven wrote:
> On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <tomi.valkeinen@ti.com> wrote:
> > On 24/04/15 16:29, Pavel Machek wrote:
> >> On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
> >>>> That said, if the fb is in RAM, and is only written by the CPU, I think
> >>>> a normal memcpy() for fb_memcpy_fromfb() should be fine...
> >>>
> >>> I didn't test for performance regressions when I posted this patch.
> >>>
> >>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
> >>> all the time, even when the source and destination addresses are aligned for
> >>> larger reads to be possible. Other archs seem to use readl() or readq() when
> >>> they can. Maybe that makes memcpy_fromio slower than the implementation of
> >>> memcpy on arm?
> >>
> >> Ok, can you prepare a patch for me to try? Or should we just revert
> >> the original commit?
> >
> > The old way worked fine, afaik, so maybe we can revert. But still, isn't
> > it more correct to use memcpy_fromio? It's (possibly) io memory we have
> > here.
>
> Yes it is.
>
> So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> That will benefit other drivers on ARM, too.

No, sorry.

Yes, I could "optimize" memcpy_toio... just by sticking memcpy there,
as for example asm-generic/io.h suggests. Maybe it would break
something. Maybe not, but potential for that clearly is there... since
this is very seldom used function. Or do you know drivers that would
benefit from this?

void _memcpy_toio(volatile void __iomem *to, const void *from, size_t count)
{
	const unsigned char *f = from;
	while (count) {
		count--;
		writeb(*f, to);
		f++;
		to++;
	}
}

We have a regression, we have a patch that causes the regression. Right
fix at this point is to revert a "cleanup" that causes this, not try
to "optimize" otherwise unused piece of code.

commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
Author: Archit Taneja <archit@ti.com>
Date:   Fri Nov 16 14:46:04 2012 +0530

    fbdev: arm has __raw I/O accessors, use them in fb.h

    This removes the sparse warnings on arm platforms:

    warning: cast removes address space of expression

    Signed-off-by: Archit Taneja <archit@ti.com>
    Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
    Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
    Cc: Alexander Shiyan <shc_work@mail.ru>
    Cc: Russell King <linux@arm.linux.org.uk>

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-24 13:46 ` Geert Uytterhoeven 2015-04-26 19:31 ` Pavel Machek @ 2015-04-28 13:48 ` Russell King - ARM Linux 2015-04-28 15:28 ` Nicolas Pitre 1 sibling, 1 reply; 18+ messages in thread From: Russell King - ARM Linux @ 2015-04-28 13:48 UTC (permalink / raw) To: Geert Uytterhoeven Cc: Tomi Valkeinen, Pavel Machek, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, H Hartley Sweeten On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote: > On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <tomi.valkeinen@ti.com> wrote: > > On 24/04/15 16:29, Pavel Machek wrote: > >> On Fri 2015-04-10 12:35:52, Archit Taneja wrote: > >>>> That said, if the fb is in RAM, and is only written by the CPU, I think > >>>> a normal memcpy() for fb_memcpy_fromfb() should be fine... > >>> > >>> I didn't test for performance regressions when I posted this patch. > >>> > >>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used > >>> all the time, even when the source and destination addresses are aligned for > >>> larger reads to be possible. Other archs seem to use readl() or readq() when > >>> they can. Maybe that makes memcpy_fromio slower than the implementation of > >>> memcpy on arm? > >> > >> Ok, can you prepare a patch for me to try? Or should we just revert > >> the original commit? > > > > The old way worked fine, afaik, so maybe we can revert. But still, isn't > > it more correct to use memcpy_fromio? It's (possibly) io memory we have > > here. > > Yes it is. > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io(). > That will benefit other drivers on ARM, too. That's not going to happen. 
I've had a patch which does that, but people are concerned that it changes the behaviour of the functions by changing the access size, which could cause regressions. It seems people are far too worried about that to even consider trying. :( -- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-28 13:48 ` Russell King - ARM Linux @ 2015-04-28 15:28 ` Nicolas Pitre 2015-05-06 10:45 ` Russell King - ARM Linux 0 siblings, 1 reply; 18+ messages in thread From: Nicolas Pitre @ 2015-04-28 15:28 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Geert Uytterhoeven, Tomi Valkeinen, Pavel Machek, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, H Hartley Sweeten On Tue, 28 Apr 2015, Russell King - ARM Linux wrote: > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote: > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io(). > > That will benefit other drivers on ARM, too. > > That's not going to happen. > > I've had a patch which does that, but people are concerned that it changes > the behaviour of the functions by changing the access size, which could > cause regressions. It seems people are far too worried about that to even > consider trying. :( What about making the optimized implementation available via kconfig? Nicolas ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-04-28 15:28 ` Nicolas Pitre @ 2015-05-06 10:45 ` Russell King - ARM Linux 2015-05-06 20:32 ` Nicolas Pitre 2015-05-12 8:52 ` Pavel Machek 0 siblings, 2 replies; 18+ messages in thread From: Russell King - ARM Linux @ 2015-05-06 10:45 UTC (permalink / raw) To: Nicolas Pitre Cc: Geert Uytterhoeven, Tomi Valkeinen, Pavel Machek, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, H Hartley Sweeten

On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
>
> > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > That will benefit other drivers on ARM, too.
> >
> > That's not going to happen.
> >
> > I've had a patch which does that, but people are concerned that it changes
> > the behaviour of the functions by changing the access size, which could
> > cause regressions. It seems people are far too worried about that to even
> > consider trying. :(
>
> What about making the optimized implementation available via kconfig?

I'd prefer not to. My personal feeling is to put the patch in and just be
done with it - these functions are supposed to be used on IO areas which
don't care about access size (in other words, are memory-like rather than
being register-like.) Here's the rather old patch:

From: Russell King <rmk+kernel@arm.linux.org.uk>
Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()

If we are building for a LE platform, and we haven't overridden the
MMIO ops, then we can optimize the mem*io operations using the
standard string functions.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
---
 arch/arm/include/asm/io.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index d070741b2b37..358c8206419b 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -23,6 +23,7 @@
 
 #ifdef __KERNEL__
 
+#include <linux/string.h>
 #include <linux/types.h>
 #include <asm/byteorder.h>
 #include <asm/memory.h>
@@ -312,9 +313,33 @@ extern void _memset_io(volatile void __iomem *, int, size_t);
 #define writesw(p,d,l)		__raw_writesw(p,d,l)
 #define writesl(p,d,l)		__raw_writesl(p,d,l)
 
+#ifndef __ARMBE__
+static inline void memset_io(volatile void __iomem *dst, unsigned c,
+	size_t count)
+{
+	memset((void __force *)dst, c, count);
+}
+#define memset_io(dst,c,count) memset_io(dst,c,count)
+
+static inline void memcpy_fromio(void *to, const volatile void __iomem *from,
+	size_t count)
+{
+	memcpy(to, (const void __force *)from, count);
+}
+#define memcpy_fromio(to,from,count) memcpy_fromio(to,from,count)
+
+static inline void memcpy_toio(volatile void __iomem *to, const void *from,
+	size_t count)
+{
+	memcpy((void __force *)to, from, count);
+}
+#define memcpy_toio(to,from,count) memcpy_toio(to,from,count)
+
+#else
 #define memset_io(c,v,l)	_memset_io(c,(v),(l))
 #define memcpy_fromio(a,c,l)	_memcpy_fromio((a),c,(l))
 #define memcpy_toio(c,a,l)	_memcpy_toio(c,(a),(l))
+#endif
 
 #endif	/* readl */
 
-- 
1.8.3.1

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
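The "memory-like rather than register-like" distinction is about how many bus accesses a copy needs, and how wide each one is. The following userspace sketch counts source accesses for a byte-at-a-time copy versus a copy that uses 32-bit reads where alignment allows. It only illustrates the access-size argument under simple assumptions; `copy_bytewise()`, `copy_wordwise()` and `demo_access_counts()` are hypothetical helpers, not the kernel's `_memcpy_fromio()` or ARM's `memcpy()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 8-bit read per byte; returns the number of source accesses. */
static size_t copy_bytewise(void *to, const volatile void *from, size_t count)
{
	const volatile uint8_t *src = from;
	uint8_t *dst = to;
	size_t accesses = 0;

	assert(to != NULL && from != NULL);
	while (count--) {
		*dst++ = *src++;
		accesses++;
	}
	return accesses;
}

/* 32-bit reads once the source is aligned; bytes at the edges. */
static size_t copy_wordwise(void *to, const volatile void *from, size_t count)
{
	const volatile uint8_t *src = from;
	uint8_t *dst = to;
	size_t accesses = 0;

	assert(to != NULL && from != NULL);
	/* Leading bytes until the source pointer is 4-byte aligned. */
	while (count && ((uintptr_t)src & 3)) {
		*dst++ = *src++;
		count--;
		accesses++;
	}
	/* Bulk: one 32-bit read covers four bytes. */
	while (count >= 4) {
		uint32_t w = *(const volatile uint32_t *)src;

		memcpy(dst, &w, sizeof(w));
		src += 4;
		dst += 4;
		count -= 4;
		accesses++;
	}
	/* Trailing bytes. */
	while (count--) {
		*dst++ = *src++;
		accesses++;
	}
	return accesses;
}

/* Copy a 1 KiB stand-in "framebuffer" both ways and report the access
 * counts. Returns 1 if both copies match the source. */
static int demo_access_counts(size_t *bytewise, size_t *wordwise)
{
	static uint32_t fb[256];
	uint8_t a[sizeof(fb)], b[sizeof(fb)];
	size_t i;

	for (i = 0; i < 256; i++)
		fb[i] = (uint32_t)i * 0x01010101u;
	*bytewise = copy_bytewise(a, fb, sizeof(fb));
	*wordwise = copy_wordwise(b, fb, sizeof(fb));
	return memcmp(a, fb, sizeof(fb)) == 0 &&
	       memcmp(b, fb, sizeof(fb)) == 0;
}
```

For the 1 KiB aligned buffer the byte-wise copy makes 1024 source accesses and the word-wise copy makes 256, with identical results in the destination. On a memory-like region that is a pure win; on a register-like region the change in access width could matter, which is the regression concern raised earlier in the thread.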
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-05-06 10:45 ` Russell King - ARM Linux @ 2015-05-06 20:32 ` Nicolas Pitre 2015-05-12 8:52 ` Pavel Machek 1 sibling, 0 replies; 18+ messages in thread From: Nicolas Pitre @ 2015-05-06 20:32 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Geert Uytterhoeven, Tomi Valkeinen, Pavel Machek, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, H Hartley Sweeten

On Wed, 6 May 2015, Russell King - ARM Linux wrote:

> On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> > On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
> >
> > > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > > That will benefit other drivers on ARM, too.
> > >
> > > That's not going to happen.
> > >
> > > I've had a patch which does that, but people are concerned that it changes
> > > the behaviour of the functions by changing the access size, which could
> > > cause regressions. It seems people are far too worried about that to even
> > > consider trying. :(
> >
> > What about making the optimized implementation available via kconfig?
>
> I'd prefer not to. My personal feeling is to put the patch in and just be
> done with it - these functions are supposed to be used on IO areas which
> don't care about access size (in other words, are memory-like rather than
> being register-like.) Here's the rather old patch:
>
> From: Russell King <rmk+kernel@arm.linux.org.uk>
> Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()
>
> If we are building for a LE platform, and we haven't overridden the
> MMIO ops, then we can optimize the mem*io operations using the
> standard string functions.
>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/include/asm/io.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
>
> diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
> index d070741b2b37..358c8206419b 100644
> --- a/arch/arm/include/asm/io.h
> +++ b/arch/arm/include/asm/io.h
> @@ -23,6 +23,7 @@
>
>  #ifdef __KERNEL__
>
> +#include <linux/string.h>
>  #include <linux/types.h>
>  #include <asm/byteorder.h>
>  #include <asm/memory.h>
> @@ -312,9 +313,33 @@ extern void _memset_io(volatile void __iomem *, int, size_t);
>  #define writesw(p,d,l)		__raw_writesw(p,d,l)
>  #define writesl(p,d,l)		__raw_writesl(p,d,l)
>
> +#ifndef __ARMBE__
> +static inline void memset_io(volatile void __iomem *dst, unsigned c,
> +	size_t count)
> +{
> +	memset((void __force *)dst, c, count);
> +}
> +#define memset_io(dst,c,count) memset_io(dst,c,count)
> +
> +static inline void memcpy_fromio(void *to, const volatile void __iomem *from,
> +	size_t count)
> +{
> +	memcpy(to, (const void __force *)from, count);
> +}
> +#define memcpy_fromio(to,from,count) memcpy_fromio(to,from,count)
> +
> +static inline void memcpy_toio(volatile void __iomem *to, const void *from,
> +	size_t count)
> +{
> +	memcpy((void __force *)to, from, count);
> +}
> +#define memcpy_toio(to,from,count) memcpy_toio(to,from,count)
> +
> +#else
>  #define memset_io(c,v,l)	_memset_io(c,(v),(l))
>  #define memcpy_fromio(a,c,l)	_memcpy_fromio((a),c,(l))
>  #define memcpy_toio(c,a,l)	_memcpy_toio(c,(a),(l))
> +#endif
>
>  #endif	/* readl */
>
> --
> 1.8.3.1
>
>
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
* Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform 2015-05-06 10:45 ` Russell King - ARM Linux 2015-05-06 20:32 ` Nicolas Pitre @ 2015-05-12 8:52 ` Pavel Machek 1 sibling, 0 replies; 18+ messages in thread From: Pavel Machek @ 2015-05-12 8:52 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Nicolas Pitre, Geert Uytterhoeven, Tomi Valkeinen, Archit Taneja, Marek Vasut, kernel list, Dinh Nguyen, Jean-Christophe PLAGNIOL-VILLARD, Grant Likely, Rob Herring, Jingoo Han, Rob Clark, Linux Fbdev development list, devicetree, Alexander Shiyan, H Hartley Sweeten

On Wed 2015-05-06 11:45:04, Russell King - ARM Linux wrote:
> On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> > On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
> >
> > > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > > That will benefit other drivers on ARM, too.
> > >
> > > That's not going to happen.
> > >
> > > I've had a patch which does that, but people are concerned that it changes
> > > the behaviour of the functions by changing the access size, which could
> > > cause regressions. It seems people are far too worried about that to even
> > > consider trying. :(
> >
> > What about making the optimized implementation available via kconfig?
>
> I'd prefer not to. My personal feeling is to put the patch in and just be
> done with it - these functions are supposed to be used on IO areas which
> don't care about access size (in other words, are memory-like rather than
> being register-like.) Here's the rather old patch:
>
> From: Russell King <rmk+kernel@arm.linux.org.uk>
> Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()
>
> If we are building for a LE platform, and we haven't overridden the
> MMIO ops, then we can optimize the mem*io operations using the
> standard string functions.
>
> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

Tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>

Works for me, framebuffer performance is back in "too fast to measure"
range.

When this is merged, should 981409b25e2a99409b26daa67293ca1cfd5ea0a0
be reverted in -stable?

Thanks,
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
end of thread, other threads:[~2015-05-12 8:52 UTC | newest]

Thread overview: 18+ messages
2015-04-07 12:12 simple framebuffer slower by factor of 20, on socfpga (arm) platform Pavel Machek
2015-04-07 12:19 ` Geert Uytterhoeven
2015-04-07 14:24 ` Marek Vasut
2015-04-09 11:06 ` Pavel Machek
2015-04-09 11:21 ` Tomi Valkeinen
2015-04-09 11:34 ` Tomi Valkeinen
2015-04-09 19:51 ` Arnd Bergmann
2015-04-24 13:31 ` Pavel Machek
2015-04-10 7:05 ` Archit Taneja
2015-04-24 13:29 ` Pavel Machek
2015-04-24 13:40 ` Tomi Valkeinen
2015-04-24 13:46 ` Geert Uytterhoeven
2015-04-26 19:31 ` Pavel Machek
2015-04-28 13:48 ` Russell King - ARM Linux
2015-04-28 15:28 ` Nicolas Pitre
2015-05-06 10:45 ` Russell King - ARM Linux
2015-05-06 20:32 ` Nicolas Pitre
2015-05-12 8:52 ` Pavel Machek