All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] fbdev: Improve performance of fbdev console
@ 2022-02-21 19:54 ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Optimize performance of the fbdev console for the common case of
software-based clearing and image blitting.

The commit descripton of each patch contains resuls os a simple
microbenchmark. I also tested the full patchset's effect on the
console output by printing directory listings (i7-4790, FullHD,
simpledrm, kernel with debugging).

  > time find /usr/share/doc -type f

In the unoptimized case:

  real    0m6.173s
  user    0m0.044s
  sys     0m6.107s

With optimizations applied:

  real    0m4.754s
  user    0m0.044s
  sys     0m4.698s

In the optimized case, printing the directory listing is ~25% faster
than before.

In v2 of the patchset, after implementing Sam's suggestion to update
cfb_imageblit() as well, it turns out that the compiled code in
sys_imageblit() is still significantly slower than the CFB version. A
fix is probably a larger task and would include architecture-specific
changes. A new TODO item suggests to investigate the performance of the
various helpers and format-conversion functions in DRM and fbdev.

v2:
	* improve readability for sys_imageblit() (Gerd, Sam)
	* new TODO item for further optimization

Thomas Zimmermann (5):
  fbdev: Improve performance of sys_fillrect()
  fbdev: Improve performance of sys_imageblit()
  fbdev: Remove trailing whitespaces from cfbimgblt.c
  fbdev: Improve performance of cfb_imageblit()
  drm: Add TODO item for optimizing format helpers

 Documentation/gpu/todo.rst             |  22 +++++
 drivers/video/fbdev/core/cfbimgblt.c   | 107 ++++++++++++++++---------
 drivers/video/fbdev/core/sysfillrect.c |  16 +---
 drivers/video/fbdev/core/sysimgblt.c   |  49 ++++++++---
 4 files changed, 133 insertions(+), 61 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 0/5] fbdev: Improve performance of fbdev console
@ 2022-02-21 19:54 ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Optimize performance of the fbdev console for the common case of
software-based clearing and image blitting.

The commit descripton of each patch contains resuls os a simple
microbenchmark. I also tested the full patchset's effect on the
console output by printing directory listings (i7-4790, FullHD,
simpledrm, kernel with debugging).

  > time find /usr/share/doc -type f

In the unoptimized case:

  real    0m6.173s
  user    0m0.044s
  sys     0m6.107s

With optimizations applied:

  real    0m4.754s
  user    0m0.044s
  sys     0m4.698s

In the optimized case, printing the directory listing is ~25% faster
than before.

In v2 of the patchset, after implementing Sam's suggestion to update
cfb_imageblit() as well, it turns out that the compiled code in
sys_imageblit() is still significantly slower than the CFB version. A
fix is probably a larger task and would include architecture-specific
changes. A new TODO item suggests to investigate the performance of the
various helpers and format-conversion functions in DRM and fbdev.

v2:
	* improve readability for sys_imageblit() (Gerd, Sam)
	* new TODO item for further optimization

Thomas Zimmermann (5):
  fbdev: Improve performance of sys_fillrect()
  fbdev: Improve performance of sys_imageblit()
  fbdev: Remove trailing whitespaces from cfbimgblt.c
  fbdev: Improve performance of cfb_imageblit()
  drm: Add TODO item for optimizing format helpers

 Documentation/gpu/todo.rst             |  22 +++++
 drivers/video/fbdev/core/cfbimgblt.c   | 107 ++++++++++++++++---------
 drivers/video/fbdev/core/sysfillrect.c |  16 +---
 drivers/video/fbdev/core/sysimgblt.c   |  49 ++++++++---
 4 files changed, 133 insertions(+), 61 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/5] fbdev: Improve performance of sys_fillrect()
  2022-02-21 19:54 ` Thomas Zimmermann
@ 2022-02-21 19:54   ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Improve the performance of sys_fillrect() by using word-aligned
32/64-bit mov instructions. While the code tried to implement this,
the compiler failed to create fast instructions. The resulting
binary instructions were even slower than cfb_fillrect(), which
uses the same algorithm, but operates on I/O memory.

A microbenchmark measures the average number of CPU cycles
for sys_fillrect() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging). The value
for CFB is given as a reference.

  sys_fillrect(), new:  26586 cycles
  sys_fillrect(), old: 166603 cycles
  cfb_fillrect():       41012 cycles

In the optimized case, sys_fillrect() is now ~6x faster than before
and ~1.5x faster than the CFB implementation.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
---
 drivers/video/fbdev/core/sysfillrect.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/video/fbdev/core/sysfillrect.c b/drivers/video/fbdev/core/sysfillrect.c
index 33ee3d34f9d2..bcdcaeae6538 100644
--- a/drivers/video/fbdev/core/sysfillrect.c
+++ b/drivers/video/fbdev/core/sysfillrect.c
@@ -50,19 +50,9 @@ bitfill_aligned(struct fb_info *p, unsigned long *dst, int dst_idx,
 
 		/* Main chunk */
 		n /= bits;
-		while (n >= 8) {
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			n -= 8;
-		}
-		while (n--)
-			*dst++ = pat;
+		memset_l(dst, pat, n);
+		dst += n;
+
 		/* Trailing bits */
 		if (last)
 			*dst = comp(pat, *dst, last);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 1/5] fbdev: Improve performance of sys_fillrect()
@ 2022-02-21 19:54   ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Improve the performance of sys_fillrect() by using word-aligned
32/64-bit mov instructions. While the code tried to implement this,
the compiler failed to create fast instructions. The resulting
binary instructions were even slower than cfb_fillrect(), which
uses the same algorithm, but operates on I/O memory.

A microbenchmark measures the average number of CPU cycles
for sys_fillrect() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging). The value
for CFB is given as a reference.

  sys_fillrect(), new:  26586 cycles
  sys_fillrect(), old: 166603 cycles
  cfb_fillrect():       41012 cycles

In the optimized case, sys_fillrect() is now ~6x faster than before
and ~1.5x faster than the CFB implementation.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
---
 drivers/video/fbdev/core/sysfillrect.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/video/fbdev/core/sysfillrect.c b/drivers/video/fbdev/core/sysfillrect.c
index 33ee3d34f9d2..bcdcaeae6538 100644
--- a/drivers/video/fbdev/core/sysfillrect.c
+++ b/drivers/video/fbdev/core/sysfillrect.c
@@ -50,19 +50,9 @@ bitfill_aligned(struct fb_info *p, unsigned long *dst, int dst_idx,
 
 		/* Main chunk */
 		n /= bits;
-		while (n >= 8) {
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			*dst++ = pat;
-			n -= 8;
-		}
-		while (n--)
-			*dst++ = pat;
+		memset_l(dst, pat, n);
+		dst += n;
+
 		/* Trailing bits */
 		if (last)
 			*dst = comp(pat, *dst, last);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/5] fbdev: Improve performance of sys_imageblit()
  2022-02-21 19:54 ` Thomas Zimmermann
@ 2022-02-21 19:54   ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Improve the performance of sys_imageblit() by manually unrolling
the inner blitting loop and moving some invariants out. The compiler
failed to do this automatically. The resulting binary code was even
slower than the cfb_imageblit() helper, which uses the same algorithm,
but operates on I/O memory.

A microbenchmark measures the average number of CPU cycles
for sys_imageblit() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging). The value
for CFB is given as a reference.

  sys_imageblit(), new: 25934 cycles
  sys_imageblit(), old: 35944 cycles
  cfb_imageblit():      30566 cycles

In the optimized case, sys_imageblit() is now ~30% faster than before
and ~20% faster than cfb_imageblit().

v2:
	* move switch out of inner loop (Gerd)
	* remove test for alignment of dst1 (Sam)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
---
 drivers/video/fbdev/core/sysimgblt.c | 49 +++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/video/fbdev/core/sysimgblt.c b/drivers/video/fbdev/core/sysimgblt.c
index a4d05b1b17d7..722c327a381b 100644
--- a/drivers/video/fbdev/core/sysimgblt.c
+++ b/drivers/video/fbdev/core/sysimgblt.c
@@ -188,23 +188,29 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 *dst;
-	const u32 *tab = NULL;
+	const u32 *tab;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -218,19 +224,40 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
+
 	for (i = image->height; i--; ) {
 		dst = dst1;
-		shift = 8;
 		src = s;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			*dst++ = (end_mask & eorx) ^ bgx;
-			if (!shift) {
-				shift = 8;
-				src++;
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				*dst++ = colortab[(*src >> 7) & bit_mask];
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 5) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 3) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 1) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
 			}
+			break;
 		}
 		dst1 += p->fix.line_length;
 		s += spitch;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/5] fbdev: Improve performance of sys_imageblit()
@ 2022-02-21 19:54   ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Improve the performance of sys_imageblit() by manually unrolling
the inner blitting loop and moving some invariants out. The compiler
failed to do this automatically. The resulting binary code was even
slower than the cfb_imageblit() helper, which uses the same algorithm,
but operates on I/O memory.

A microbenchmark measures the average number of CPU cycles
for sys_imageblit() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging). The value
for CFB is given as a reference.

  sys_imageblit(), new: 25934 cycles
  sys_imageblit(), old: 35944 cycles
  cfb_imageblit():      30566 cycles

In the optimized case, sys_imageblit() is now ~30% faster than before
and ~20% faster than cfb_imageblit().

v2:
	* move switch out of inner loop (Gerd)
	* remove test for alignment of dst1 (Sam)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
---
 drivers/video/fbdev/core/sysimgblt.c | 49 +++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/video/fbdev/core/sysimgblt.c b/drivers/video/fbdev/core/sysimgblt.c
index a4d05b1b17d7..722c327a381b 100644
--- a/drivers/video/fbdev/core/sysimgblt.c
+++ b/drivers/video/fbdev/core/sysimgblt.c
@@ -188,23 +188,29 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 *dst;
-	const u32 *tab = NULL;
+	const u32 *tab;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -218,19 +224,40 @@ static void fast_imageblit(const struct fb_image *image, struct fb_info *p,
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
+
 	for (i = image->height; i--; ) {
 		dst = dst1;
-		shift = 8;
 		src = s;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			*dst++ = (end_mask & eorx) ^ bgx;
-			if (!shift) {
-				shift = 8;
-				src++;
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				*dst++ = colortab[(*src >> 7) & bit_mask];
+				*dst++ = colortab[(*src >> 6) & bit_mask];
+				*dst++ = colortab[(*src >> 5) & bit_mask];
+				*dst++ = colortab[(*src >> 4) & bit_mask];
+				*dst++ = colortab[(*src >> 3) & bit_mask];
+				*dst++ = colortab[(*src >> 2) & bit_mask];
+				*dst++ = colortab[(*src >> 1) & bit_mask];
+				*dst++ = colortab[(*src >> 0) & bit_mask];
 			}
+			break;
 		}
 		dst1 += p->fix.line_length;
 		s += spitch;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/5] fbdev: Remove trailing whitespaces from cfbimgblt.c
  2022-02-21 19:54 ` Thomas Zimmermann
@ 2022-02-21 19:54   ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Fix coding style. No functional changes.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/video/fbdev/core/cfbimgblt.c | 60 ++++++++++++++--------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c
index a2bb276a8b24..01b01a279681 100644
--- a/drivers/video/fbdev/core/cfbimgblt.c
+++ b/drivers/video/fbdev/core/cfbimgblt.c
@@ -16,15 +16,15 @@
  *  must be laid out exactly in the same format as the framebuffer. Yes I know
  *  their are cards with hardware that coverts images of various depths to the
  *  framebuffer depth. But not every card has this. All images must be rounded
- *  up to the nearest byte. For example a bitmap 12 bits wide must be two 
- *  bytes width. 
+ *  up to the nearest byte. For example a bitmap 12 bits wide must be two
+ *  bytes width.
  *
- *  Tony: 
- *  Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API.  This speeds 
+ *  Tony:
+ *  Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API.  This speeds
  *  up the code significantly.
- *  
+ *
  *  Code for depths not multiples of BITS_PER_LONG is still kludgy, which is
- *  still processed a bit at a time.   
+ *  still processed a bit at a time.
  *
  *  Also need to add code to deal with cards endians that are different than
  *  the native cpu endians. I also need to deal with MSB position in the word.
@@ -72,8 +72,8 @@ static const u32 cfb_tab32[] = {
 #define FB_WRITEL fb_writel
 #define FB_READL  fb_readl
 
-static inline void color_imageblit(const struct fb_image *image, 
-				   struct fb_info *p, u8 __iomem *dst1, 
+static inline void color_imageblit(const struct fb_image *image,
+				   struct fb_info *p, u8 __iomem *dst1,
 				   u32 start_index,
 				   u32 pitch_index)
 {
@@ -92,7 +92,7 @@ static inline void color_imageblit(const struct fb_image *image,
 		dst = (u32 __iomem *) dst1;
 		shift = 0;
 		val = 0;
-		
+
 		if (start_index) {
 			u32 start_mask = ~fb_shifted_pixels_mask_u32(p,
 						start_index, bswapmask);
@@ -109,8 +109,8 @@ static inline void color_imageblit(const struct fb_image *image,
 			val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask);
 			if (shift >= null_bits) {
 				FB_WRITEL(val, dst++);
-	
-				val = (shift == null_bits) ? 0 : 
+
+				val = (shift == null_bits) ? 0 :
 					FB_SHIFT_LOW(p, color, 32 - shift);
 			}
 			shift += bpp;
@@ -134,9 +134,9 @@ static inline void color_imageblit(const struct fb_image *image,
 	}
 }
 
-static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p, 
+static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p,
 				  u8 __iomem *dst1, u32 fgcolor,
-				  u32 bgcolor, 
+				  u32 bgcolor,
 				  u32 start_index,
 				  u32 pitch_index)
 {
@@ -172,7 +172,7 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
 			l--;
 			color = (*s & (1 << l)) ? fgcolor : bgcolor;
 			val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask);
-			
+
 			/* Did the bitshift spill bits to the next long? */
 			if (shift >= null_bits) {
 				FB_WRITEL(val, dst++);
@@ -191,16 +191,16 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
 
 			FB_WRITEL((FB_READL(dst) & end_mask) | val, dst);
 		}
-		
+
 		dst1 += pitch;
-		src += spitch;	
+		src += spitch;
 		if (pitch_index) {
 			dst2 += pitch;
 			dst1 = (u8 __iomem *)((long __force)dst2 & ~(sizeof(u32) - 1));
 			start_index += pitch_index;
 			start_index &= 32 - 1;
 		}
-		
+
 	}
 }
 
@@ -212,9 +212,9 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
  *           fix->line_legth is divisible by 4;
  *           beginning and end of a scanline is dword aligned
  */
-static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p, 
-				  u8 __iomem *dst1, u32 fgcolor, 
-				  u32 bgcolor) 
+static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p,
+				  u8 __iomem *dst1, u32 fgcolor,
+				  u32 bgcolor)
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
@@ -243,25 +243,25 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 		fgx |= fgcolor;
 		bgx |= bgcolor;
 	}
-	
+
 	bit_mask = (1 << ppw) - 1;
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
 	for (i = image->height; i--; ) {
 		dst = (u32 __iomem *) dst1, shift = 8; src = s;
-		
+
 		for (j = k; j--; ) {
 			shift -= ppw;
 			end_mask = tab[(*src >> shift) & bit_mask];
 			FB_WRITEL((end_mask & eorx)^bgx, dst++);
-			if (!shift) { shift = 8; src++; }		
+			if (!shift) { shift = 8; src++; }
 		}
 		dst1 += p->fix.line_length;
 		s += spitch;
 	}
-}	
-	
+}
+
 void cfb_imageblit(struct fb_info *p, const struct fb_image *image)
 {
 	u32 fgcolor, bgcolor, start_index, bitstart, pitch_index = 0;
@@ -292,13 +292,13 @@ void cfb_imageblit(struct fb_info *p, const struct fb_image *image)
 		} else {
 			fgcolor = image->fg_color;
 			bgcolor = image->bg_color;
-		}	
-		
-		if (32 % bpp == 0 && !start_index && !pitch_index && 
+		}
+
+		if (32 % bpp == 0 && !start_index && !pitch_index &&
 		    ((width & (32/bpp-1)) == 0) &&
-		    bpp >= 8 && bpp <= 32) 			
+		    bpp >= 8 && bpp <= 32)
 			fast_imageblit(image, p, dst1, fgcolor, bgcolor);
-		else 
+		else
 			slow_imageblit(image, p, dst1, fgcolor, bgcolor,
 					start_index, pitch_index);
 	} else
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/5] fbdev: Remove trailing whitespaces from cfbimgblt.c
@ 2022-02-21 19:54   ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Fix coding style. No functional changes.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/video/fbdev/core/cfbimgblt.c | 60 ++++++++++++++--------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c
index a2bb276a8b24..01b01a279681 100644
--- a/drivers/video/fbdev/core/cfbimgblt.c
+++ b/drivers/video/fbdev/core/cfbimgblt.c
@@ -16,15 +16,15 @@
  *  must be laid out exactly in the same format as the framebuffer. Yes I know
  *  their are cards with hardware that coverts images of various depths to the
  *  framebuffer depth. But not every card has this. All images must be rounded
- *  up to the nearest byte. For example a bitmap 12 bits wide must be two 
- *  bytes width. 
+ *  up to the nearest byte. For example a bitmap 12 bits wide must be two
+ *  bytes width.
  *
- *  Tony: 
- *  Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API.  This speeds 
+ *  Tony:
+ *  Incorporate mask tables similar to fbcon-cfb*.c in 2.4 API.  This speeds
  *  up the code significantly.
- *  
+ *
  *  Code for depths not multiples of BITS_PER_LONG is still kludgy, which is
- *  still processed a bit at a time.   
+ *  still processed a bit at a time.
  *
  *  Also need to add code to deal with cards endians that are different than
  *  the native cpu endians. I also need to deal with MSB position in the word.
@@ -72,8 +72,8 @@ static const u32 cfb_tab32[] = {
 #define FB_WRITEL fb_writel
 #define FB_READL  fb_readl
 
-static inline void color_imageblit(const struct fb_image *image, 
-				   struct fb_info *p, u8 __iomem *dst1, 
+static inline void color_imageblit(const struct fb_image *image,
+				   struct fb_info *p, u8 __iomem *dst1,
 				   u32 start_index,
 				   u32 pitch_index)
 {
@@ -92,7 +92,7 @@ static inline void color_imageblit(const struct fb_image *image,
 		dst = (u32 __iomem *) dst1;
 		shift = 0;
 		val = 0;
-		
+
 		if (start_index) {
 			u32 start_mask = ~fb_shifted_pixels_mask_u32(p,
 						start_index, bswapmask);
@@ -109,8 +109,8 @@ static inline void color_imageblit(const struct fb_image *image,
 			val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask);
 			if (shift >= null_bits) {
 				FB_WRITEL(val, dst++);
-	
-				val = (shift == null_bits) ? 0 : 
+
+				val = (shift == null_bits) ? 0 :
 					FB_SHIFT_LOW(p, color, 32 - shift);
 			}
 			shift += bpp;
@@ -134,9 +134,9 @@ static inline void color_imageblit(const struct fb_image *image,
 	}
 }
 
-static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p, 
+static inline void slow_imageblit(const struct fb_image *image, struct fb_info *p,
 				  u8 __iomem *dst1, u32 fgcolor,
-				  u32 bgcolor, 
+				  u32 bgcolor,
 				  u32 start_index,
 				  u32 pitch_index)
 {
@@ -172,7 +172,7 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
 			l--;
 			color = (*s & (1 << l)) ? fgcolor : bgcolor;
 			val |= FB_SHIFT_HIGH(p, color, shift ^ bswapmask);
-			
+
 			/* Did the bitshift spill bits to the next long? */
 			if (shift >= null_bits) {
 				FB_WRITEL(val, dst++);
@@ -191,16 +191,16 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
 
 			FB_WRITEL((FB_READL(dst) & end_mask) | val, dst);
 		}
-		
+
 		dst1 += pitch;
-		src += spitch;	
+		src += spitch;
 		if (pitch_index) {
 			dst2 += pitch;
 			dst1 = (u8 __iomem *)((long __force)dst2 & ~(sizeof(u32) - 1));
 			start_index += pitch_index;
 			start_index &= 32 - 1;
 		}
-		
+
 	}
 }
 
@@ -212,9 +212,9 @@ static inline void slow_imageblit(const struct fb_image *image, struct fb_info *
  *           fix->line_legth is divisible by 4;
  *           beginning and end of a scanline is dword aligned
  */
-static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p, 
-				  u8 __iomem *dst1, u32 fgcolor, 
-				  u32 bgcolor) 
+static inline void fast_imageblit(const struct fb_image *image, struct fb_info *p,
+				  u8 __iomem *dst1, u32 fgcolor,
+				  u32 bgcolor)
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
@@ -243,25 +243,25 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 		fgx |= fgcolor;
 		bgx |= bgcolor;
 	}
-	
+
 	bit_mask = (1 << ppw) - 1;
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
 	for (i = image->height; i--; ) {
 		dst = (u32 __iomem *) dst1, shift = 8; src = s;
-		
+
 		for (j = k; j--; ) {
 			shift -= ppw;
 			end_mask = tab[(*src >> shift) & bit_mask];
 			FB_WRITEL((end_mask & eorx)^bgx, dst++);
-			if (!shift) { shift = 8; src++; }		
+			if (!shift) { shift = 8; src++; }
 		}
 		dst1 += p->fix.line_length;
 		s += spitch;
 	}
-}	
-	
+}
+
 void cfb_imageblit(struct fb_info *p, const struct fb_image *image)
 {
 	u32 fgcolor, bgcolor, start_index, bitstart, pitch_index = 0;
@@ -292,13 +292,13 @@ void cfb_imageblit(struct fb_info *p, const struct fb_image *image)
 		} else {
 			fgcolor = image->fg_color;
 			bgcolor = image->bg_color;
-		}	
-		
-		if (32 % bpp == 0 && !start_index && !pitch_index && 
+		}
+
+		if (32 % bpp == 0 && !start_index && !pitch_index &&
 		    ((width & (32/bpp-1)) == 0) &&
-		    bpp >= 8 && bpp <= 32) 			
+		    bpp >= 8 && bpp <= 32)
 			fast_imageblit(image, p, dst1, fgcolor, bgcolor);
-		else 
+		else
 			slow_imageblit(image, p, dst1, fgcolor, bgcolor,
 					start_index, pitch_index);
 	} else
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit()
  2022-02-21 19:54 ` Thomas Zimmermann
@ 2022-02-21 19:54   ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Improve the performance of sys_imageblit() by manually unrolling
the inner blitting loop and moving some invariants out. The compiler
failed to do this automatically. This change keeps cfb_imageblit()
in sync with sys_imagebit().

A microbenchmark measures the average number of CPU cycles
for sys_imageblit() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging).

sys_imageblit(), new: 15724 cycles
cfb_imageblit(): old: 30566 cycles

In the optimized case, cfb_imageblit() is now ~2x faster than before.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c
index 01b01a279681..7361cfabdd85 100644
--- a/drivers/video/fbdev/core/cfbimgblt.c
+++ b/drivers/video/fbdev/core/cfbimgblt.c
@@ -218,23 +218,29 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 __iomem *dst;
 	const u32 *tab = NULL;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -248,15 +254,42 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
-	for (i = image->height; i--; ) {
-		dst = (u32 __iomem *) dst1, shift = 8; src = s;
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			FB_WRITEL((end_mask & eorx)^bgx, dst++);
-			if (!shift) { shift = 8; src++; }
+	for (i = image->height; i--; ) {
+		dst = (u32 __iomem *)dst1;
+		src = s;
+
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				FB_WRITEL(colortab[(*src >> 7) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 5) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 3) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 1) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
 		}
+
 		dst1 += p->fix.line_length;
 		s += spitch;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit()
@ 2022-02-21 19:54   ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Improve the performance of sys_imageblit() by manually unrolling
the inner blitting loop and moving some invariants out. The compiler
failed to do this automatically. This change keeps cfb_imageblit()
in sync with sys_imagebit().

A microbenchmark measures the average number of CPU cycles
for sys_imageblit() after a stabilizing period of a few minutes
(i7-4790, FullHD, simpledrm, kernel with debugging).

sys_imageblit(), new: 15724 cycles
cfb_imageblit(): old: 30566 cycles

In the optimized case, cfb_imageblit() is now ~2x faster than before.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/drivers/video/fbdev/core/cfbimgblt.c b/drivers/video/fbdev/core/cfbimgblt.c
index 01b01a279681..7361cfabdd85 100644
--- a/drivers/video/fbdev/core/cfbimgblt.c
+++ b/drivers/video/fbdev/core/cfbimgblt.c
@@ -218,23 +218,29 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 {
 	u32 fgx = fgcolor, bgx = bgcolor, bpp = p->var.bits_per_pixel;
 	u32 ppw = 32/bpp, spitch = (image->width + 7)/8;
-	u32 bit_mask, end_mask, eorx, shift;
+	u32 bit_mask, eorx;
 	const char *s = image->data, *src;
 	u32 __iomem *dst;
 	const u32 *tab = NULL;
+	size_t tablen;
+	u32 colortab[16];
 	int i, j, k;
 
 	switch (bpp) {
 	case 8:
 		tab = fb_be_math(p) ? cfb_tab8_be : cfb_tab8_le;
+		tablen = 16;
 		break;
 	case 16:
 		tab = fb_be_math(p) ? cfb_tab16_be : cfb_tab16_le;
+		tablen = 4;
 		break;
 	case 32:
-	default:
 		tab = cfb_tab32;
+		tablen = 2;
 		break;
+	default:
+		return;
 	}
 
 	for (i = ppw-1; i--; ) {
@@ -248,15 +254,42 @@ static inline void fast_imageblit(const struct fb_image *image, struct fb_info *
 	eorx = fgx ^ bgx;
 	k = image->width/ppw;
 
-	for (i = image->height; i--; ) {
-		dst = (u32 __iomem *) dst1, shift = 8; src = s;
+	for (i = 0; i < tablen; ++i)
+		colortab[i] = (tab[i] & eorx) ^ bgx;
 
-		for (j = k; j--; ) {
-			shift -= ppw;
-			end_mask = tab[(*src >> shift) & bit_mask];
-			FB_WRITEL((end_mask & eorx)^bgx, dst++);
-			if (!shift) { shift = 8; src++; }
+	for (i = image->height; i--; ) {
+		dst = (u32 __iomem *)dst1;
+		src = s;
+
+		switch (ppw) {
+		case 4: /* 8 bpp */
+			for (j = k; j; j -= 2, ++src) {
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 2: /* 16 bpp */
+			for (j = k; j; j -= 4, ++src) {
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
+		case 1: /* 32 bpp */
+			for (j = k; j; j -= 8, ++src) {
+				FB_WRITEL(colortab[(*src >> 7) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 6) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 5) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 4) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 3) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 2) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 1) & bit_mask], dst++);
+				FB_WRITEL(colortab[(*src >> 0) & bit_mask], dst++);
+			}
+			break;
 		}
+
 		dst1 += p->fix.line_length;
 		s += spitch;
 	}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/5] drm: Add TODO item for optimizing format helpers
  2022-02-21 19:54 ` Thomas Zimmermann
@ 2022-02-21 19:54   ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: dri-devel, linux-fbdev, Thomas Zimmermann

Add a TODO item for optimizing blitting and format-conversion helpers
in DRM and fbdev. There's always demand for faster graphics output.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 Documentation/gpu/todo.rst | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst
index 7bf7f2111696..7f113c6a02dd 100644
--- a/Documentation/gpu/todo.rst
+++ b/Documentation/gpu/todo.rst
@@ -241,6 +241,28 @@ Contact: Thomas Zimmermann <tzimmermann@suse.de>, Daniel Vetter
 
 Level: Advanced
 
+Benchmark and optimize blitting and format-conversion function
+--------------------------------------------------------------
+
+Drawing to dispay memory quickly is crucial for many applications'
+performance.
+
+On at least x86-64, sys_imageblit() is significantly slower than
+cfb_imageblit(), even though both use the same blitting algorithm and
+the latter is written for I/O memory. It turns out that cfb_imageblit()
+uses movl instructions, while sys_imageblit apparently does not. This
+seems to be a problem with gcc's optimizer. DRM's format-conversion
+heleprs might be subject to similar issues.
+
+Benchmark and optimize fbdev's sys_() helpers and DRM's format-conversion
+helpers. In cases that can be further optimized, maybe implement a different
+algorithm, For micro-optimizations, use movl/movq instructions explicitly.
+That might possibly require architecture specific helpers (e.g., storel()
+storeq()).
+
+Contact: Thomas Zimmermann <tzimmermann@suse.de>
+
+Level: Intermediate
 
 drm_framebuffer_funcs and drm_mode_config_funcs.fb_create cleanup
 -----------------------------------------------------------------
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 5/5] drm: Add TODO item for optimizing format helpers
@ 2022-02-21 19:54   ` Thomas Zimmermann
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-21 19:54 UTC (permalink / raw)
  To: daniel, deller, javierm, geert, sam, kraxel
  Cc: linux-fbdev, Thomas Zimmermann, dri-devel

Add a TODO item for optimizing blitting and format-conversion helpers
in DRM and fbdev. There's always demand for faster graphics output.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 Documentation/gpu/todo.rst | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst
index 7bf7f2111696..7f113c6a02dd 100644
--- a/Documentation/gpu/todo.rst
+++ b/Documentation/gpu/todo.rst
@@ -241,6 +241,28 @@ Contact: Thomas Zimmermann <tzimmermann@suse.de>, Daniel Vetter
 
 Level: Advanced
 
+Benchmark and optimize blitting and format-conversion function
+--------------------------------------------------------------
+
+Drawing to dispay memory quickly is crucial for many applications'
+performance.
+
+On at least x86-64, sys_imageblit() is significantly slower than
+cfb_imageblit(), even though both use the same blitting algorithm and
+the latter is written for I/O memory. It turns out that cfb_imageblit()
+uses movl instructions, while sys_imageblit apparently does not. This
+seems to be a problem with gcc's optimizer. DRM's format-conversion
+heleprs might be subject to similar issues.
+
+Benchmark and optimize fbdev's sys_() helpers and DRM's format-conversion
+helpers. In cases that can be further optimized, maybe implement a different
+algorithm, For micro-optimizations, use movl/movq instructions explicitly.
+That might possibly require architecture specific helpers (e.g., storel()
+storeq()).
+
+Contact: Thomas Zimmermann <tzimmermann@suse.de>
+
+Level: Intermediate
 
 drm_framebuffer_funcs and drm_mode_config_funcs.fb_create cleanup
 -----------------------------------------------------------------
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit()
  2022-02-21 19:54   ` Thomas Zimmermann
@ 2022-02-22 13:01     ` Pekka Paalanen
  -1 siblings, 0 replies; 15+ messages in thread
From: Pekka Paalanen @ 2022-02-22 13:01 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: daniel, deller, javierm, geert, sam, kraxel, linux-fbdev, dri-devel

[-- Attachment #1: Type: text/plain, Size: 999 bytes --]

On Mon, 21 Feb 2022 20:54:09 +0100
Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Improve the performance of sys_imageblit() by manually unrolling

sys?

> the inner blitting loop and moving some invariants out. The compiler
> failed to do this automatically. This change keeps cfb_imageblit()
> in sync with sys_imagebit().

This is correct here.

> 
> A microbenchmark measures the average number of CPU cycles
> for sys_imageblit() after a stabilizing period of a few minutes

sys?

> (i7-4790, FullHD, simpledrm, kernel with debugging).
> 
> sys_imageblit(), new: 15724 cycles

sys?

> cfb_imageblit(): old: 30566 cycles
> 
> In the optimized case, cfb_imageblit() is now ~2x faster than before.
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> ---
>  drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
>  1 file changed, 42 insertions(+), 9 deletions(-)

Just noticed some confusion in the commit message.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit()
@ 2022-02-22 13:01     ` Pekka Paalanen
  0 siblings, 0 replies; 15+ messages in thread
From: Pekka Paalanen @ 2022-02-22 13:01 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: linux-fbdev, deller, javierm, dri-devel, geert, kraxel, sam

[-- Attachment #1: Type: text/plain, Size: 999 bytes --]

On Mon, 21 Feb 2022 20:54:09 +0100
Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Improve the performance of sys_imageblit() by manually unrolling

sys?

> the inner blitting loop and moving some invariants out. The compiler
> failed to do this automatically. This change keeps cfb_imageblit()
> in sync with sys_imagebit().

This is correct here.

> 
> A microbenchmark measures the average number of CPU cycles
> for sys_imageblit() after a stabilizing period of a few minutes

sys?

> (i7-4790, FullHD, simpledrm, kernel with debugging).
> 
> sys_imageblit(), new: 15724 cycles

sys?

> cfb_imageblit(): old: 30566 cycles
> 
> In the optimized case, cfb_imageblit() is now ~2x faster than before.
> 
> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
> ---
>  drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
>  1 file changed, 42 insertions(+), 9 deletions(-)

Just noticed some confusion in the commit message.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit()
  2022-02-22 13:01     ` Pekka Paalanen
  (?)
@ 2022-02-22 18:48     ` Thomas Zimmermann
  -1 siblings, 0 replies; 15+ messages in thread
From: Thomas Zimmermann @ 2022-02-22 18:48 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: linux-fbdev, deller, javierm, dri-devel, geert, kraxel, sam


[-- Attachment #1.1: Type: text/plain, Size: 1434 bytes --]

Hi

Am 22.02.22 um 14:01 schrieb Pekka Paalanen:
> On Mon, 21 Feb 2022 20:54:09 +0100
> Thomas Zimmermann <tzimmermann@suse.de> wrote:
> 
>> Improve the performance of sys_imageblit() by manually unrolling
> 
> sys?
> 
>> the inner blitting loop and moving some invariants out. The compiler
>> failed to do this automatically. This change keeps cfb_imageblit()
>> in sync with sys_imagebit().
> 
> This is correct here.
> 
>>
>> A microbenchmark measures the average number of CPU cycles
>> for sys_imageblit() after a stabilizing period of a few minutes
> 
> sys?
> 
>> (i7-4790, FullHD, simpledrm, kernel with debugging).
>>
>> sys_imageblit(), new: 15724 cycles
> 
> sys?
> 
>> cfb_imageblit(): old: 30566 cycles
>>
>> In the optimized case, cfb_imageblit() is now ~2x faster than before.
>>
>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>> ---
>>   drivers/video/fbdev/core/cfbimgblt.c | 51 +++++++++++++++++++++++-----
>>   1 file changed, 42 insertions(+), 9 deletions(-)
> 
> Just noticed some confusion in the commit message.

I copied some of the text from the other commit and I could have sworn I 
updated it. But apparently not.

Best regards
Thomas

> 
> 
> Thanks,
> pq

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 840 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-02-22 18:48 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-21 19:54 [PATCH v2 0/5] fbdev: Improve performance of fbdev console Thomas Zimmermann
2022-02-21 19:54 ` Thomas Zimmermann
2022-02-21 19:54 ` [PATCH v2 1/5] fbdev: Improve performance of sys_fillrect() Thomas Zimmermann
2022-02-21 19:54   ` Thomas Zimmermann
2022-02-21 19:54 ` [PATCH v2 2/5] fbdev: Improve performance of sys_imageblit() Thomas Zimmermann
2022-02-21 19:54   ` Thomas Zimmermann
2022-02-21 19:54 ` [PATCH v2 3/5] fbdev: Remove trailing whitespaces from cfbimgblt.c Thomas Zimmermann
2022-02-21 19:54   ` Thomas Zimmermann
2022-02-21 19:54 ` [PATCH v2 4/5] fbdev: Improve performance of cfb_imageblit() Thomas Zimmermann
2022-02-21 19:54   ` Thomas Zimmermann
2022-02-22 13:01   ` Pekka Paalanen
2022-02-22 13:01     ` Pekka Paalanen
2022-02-22 18:48     ` Thomas Zimmermann
2022-02-21 19:54 ` [PATCH v2 5/5] drm: Add TODO item for optimizing format helpers Thomas Zimmermann
2022-02-21 19:54   ` Thomas Zimmermann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.