* [PATCH v2 0/8] Add new formats support to vkms
@ 2021-10-26 11:34 Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 1/8] drm: vkms: Replace the deprecated drm_mode_config_init Igor Torrente
                   ` (9 more replies)
  0 siblings, 10 replies; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

Summary
=======
This patch series refactors some vkms components in order to introduce
new formats to the planes and writeback connector.

In the blend function, the planes' pixels are now converted to ARGB16161616
and then blended together.

The CRC is calculated from the ARGB16161616 buffer. If required, this
buffer is then copied/converted to the writeback buffer format.

To handle the pixel conversion, new functions were added to convert
from a specific format to ARGB16161616 and back.

Tests
=====
This patch series was tested using the following igt tests:
-t ".*kms_plane.*"
-t ".*kms_writeback.*"
-t ".*kms_cursor_crc*"
-t ".*kms_flip.*"

New tests passing
-------------------
- pipe-A-cursor-size-change
- pipe-A-cursor-alpha-transparent

Performance
-----------
Following some optimizations proposed by Pekka Paalanen, the code now
runs much faster than V1 and slightly faster than the current implementation.

|                          Frametime                          |
|:---------------:|:---------:|:--------------:|:------------:|
| implementation  |  Current  |  Per-pixel(V1) | Per-line(V2) |
| frametime range |  8~22 ms  |    32~56 ms    |    6~19 ms   |
|     Average     |  10.0 ms  |     35.8 ms    |    8.6 ms    |

Writeback test
--------------
During the development of this patch series, I discovered that the
writeback-check-output test wasn't filling the plane correctly.

So this patch series currently fails this test, but I sent a patch to
igt to fix it[1].

XRGB to ARGB behavior
=====================
During the development, I decided to always fill the alpha channel of
the output pixel whenever the conversion from a format without an alpha
channel to ARGB16161616 is necessary. Therefore, I ignore the value
received from the XRGB and overwrite the value with 0xFFFF.
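
As an illustration of the behavior described above, here is a minimal
userspace sketch (not the code from this series). It assumes the
ARGB16161616 pixel is packed into a u64 with alpha in bits 63..48, and
that 8-bit channels are widened by multiplying by 257 so that 0xff maps
to 0xffff. The names expand8() and xrgb8888_to_argb16161616() are
hypothetical.

```c
#include <stdint.h>

/* Widen an 8-bit channel to 16 bits; 0xff becomes 0xffff (c * 257). */
static uint16_t expand8(uint8_t c)
{
	return (uint16_t)(c * 257u);
}

/*
 * Convert one XRGB8888 pixel to ARGB16161616.
 * The X byte is ignored and alpha is always set to 0xffff (opaque).
 */
static uint64_t xrgb8888_to_argb16161616(uint32_t xrgb)
{
	uint64_t r = expand8((xrgb >> 16) & 0xff);
	uint64_t g = expand8((xrgb >> 8) & 0xff);
	uint64_t b = expand8(xrgb & 0xff);

	return (0xffffULL << 48) | (r << 32) | (g << 16) | b;
}
```

With this sketch, whatever is stored in the X byte of the input never
reaches the output pixel.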

---
Igor Torrente (8):
  drm: vkms: Replace the deprecated drm_mode_config_init
  drm: vkms: Alloc the compose frame using vzalloc
  drm: vkms: Replace hardcoded value of `vkms_composer.map` to
    DRM_FORMAT_MAX_PLANES
  drm: vkms: Add fb information to `vkms_writeback_job`
  drm: drm_atomic_helper: Add a new helper to deal with the writeback
    connector validation
  drm: vkms: Refactor the plane composer to accept new formats
  drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats
  drm: vkms: Add support to the RGB565 format

 drivers/gpu/drm/drm_atomic_helper.c   |  47 ++++
 drivers/gpu/drm/vkms/vkms_composer.c  | 329 +++++++++++++++-----------
 drivers/gpu/drm/vkms/vkms_drv.c       |   6 +-
 drivers/gpu/drm/vkms/vkms_drv.h       |  14 +-
 drivers/gpu/drm/vkms/vkms_formats.h   | 252 ++++++++++++++++++++
 drivers/gpu/drm/vkms/vkms_plane.c     |  17 +-
 drivers/gpu/drm/vkms/vkms_writeback.c |  33 ++-
 include/drm/drm_atomic_helper.h       |   3 +
 8 files changed, 545 insertions(+), 156 deletions(-)
 create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

-- 
2.30.2



* [PATCH v2 1/8] drm: vkms: Replace the deprecated drm_mode_config_init
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 2/8] drm: vkms: Alloc the compose frame using vzalloc Igor Torrente
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

`drm_mode_config_init` has been deprecated since commit c3b790e. Replace
it with `drmm_mode_config_init`.

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
V2: Change the code style(Thomas Zimmermann).
---
 drivers/gpu/drm/vkms/vkms_drv.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.c b/drivers/gpu/drm/vkms/vkms_drv.c
index 0ffe5f0e33f7..ee4d96dabe19 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.c
+++ b/drivers/gpu/drm/vkms/vkms_drv.c
@@ -140,8 +140,12 @@ static const struct drm_mode_config_helper_funcs vkms_mode_config_helpers = {
 static int vkms_modeset_init(struct vkms_device *vkmsdev)
 {
 	struct drm_device *dev = &vkmsdev->drm;
+	int ret;
+
+	ret = drmm_mode_config_init(dev);
+	if (ret < 0)
+		return ret;
 
-	drm_mode_config_init(dev);
 	dev->mode_config.funcs = &vkms_mode_funcs;
 	dev->mode_config.min_width = XRES_MIN;
 	dev->mode_config.min_height = YRES_MIN;
-- 
2.30.2



* [PATCH v2 2/8] drm: vkms: Alloc the compose frame using vzalloc
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 1/8] drm: vkms: Replace the deprecated drm_mode_config_init Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES Igor Torrente
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

Currently, the memory for the composition frame is allocated with
kzalloc. This comes with a maximum-size limitation of one page (which on
x86_64 is 4KB by default and 4MB with hugepages).

Some igt tests (e.g. kms_plane@pixel-format) use more than 4MB when
testing some pixel formats like ARGB16161616.

This problem is addressed by allocating the memory with kvzalloc, which
circumvents this limitation.

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
 drivers/gpu/drm/vkms/vkms_composer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 9e8204be9a14..82f79e508f81 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -180,7 +180,7 @@ static int compose_active_planes(void **vaddr_out,
 	int i;
 
 	if (!*vaddr_out) {
-		*vaddr_out = kzalloc(gem_obj->size, GFP_KERNEL);
+		*vaddr_out = kvzalloc(gem_obj->size, GFP_KERNEL);
 		if (!*vaddr_out) {
 			DRM_ERROR("Cannot allocate memory for output frame.");
 			return -ENOMEM;
@@ -263,7 +263,7 @@ void vkms_composer_worker(struct work_struct *work)
 				    crtc_state);
 	if (ret) {
 		if (ret == -EINVAL && !wb_pending)
-			kfree(vaddr_out);
+			kvfree(vaddr_out);
 		return;
 	}
 
@@ -275,7 +275,7 @@ void vkms_composer_worker(struct work_struct *work)
 		crtc_state->wb_pending = false;
 		spin_unlock_irq(&out->composer_lock);
 	} else {
-		kfree(vaddr_out);
+		kvfree(vaddr_out);
 	}
 
 	/*
-- 
2.30.2



* [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 1/8] drm: vkms: Replace the deprecated drm_mode_config_init Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 2/8] drm: vkms: Alloc the compose frame using vzalloc Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-11-03 15:40   ` Thomas Zimmermann
  2021-10-26 11:34 ` [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job` Igor Torrente
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

The `map` vector at `vkms_composer` uses a hardcoded value to define its
size.

If someday the maximum number of planes increases, this hardcoded value
can be a problem.

This value is being replaced with the DRM_FORMAT_MAX_PLANES macro.

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
 drivers/gpu/drm/vkms/vkms_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index d48c23d40ce5..64e62993b06f 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -28,7 +28,7 @@ struct vkms_writeback_job {
 struct vkms_composer {
 	struct drm_framebuffer fb;
 	struct drm_rect src, dst;
-	struct dma_buf_map map[4];
+	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
 	unsigned int offset;
 	unsigned int pitch;
 	unsigned int cpp;
-- 
2.30.2



* [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job`
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (2 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-11-03 15:45   ` Thomas Zimmermann
  2021-10-26 11:34 ` [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation Igor Torrente
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

This commit is the groundwork to introduce new formats to the planes and
writeback buffer. As part of it, a new buffer metadata field is added to
`vkms_writeback_job`. This metadata is represented by the `vkms_composer`
struct.

This will allow us, in the future, to use different formats for
compositing and for the writeback buffer.

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
V2: Change the code to get the drm_framebuffer reference and not copy its
    contents(Thomas Zimmermann).
---
 drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++--
 drivers/gpu/drm/vkms/vkms_drv.h       | 12 ++++++------
 drivers/gpu/drm/vkms/vkms_plane.c     | 10 +++++-----
 drivers/gpu/drm/vkms/vkms_writeback.c | 21 ++++++++++++++++++---
 4 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 82f79e508f81..383ca657ddf7 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -153,7 +153,7 @@ static void compose_plane(struct vkms_composer *primary_composer,
 			  struct vkms_composer *plane_composer,
 			  void *vaddr_out)
 {
-	struct drm_framebuffer *fb = &plane_composer->fb;
+	struct drm_framebuffer *fb = plane_composer->fb;
 	void *vaddr;
 	void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
 
@@ -174,7 +174,7 @@ static int compose_active_planes(void **vaddr_out,
 				 struct vkms_composer *primary_composer,
 				 struct vkms_crtc_state *crtc_state)
 {
-	struct drm_framebuffer *fb = &primary_composer->fb;
+	struct drm_framebuffer *fb = primary_composer->fb;
 	struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
 	const void *vaddr;
 	int i;
diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index 64e62993b06f..9e4c1e95bbb1 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -20,13 +20,8 @@
 #define XRES_MAX  8192
 #define YRES_MAX  8192
 
-struct vkms_writeback_job {
-	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
-	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
-};
-
 struct vkms_composer {
-	struct drm_framebuffer fb;
+	struct drm_framebuffer *fb;
 	struct drm_rect src, dst;
 	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
 	unsigned int offset;
@@ -34,6 +29,11 @@ struct vkms_composer {
 	unsigned int cpp;
 };
 
+struct vkms_writeback_job {
+	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
+	struct vkms_composer composer;
+};
+
 /**
  * vkms_plane_state - Driver specific plane state
  * @base: base plane state
diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
index 32409e15244b..0a28cb7a85e2 100644
--- a/drivers/gpu/drm/vkms/vkms_plane.c
+++ b/drivers/gpu/drm/vkms/vkms_plane.c
@@ -50,12 +50,12 @@ static void vkms_plane_destroy_state(struct drm_plane *plane,
 	struct vkms_plane_state *vkms_state = to_vkms_plane_state(old_state);
 	struct drm_crtc *crtc = vkms_state->base.base.crtc;
 
-	if (crtc) {
+	if (crtc && vkms_state->composer->fb) {
 		/* dropping the reference we acquired in
 		 * vkms_primary_plane_update()
 		 */
-		if (drm_framebuffer_read_refcount(&vkms_state->composer->fb))
-			drm_framebuffer_put(&vkms_state->composer->fb);
+		if (drm_framebuffer_read_refcount(vkms_state->composer->fb))
+			drm_framebuffer_put(vkms_state->composer->fb);
 	}
 
 	kfree(vkms_state->composer);
@@ -110,9 +110,9 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
 	composer = vkms_plane_state->composer;
 	memcpy(&composer->src, &new_state->src, sizeof(struct drm_rect));
 	memcpy(&composer->dst, &new_state->dst, sizeof(struct drm_rect));
-	memcpy(&composer->fb, fb, sizeof(struct drm_framebuffer));
+	composer->fb = fb;
 	memcpy(&composer->map, &shadow_plane_state->data, sizeof(composer->map));
-	drm_framebuffer_get(&composer->fb);
+	drm_framebuffer_get(composer->fb);
 	composer->offset = fb->offsets[0];
 	composer->pitch = fb->pitches[0];
 	composer->cpp = fb->format->cpp[0];
diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 8694227f555f..32734cdbf6c2 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -75,12 +75,15 @@ static int vkms_wb_prepare_job(struct drm_writeback_connector *wb_connector,
 	if (!vkmsjob)
 		return -ENOMEM;
 
-	ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
+	ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, vkmsjob->data);
 	if (ret) {
 		DRM_ERROR("vmap failed: %d\n", ret);
 		goto err_kfree;
 	}
 
+	vkmsjob->composer.fb = job->fb;
+	drm_framebuffer_get(vkmsjob->composer.fb);
+
 	job->priv = vkmsjob;
 
 	return 0;
@@ -99,7 +102,10 @@ static void vkms_wb_cleanup_job(struct drm_writeback_connector *connector,
 	if (!job->fb)
 		return;
 
-	drm_gem_fb_vunmap(job->fb, vkmsjob->map);
+	drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);
+
+	if (drm_framebuffer_read_refcount(vkmsjob->composer.fb))
+		drm_framebuffer_put(vkmsjob->composer.fb);
 
 	vkmsdev = drm_device_to_vkms_device(job->fb->dev);
 	vkms_set_composer(&vkmsdev->output, false);
@@ -116,14 +122,23 @@ static void vkms_wb_atomic_commit(struct drm_connector *conn,
 	struct drm_writeback_connector *wb_conn = &output->wb_connector;
 	struct drm_connector_state *conn_state = wb_conn->base.state;
 	struct vkms_crtc_state *crtc_state = output->composer_state;
+	struct drm_framebuffer *fb = connector_state->writeback_job->fb;
+	struct vkms_writeback_job *active_wb;
+	struct vkms_composer *wb_composer;
 
 	if (!conn_state)
 		return;
 
 	vkms_set_composer(&vkmsdev->output, true);
 
+	active_wb = conn_state->writeback_job->priv;
+	wb_composer = &active_wb->composer;
+
 	spin_lock_irq(&output->composer_lock);
-	crtc_state->active_writeback = conn_state->writeback_job->priv;
+	crtc_state->active_writeback = active_wb;
+	wb_composer->offset = fb->offsets[0];
+	wb_composer->pitch = fb->pitches[0];
+	wb_composer->cpp = fb->format->cpp[0];
 	crtc_state->wb_pending = true;
 	spin_unlock_irq(&output->composer_lock);
 	drm_writeback_queue_job(wb_conn, connector_state);
-- 
2.30.2



* [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (3 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job` Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-10-28 21:38   ` Leandro Ribeiro
  2021-10-26 11:34 ` [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats Igor Torrente
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

Add a helper function to validate the connector configuration received
by the drivers in the encoder atomic_check.

This way, the drivers don't need to do these common validations themselves.
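
The core of the validation is a linear search of the framebuffer format
in the list of formats advertised by the writeback connector. A minimal
userspace sketch of that check follows. It is not the helper itself:
wb_format_supported() and the FOURCC() macro are hypothetical names, and
the kernel version reads the list from the connector's pixel formats
blob instead of taking it as an argument.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same packing as a DRM fourcc code: four characters, first one lowest. */
#define FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Return true if fb_format appears in the list of supported formats. */
static bool wb_format_supported(const uint32_t *formats, size_t n_formats,
				uint32_t fb_format)
{
	for (size_t i = 0; i < n_formats; i++) {
		if (formats[i] == fb_format)
			return true;
	}

	return false;
}
```

A driver-side caller would map a false result to -EINVAL, as the helper
in this patch does.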

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
V2: Move the format verification to a new helper at the drm_atomic_helper.c
    (Thomas Zimmermann).
---
 drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
 drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
 include/drm/drm_atomic_helper.h       |  3 ++
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 2c0c6ec92820..c2653b9824b5 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct drm_device *dev,
 }
 EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
 
+/**
+ * drm_atomic_helper_check_wb_encoder_state() - Check writeback encoder state
+ * @encoder: encoder state to check
+ * @conn_state: connector state to check
+ *
+ * Checks if the writeback connector state is valid, and returns an error if
+ * it isn't.
+ *
+ * RETURNS:
+ * Zero for success or -errno
+ */
+int
+drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
+					 struct drm_connector_state *conn_state)
+{
+	struct drm_writeback_job *wb_job = conn_state->writeback_job;
+	struct drm_property_blob *pixel_format_blob;
+	bool format_supported = false;
+	struct drm_framebuffer *fb;
+	int i, n_formats;
+	u32 *formats;
+
+	if (!wb_job || !wb_job->fb)
+		return 0;
+
+	pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
+	n_formats = pixel_format_blob->length / sizeof(u32);
+	formats = pixel_format_blob->data;
+	fb = wb_job->fb;
+
+	for (i = 0; i < n_formats; i++) {
+		if (fb->format->format == formats[i]) {
+			format_supported = true;
+			break;
+		}
+	}
+
+	if (!format_supported) {
+		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
+			      &fb->format->format);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
+
 /**
  * drm_atomic_helper_check_plane_state() - Check plane state for validity
  * @plane_state: plane state to check
diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 32734cdbf6c2..42f3396c523a 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
 {
 	struct drm_framebuffer *fb;
 	const struct drm_display_mode *mode = &crtc_state->mode;
+	int ret;
 
 	if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
 		return 0;
@@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
 		return -EINVAL;
 	}
 
-	if (fb->format->format != vkms_wb_formats[0]) {
-		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
-			      &fb->format->format);
-		return -EINVAL;
-	}
+	ret = drm_atomic_helper_check_wb_encoder_state(encoder, conn_state);
+	if (ret < 0)
+		return ret;
 
 	return 0;
 }
diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h
index 4045e2507e11..3fbf695da60f 100644
--- a/include/drm/drm_atomic_helper.h
+++ b/include/drm/drm_atomic_helper.h
@@ -40,6 +40,9 @@ struct drm_private_state;
 
 int drm_atomic_helper_check_modeset(struct drm_device *dev,
 				struct drm_atomic_state *state);
+int
+drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
+					 struct drm_connector_state *conn_state);
 int drm_atomic_helper_check_plane_state(struct drm_plane_state *plane_state,
 					const struct drm_crtc_state *crtc_state,
 					int min_scale,
-- 
2.30.2



* [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (4 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-11-09 11:40   ` Pekka Paalanen
  2021-10-26 11:34 ` [PATCH v2 7/8] drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats Igor Torrente
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel, kernel test robot

Currently the blend function only accepts XRGB_8888 and ARGB_8888
as a color input.

This patch refactors all the functions related to the plane composition
to overcome this limitation.

Now the blend function receives a struct `vkms_pixel_composition_functions`
containing two handlers.

One will generate a buffer of each line of the frame with the pixels
converted to ARGB16161616. And the other will take this line buffer,
do some computation on it, and store the pixels in the destination.

Both handlers have the same signature. They receive a pointer to
the pixels that will be processed (`pixels_addr`), the number of pixels
that will be treated (`length`), and the intermediate buffer of the size
of a frame line (`line_buffer`).

The first handler is fully described above.

The second is more interesting, as it has to perform two roles depending
on where it is called in the code.

The first is to convert (if necessary) the data received in the
`line_buffer` and write it to the memory pointed to by `pixels_addr`.

The second role is to perform the `alpha_blend`. So, it takes the pixels
in the `line_buffer` and `pixels_addr`, executes the blend, and stores
the result back to the `pixels_addr`.
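
The per-channel arithmetic of that blend can be sketched in userspace as
follows. This is an illustration, not the code of this patch: it assumes
a premultiplied source (src <= alpha) and an opaque destination, and it
widens the operands to 64 bits before multiplying so the intermediate
value cannot overflow.

```c
#include <stdint.h>

/*
 * Blend one 16-bit channel with the premultiplied formula
 *   out = src + dst * (0xffff - alpha) / 0xffff
 * The division rounds up. The result fits in 16 bits as long as the
 * source channel is premultiplied (src <= alpha).
 */
static uint16_t blend_channel(uint16_t src, uint16_t dst, uint16_t alpha)
{
	uint64_t pre_blend = (uint64_t)src * 0xffff +
			     (uint64_t)dst * (0xffff - alpha);

	return (uint16_t)((pre_blend + 0xffff - 1) / 0xffff);
}
```

A fully opaque source replaces the destination, and a fully transparent
source leaves it unchanged.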

The per-line implementation was chosen for performance reasons.
The per-pixel functions were having performance issues due to indirect
function call overhead.

The per-line code trades memory for execution time. The `line_buffer`
allows us to reduce the number of function calls.
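
To make the per-line idea concrete, here is a small userspace sketch
under stated assumptions. It is not the kernel code: struct line_funcs,
argb8888_get_line(), argb16161616_set_line() and compose_lines() are
hypothetical names, the source is little-endian ARGB8888, and the output
is written in host byte order. The point is that the two indirect calls
happen once per line, not once per pixel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the two per-line handlers. */
struct line_funcs {
	void (*get_src_line)(const void *pixels_addr, int length,
			     uint64_t *line_buffer);
	void (*set_output_line)(void *pixels_addr, int length,
				const uint64_t *line_buffer);
};

/* Read one ARGB8888 line (bytes B, G, R, A) and widen to 16 bits/channel. */
static void argb8888_get_line(const void *pixels_addr, int length,
			      uint64_t *line_buffer)
{
	const uint8_t *p = pixels_addr;

	for (int i = 0; i < length; i++, p += 4) {
		line_buffer[i] = ((uint64_t)(p[3] * 257u) << 48) |
				 ((uint64_t)(p[2] * 257u) << 32) |
				 ((uint64_t)(p[1] * 257u) << 16) |
				 (uint64_t)(p[0] * 257u);
	}
}

/* The output already uses the line buffer format, so just copy the line. */
static void argb16161616_set_line(void *pixels_addr, int length,
				  const uint64_t *line_buffer)
{
	memcpy(pixels_addr, line_buffer, (size_t)length * sizeof(uint64_t));
}

/* Two indirect calls per line, regardless of the line width. */
static void compose_lines(const struct line_funcs *funcs,
			  const uint8_t *src, size_t src_pitch,
			  uint8_t *dst, size_t dst_pitch,
			  int width, int height, uint64_t *line_buffer)
{
	for (int y = 0; y < height; y++) {
		funcs->get_src_line(src, width, line_buffer);
		funcs->set_output_line(dst, width, line_buffer);
		src += src_pitch;
		dst += dst_pitch;
	}
}
```

For a frame of width W and height H this makes 2 * H indirect calls,
where a per-pixel design would make on the order of W * H.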

Results in the IGT test `kms_cursor_crc`:

|                     Frametime                       |
|:---------------:|:---------:|:----------:|:--------:|
| implementation  |  Current  |  Per-pixel | Per-line |
| frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
|     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
V2: Improves the performance drastically, by performing the operations
    per-line and not per-pixel (Pekka Paalanen).
    Minor improvements (Pekka Paalanen).
---
 drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
 drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
 2 files changed, 342 insertions(+), 134 deletions(-)
 create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 383ca657ddf7..69fe3a89bdc9 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -9,18 +9,26 @@
 #include <drm/drm_vblank.h>
 
 #include "vkms_drv.h"
-
-static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
-				 const struct vkms_composer *composer)
-{
-	u32 pixel;
-	int src_offset = composer->offset + (y * composer->pitch)
-				      + (x * composer->cpp);
-
-	pixel = *(u32 *)&buffer[src_offset];
-
-	return pixel;
-}
+#include "vkms_formats.h"
+
+#define get_output_vkms_composer(buffer_pointer, composer)		\
+	((struct vkms_composer) {					\
+		.fb = &(struct drm_framebuffer) {			\
+			.format = &(struct drm_format_info) {		\
+				.format = DRM_FORMAT_ARGB16161616,	\
+			},						\
+		},							\
+		.map[0].vaddr = (buffer_pointer),			\
+		.src = (composer)->src,					\
+		.dst = (composer)->dst,					\
+		.cpp = sizeof(u64),					\
+		.pitch = drm_rect_width(&(composer)->dst) * sizeof(u64)	\
+	})
+
+struct vkms_pixel_composition_functions {
+	void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
+	void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);
+};
 
 /**
  * compute_crc - Compute CRC value on output frame
@@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
  * returns CRC value computed using crc32 on the visible portion of
  * the final framebuffer at vaddr_out
  */
-static uint32_t compute_crc(const u8 *vaddr,
+static uint32_t compute_crc(const __le64 *vaddr,
 			    const struct vkms_composer *composer)
 {
-	int x, y;
-	u32 crc = 0, pixel = 0;
-	int x_src = composer->src.x1 >> 16;
-	int y_src = composer->src.y1 >> 16;
-	int h_src = drm_rect_height(&composer->src) >> 16;
-	int w_src = drm_rect_width(&composer->src) >> 16;
-
-	for (y = y_src; y < y_src + h_src; ++y) {
-		for (x = x_src; x < x_src + w_src; ++x) {
-			pixel = get_pixel_from_buffer(x, y, vaddr, composer);
-			crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
-		}
-	}
+	int h = drm_rect_height(&composer->dst);
+	int w = drm_rect_width(&composer->dst);
 
-	return crc;
+	return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
 }
 
-static u8 blend_channel(u8 src, u8 dst, u8 alpha)
+static __le16 blend_channel(u16 src, u16 dst, u16 alpha)
 {
-	u32 pre_blend;
-	u8 new_color;
+	u64 pre_blend;
+	u16 new_color;
 
-	pre_blend = (src * 255 + dst * (255 - alpha));
+	pre_blend = (u64)src * 0xffff + (u64)dst * (0xffff - alpha);
 
-	/* Faster div by 255 */
-	new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
+	new_color = DIV_ROUND_UP(pre_blend, 0xffff);
 
-	return new_color;
+	return cpu_to_le16(new_color);
 }
 
 /**
  * alpha_blend - alpha blending equation
- * @argb_src: src pixel on premultiplied alpha mode
- * @argb_dst: dst pixel completely opaque
+ * @src_composer: source framebuffer's metadata
+ * @dst_composer: destination framebuffer's metadata
+ * @y: The y coordinate (height) of the line that will be processed
+ * @line_buffer: The line with the pixels from src_compositor
  *
  * blend pixels using premultiplied blend formula. The current DRM assumption
  * is that pixel color values have been already pre-multiplied with the alpha
  * channel values. See more drm_plane_create_blend_mode_property(). Also, this
  * formula assumes a completely opaque background.
+ *
+ * For performance reasons this function also fetches the pixels from the
+ * destination of the frame line y.
+ * We use the information that one of the source pixels are in the output
+ * buffer to fetch it here instead of separate function. And because the
+ * output format is ARGB16161616, we know that they don't need to be
+ * converted.
+ * This saves us an indirect function call for each line.
  */
-static void alpha_blend(const u8 *argb_src, u8 *argb_dst)
+static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer)
 {
-	u8 alpha;
+	__le16 *output_pixel = pixels_addr;
+	int i;
 
-	alpha = argb_src[3];
-	argb_dst[0] = blend_channel(argb_src[0], argb_dst[0], alpha);
-	argb_dst[1] = blend_channel(argb_src[1], argb_dst[1], alpha);
-	argb_dst[2] = blend_channel(argb_src[2], argb_dst[2], alpha);
-}
+	for (i = 0; i < length; i++) {
+		u16 src1_a = line_buffer[i] >> 48;
+		u16 src1_r = (line_buffer[i] >> 32) & 0xffff;
+		u16 src1_g = (line_buffer[i] >> 16) & 0xffff;
+		u16 src1_b = line_buffer[i] & 0xffff;
 
-/**
- * x_blend - blending equation that ignores the pixel alpha
- *
- * overwrites RGB color value from src pixel to dst pixel.
- */
-static void x_blend(const u8 *xrgb_src, u8 *xrgb_dst)
-{
-	memcpy(xrgb_dst, xrgb_src, sizeof(u8) * 3);
+		u16 src2_r = le16_to_cpu(output_pixel[2]);
+		u16 src2_g = le16_to_cpu(output_pixel[1]);
+		u16 src2_b = le16_to_cpu(output_pixel[0]);
+
+		output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
+		output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
+		output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
+		output_pixel[3] = 0xffff;
+
+		output_pixel += 4;
+	}
 }
 
 /**
- * blend - blend value at vaddr_src with value at vaddr_dst
- * @vaddr_dst: destination address
- * @vaddr_src: source address
- * @dst_composer: destination framebuffer's metadata
  * @src_composer: source framebuffer's metadata
- * @pixel_blend: blending equation based on plane format
+ * @dst_composer: destination framebuffer's metadata
+ * @funcs: A struct containing all the composition functions(get_src_line,
+ *         and set_output_line)
+ * @line_buffer: The line with the pixels from src_compositor
  *
- * Blend the vaddr_src value with the vaddr_dst value using a pixel blend
- * equation according to the supported plane formats DRM_FORMAT_(A/XRGB8888)
- * and clearing alpha channel to an completely opaque background. This function
- * uses buffer's metadata to locate the new composite values at vaddr_dst.
+ * Using the pixel_blend function passed as parameter, this function blends
+ * all pixels from src plane into an output buffer (with a blend function
+ * passed as parameter).
+ * Information of the output buffer is in the dst_composer parameter
+ * and the source plane in the src_composer.
+ * The get_src_line will use the src_composer to get the respective line,
+ * convert, and return it as ARGB_16161616.
+ * And finally, the blend function will receive the src_composer, dst_composer,
+ * the line y coordinate, and the line buffer. Blend all pixels, and store the
+ * result in the output.
  *
  * TODO: completely clear the primary plane (a = 0xff) before starting to blend
  * pixel color values
  */
-static void blend(void *vaddr_dst, void *vaddr_src,
+static void blend(struct vkms_composer *src_composer,
 		  struct vkms_composer *dst_composer,
-		  struct vkms_composer *src_composer,
-		  void (*pixel_blend)(const u8 *, u8 *))
+		  struct vkms_pixel_composition_functions *funcs,
+		  u64 *line_buffer)
 {
-	int i, j, j_dst, i_dst;
-	int offset_src, offset_dst;
-	u8 *pixel_dst, *pixel_src;
+	int i, i_dst;
 
 	int x_src = src_composer->src.x1 >> 16;
 	int y_src = src_composer->src.y1 >> 16;
 
 	int x_dst = src_composer->dst.x1;
 	int y_dst = src_composer->dst.y1;
+
 	int h_dst = drm_rect_height(&src_composer->dst);
-	int w_dst = drm_rect_width(&src_composer->dst);
+	int length = drm_rect_width(&src_composer->dst);
 
 	int y_limit = y_src + h_dst;
-	int x_limit = x_src + w_dst;
-
-	for (i = y_src, i_dst = y_dst; i < y_limit; ++i) {
-		for (j = x_src, j_dst = x_dst; j < x_limit; ++j) {
-			offset_dst = dst_composer->offset
-				     + (i_dst * dst_composer->pitch)
-				     + (j_dst++ * dst_composer->cpp);
-			offset_src = src_composer->offset
-				     + (i * src_composer->pitch)
-				     + (j * src_composer->cpp);
-
-			pixel_src = (u8 *)(vaddr_src + offset_src);
-			pixel_dst = (u8 *)(vaddr_dst + offset_dst);
-			pixel_blend(pixel_src, pixel_dst);
-			/* clearing alpha channel (0xff)*/
-			pixel_dst[3] = 0xff;
-		}
-		i_dst++;
+
+	u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src);
+	u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst);
+
+	int src_next_line_offset = src_composer->pitch;
+	int dst_next_line_offset = dst_composer->pitch;
+
+	for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
+		funcs->get_src_line(src_pixels, length, line_buffer);
+		funcs->set_output_line(dst_pixels, length, line_buffer);
+		src_pixels += src_next_line_offset;
+		dst_pixels += dst_next_line_offset;
 	}
 }
 
-static void compose_plane(struct vkms_composer *primary_composer,
-			  struct vkms_composer *plane_composer,
-			  void *vaddr_out)
+static void ((*get_line_fmt_transform_function(u32 format))
+	    (void *pixels_addr, int length, u64 *line_buffer))
 {
-	struct drm_framebuffer *fb = plane_composer->fb;
-	void *vaddr;
-	void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
+	if (format == DRM_FORMAT_ARGB8888)
+		return &ARGB8888_to_ARGB16161616;
+	else if (format == DRM_FORMAT_ARGB16161616)
+		return &get_ARGB16161616;
+	else
+		return &XRGB8888_to_ARGB16161616;
+}
 
-	if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
-		return;
+static void ((*get_output_line_function(u32 format))
+	     (void *pixels_addr, int length, u64 *line_buffer))
+{
+	if (format == DRM_FORMAT_ARGB8888)
+		return &convert_to_ARGB8888;
+	else if (format == DRM_FORMAT_ARGB16161616)
+		return &convert_to_ARGB16161616;
+	else
+		return &convert_to_XRGB8888;
+}
 
-	vaddr = plane_composer->map[0].vaddr;
+static void compose_plane(struct vkms_composer *src_composer,
+			  struct vkms_composer *dst_composer,
+			  struct vkms_pixel_composition_functions *funcs,
+			  u64 *line_buffer)
+{
+	u32 src_format = src_composer->fb->format->format;
 
-	if (fb->format->format == DRM_FORMAT_ARGB8888)
-		pixel_blend = &alpha_blend;
-	else
-		pixel_blend = &x_blend;
+	funcs->get_src_line = get_line_fmt_transform_function(src_format);
 
-	blend(vaddr_out, vaddr, primary_composer, plane_composer, pixel_blend);
+	blend(src_composer, dst_composer, funcs, line_buffer);
 }
 
-static int compose_active_planes(void **vaddr_out,
-				 struct vkms_composer *primary_composer,
-				 struct vkms_crtc_state *crtc_state)
+static __le64 *compose_active_planes(struct vkms_composer *primary_composer,
+				     struct vkms_crtc_state *crtc_state,
+				     u64 *line_buffer)
 {
-	struct drm_framebuffer *fb = primary_composer->fb;
-	struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
-	const void *vaddr;
+	struct vkms_plane_state **active_planes = crtc_state->active_planes;
+	int h = drm_rect_height(&primary_composer->dst);
+	int w = drm_rect_width(&primary_composer->dst);
+	struct vkms_pixel_composition_functions funcs;
+	struct vkms_composer dst_composer;
+	__le64 *vaddr_out;
 	int i;
 
-	if (!*vaddr_out) {
-		*vaddr_out = kvzalloc(gem_obj->size, GFP_KERNEL);
-		if (!*vaddr_out) {
-			DRM_ERROR("Cannot allocate memory for output frame.");
-			return -ENOMEM;
-		}
-	}
-
 	if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
-		return -EINVAL;
+		return NULL;
 
-	vaddr = primary_composer->map[0].vaddr;
+	vaddr_out = kvzalloc(w * h * sizeof(__le64), GFP_KERNEL);
+	if (!vaddr_out) {
+		DRM_ERROR("Cannot allocate memory for output frame.");
+		return NULL;
+	}
 
-	memcpy(*vaddr_out, vaddr, gem_obj->size);
+	dst_composer = get_output_vkms_composer(vaddr_out, primary_composer);
+	funcs.set_output_line = get_output_line_function(DRM_FORMAT_ARGB16161616);
+	compose_plane(active_planes[0]->composer, &dst_composer,
+		      &funcs, line_buffer);
 
 	/* If there are other planes besides primary, we consider the active
 	 * planes should be in z-order and compose them associatively:
 	 * ((primary <- overlay) <- cursor)
 	 */
+	funcs.set_output_line = alpha_blend;
 	for (i = 1; i < crtc_state->num_active_planes; i++)
-		compose_plane(primary_composer,
-			      crtc_state->active_planes[i]->composer,
-			      *vaddr_out);
+		compose_plane(active_planes[i]->composer, &dst_composer,
+			      &funcs, line_buffer);
 
-	return 0;
+	return vaddr_out;
+}
+
+static void write_wb_buffer(struct vkms_writeback_job *active_wb,
+			    struct vkms_composer *primary_composer,
+			    __le64 *vaddr_out, u64 *line_buffer)
+{
+	u32 dst_fb_format = active_wb->composer.fb->format->format;
+	struct vkms_pixel_composition_functions funcs;
+	struct vkms_composer src_composer;
+
+	src_composer = get_output_vkms_composer(vaddr_out, primary_composer);
+	funcs.set_output_line = get_output_line_function(dst_fb_format);
+	active_wb->composer.src = primary_composer->src;
+	active_wb->composer.dst = primary_composer->dst;
+
+	compose_plane(&src_composer, &active_wb->composer, &funcs, line_buffer);
+}
+
+u64 *alloc_line_buffer(struct vkms_composer *primary_composer)
+{
+	int line_width = drm_rect_width(&primary_composer->dst);
+	u64 *line_buffer;
+
+	line_buffer = kvmalloc(line_width * sizeof(u64), GFP_KERNEL);
+	if (!line_buffer)
+		DRM_ERROR("Cannot allocate memory for intermediate line buffer");
+
+	return line_buffer;
 }
 
 /**
@@ -221,14 +272,14 @@ void vkms_composer_worker(struct work_struct *work)
 						struct vkms_crtc_state,
 						composer_work);
 	struct drm_crtc *crtc = crtc_state->base.crtc;
+	struct vkms_writeback_job *active_wb = crtc_state->active_writeback;
 	struct vkms_output *out = drm_crtc_to_vkms_output(crtc);
 	struct vkms_composer *primary_composer = NULL;
 	struct vkms_plane_state *act_plane = NULL;
+	u64 frame_start, frame_end, *line_buffer;
 	bool crc_pending, wb_pending;
-	void *vaddr_out = NULL;
+	__le64 *vaddr_out = NULL;
 	u32 crc32 = 0;
-	u64 frame_start, frame_end;
-	int ret;
 
 	spin_lock_irq(&out->composer_lock);
 	frame_start = crtc_state->frame_start;
@@ -256,28 +307,30 @@ void vkms_composer_worker(struct work_struct *work)
 	if (!primary_composer)
 		return;
 
-	if (wb_pending)
-		vaddr_out = crtc_state->active_writeback->data[0].vaddr;
+	line_buffer = alloc_line_buffer(primary_composer);
+	if (!line_buffer)
+		return;
 
-	ret = compose_active_planes(&vaddr_out, primary_composer,
-				    crtc_state);
-	if (ret) {
-		if (ret == -EINVAL && !wb_pending)
-			kvfree(vaddr_out);
+	vaddr_out = compose_active_planes(primary_composer, crtc_state,
+					  line_buffer);
+	if (!vaddr_out) {
+		kvfree(line_buffer);
 		return;
 	}
 
-	crc32 = compute_crc(vaddr_out, primary_composer);
-
 	if (wb_pending) {
+		write_wb_buffer(active_wb, primary_composer,
+				vaddr_out, line_buffer);
 		drm_writeback_signal_completion(&out->wb_connector, 0);
 		spin_lock_irq(&out->composer_lock);
 		crtc_state->wb_pending = false;
 		spin_unlock_irq(&out->composer_lock);
-	} else {
-		kvfree(vaddr_out);
 	}
 
+	kvfree(line_buffer);
+	crc32 = compute_crc(vaddr_out, primary_composer);
+	kvfree(vaddr_out);
+
 	/*
 	 * The worker can fall behind the vblank hrtimer, make sure we catch up.
 	 */
diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
new file mode 100644
index 000000000000..5b850fce69f3
--- /dev/null
+++ b/drivers/gpu/drm/vkms/vkms_formats.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef _VKMS_FORMATS_H_
+#define _VKMS_FORMATS_H_
+
+#include <drm/drm_rect.h>
+
+#define pixel_offset(composer, x, y) \
+	((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
+
+/*
+ * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
+ *
+ * @composer: Buffer metadata
+ * @x: The x (width) coordinate of the 2D buffer
+ * @y: The y (height) coordinate of the 2D buffer
+ *
+ * Takes the information stored in the composer, a pair of coordinates, and
+ * returns the address of the first color channel.
+ * This function assumes the channels are packed together, i.e. a color channel
+ * comes immediately after another. And therefore, this function doesn't work
+ * for YUV with chroma subsampling (e.g. YUV420 and NV21).
+ */
+static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
+{
+	int offset = pixel_offset(composer, x, y);
+
+	return (u8 *)composer->map[0].vaddr + offset;
+}
+
+static void ARGB8888_to_ARGB16161616(void *pixels_addr, int length,
+				     u64 *line_buffer)
+{
+	u8 *src_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		/*
+		 * Organizes the channels in their respective positions and converts
+		 * each 8-bit channel to 16 bits.
+		 * The 257 is the "conversion ratio". This number is obtained by the
+		 * (2^16 - 1) / (2^8 - 1) division, which, in this case, tries to get
+		 * the best color value in a pixel format with more possibilities.
+		 * A similar idea applies to other RGB color conversions.
+		 */
+		line_buffer[i] = ((u64)src_pixels[3] * 257) << 48 |
+				 ((u64)src_pixels[2] * 257) << 32 |
+				 ((u64)src_pixels[1] * 257) << 16 |
+				 ((u64)src_pixels[0] * 257);
+
+		src_pixels += 4;
+	}
+}
+
+static void XRGB8888_to_ARGB16161616(void *pixels_addr, int length,
+				     u64 *line_buffer)
+{
+	u8 *src_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		/*
+		 * The same as the ARGB8888 but with the alpha channel set to the
+		 * maximum possible value.
+		 */
+		line_buffer[i] = 0xffffllu << 48 |
+				 ((u64)src_pixels[2] * 257) << 32 |
+				 ((u64)src_pixels[1] * 257) << 16 |
+				 ((u64)src_pixels[0] * 257);
+
+		src_pixels += 4;
+	}
+}
+
+static void get_ARGB16161616(void *pixels_addr, int length, u64 *line_buffer)
+{
+	__le64 *src_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		/*
+		 * Because the format byte order is little-endian and this code
+		 * needs to run on big-endian machines too, we need to convert
+		 * the byte order from little-endian to the CPU native byte order.
+		 */
+		line_buffer[i] = le64_to_cpu(*src_pixels);
+
+		src_pixels++;
+	}
+}
+
+/*
+ * The following functions are used as blend operations. But unlike the
+ * `alpha_blend`, these functions take an ARGB16161616 pixel from the
+ * source, convert it to a specific format, and store it in the destination.
+ *
+ * They are used in the `compose_active_planes` and `write_wb_buffer` to
+ * copy and convert one line of the frame from/to the output buffer to/from
+ * another buffer (e.g. writeback buffer, primary plane buffer).
+ */
+
+static void convert_to_ARGB8888(void *pixels_addr, int length, u64 *line_buffer)
+{
+	u8 *dst_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		/*
+		 * This sequence below is important because the format's byte order is
+		 * in little-endian. In the case of the ARGB8888 the memory is
+		 * organized this way:
+		 *
+		 * | Addr     | = blue channel
+		 * | Addr + 1 | = green channel
+		 * | Addr + 2 | = Red channel
+		 * | Addr + 3 | = Alpha channel
+		 */
+		dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
+		dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
+		dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
+		dst_pixels[3] = DIV_ROUND_UP(line_buffer[i] >> 48, 257);
+
+		dst_pixels += 4;
+	}
+}
+
+static void convert_to_XRGB8888(void *pixels_addr, int length, u64 *line_buffer)
+{
+	u8 *dst_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
+		dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
+		dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
+		dst_pixels[3] = 0xff;
+
+		dst_pixels += 4;
+	}
+}
+
+static void convert_to_ARGB16161616(void *pixels_addr, int length,
+				    u64 *line_buffer)
+{
+	__le64 *dst_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+
+		*dst_pixels = cpu_to_le64(line_buffer[i]);
+		dst_pixels++;
+	}
+}
+
+#endif /* _VKMS_FORMATS_H_ */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 7/8] drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (5 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 8/8] drm: vkms: Add support the RGB565 format Igor Torrente
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

This will be useful for writing tests that depend on these formats.

The ARGB format is already used as the universal format for internal use.
Here we are just exposing it to user space.

XRGB follows an implementation similar to that of the former format,
just overwriting the alpha channel.

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
 drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++++
 drivers/gpu/drm/vkms/vkms_formats.h   | 25 +++++++++++++++++++++++++
 drivers/gpu/drm/vkms/vkms_plane.c     |  5 ++++-
 drivers/gpu/drm/vkms/vkms_writeback.c |  2 ++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 69fe3a89bdc9..f16fcfc88cea 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -164,6 +164,8 @@ static void ((*get_line_fmt_transform_function(u32 format))
 		return &ARGB8888_to_ARGB16161616;
 	else if (format == DRM_FORMAT_ARGB16161616)
 		return &get_ARGB16161616;
+	else if (format == DRM_FORMAT_XRGB16161616)
+		return &XRGB16161616_to_ARGB16161616;
 	else
 		return &XRGB8888_to_ARGB16161616;
 }
@@ -175,6 +177,8 @@ static void ((*get_output_line_function(u32 format))
 		return &convert_to_ARGB8888;
 	else if (format == DRM_FORMAT_ARGB16161616)
 		return &convert_to_ARGB16161616;
+	else if (format == DRM_FORMAT_XRGB16161616)
+		return &convert_to_XRGB16161616;
 	else
 		return &convert_to_XRGB8888;
 }
diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
index 5b850fce69f3..aa433edd00bd 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.h
+++ b/drivers/gpu/drm/vkms/vkms_formats.h
@@ -89,6 +89,19 @@ static void get_ARGB16161616(void *pixels_addr, int length, u64 *line_buffer)
 	}
 }
 
+static void XRGB16161616_to_ARGB16161616(void *pixels_addr, int length,
+					 u64 *line_buffer)
+{
+	__le64 *src_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		line_buffer[i] = le64_to_cpu(*src_pixels) | (0xffffllu << 48);
+
+		src_pixels++;
+	}
+}
+
 /*
  * The following functions are used as blend operations. But unlike the
  * `alpha_blend`, these functions take an ARGB16161616 pixel from the
@@ -152,4 +165,16 @@ static void convert_to_ARGB16161616(void *pixels_addr, int length,
 	}
 }
 
+static void convert_to_XRGB16161616(void *pixels_addr, int length,
+				    u64 *line_buffer)
+{
+	__le64 *dst_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		*dst_pixels = cpu_to_le64(line_buffer[i] | (0xffffllu << 48));
+		dst_pixels++;
+	}
+}
+
 #endif /* _VKMS_FORMATS_H_ */
diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
index 0a28cb7a85e2..516e48b38806 100644
--- a/drivers/gpu/drm/vkms/vkms_plane.c
+++ b/drivers/gpu/drm/vkms/vkms_plane.c
@@ -13,11 +13,14 @@
 
 static const u32 vkms_formats[] = {
 	DRM_FORMAT_XRGB8888,
+	DRM_FORMAT_XRGB16161616
 };
 
 static const u32 vkms_plane_formats[] = {
 	DRM_FORMAT_ARGB8888,
-	DRM_FORMAT_XRGB8888
+	DRM_FORMAT_XRGB8888,
+	DRM_FORMAT_XRGB16161616,
+	DRM_FORMAT_ARGB16161616
 };
 
 static struct drm_plane_state *
diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 42f3396c523a..0f7bb77f981e 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -14,6 +14,8 @@
 
 static const u32 vkms_wb_formats[] = {
 	DRM_FORMAT_XRGB8888,
+	DRM_FORMAT_XRGB16161616,
+	DRM_FORMAT_ARGB16161616
 };
 
 static const struct drm_connector_funcs vkms_wb_connector_funcs = {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 8/8] drm: vkms: Add support the RGB565 format
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (6 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 7/8] drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats Igor Torrente
@ 2021-10-26 11:34 ` Igor Torrente
  2021-10-26 11:34 ` [PATCH v2 8/8] drm: vkms: Add support to " Igor Torrente
  2021-11-09  9:32 ` [PATCH v2 0/8] Add new formats support to vkms Pekka Paalanen
  9 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-10-26 11:34 UTC (permalink / raw)
  To: rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: Igor Torrente, hamohammed.sa, daniel, airlied, contact,
	leandro.ribeiro, dri-devel

Adds this common format to vkms.

This commit also adds new helper macros to deal with fixed-point
arithmetic.

It was done to improve the precision of the conversion to ARGB16161616
since the "conversion ratio" is not an integer.
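
As a rough userspace sketch of that arithmetic: the macros below mirror the
ones this patch adds, with stdint types instead of the kernel ones. The helper
names expand_5_to_16, expand_6_to_16 and reduce_16_to_5 are hypothetical and
exist only for illustration; they are not part of the patch.

```c
#include <stdint.h>

#define FP_SCALE 15	/* 15 fractional bits, as in the patch */

#define LF_TO_FP(a) ((a) * (uint64_t)(1 << FP_SCALE))
#define INT_TO_FP(a) ((a) << FP_SCALE)
#define FP_MUL(a, b) ((int32_t)(((int64_t)(a) * (b)) >> FP_SCALE))
#define FP_DIV(a, b) ((int32_t)(((int64_t)(a) << FP_SCALE) / (b)))
/* Fixed point to int, rounding half up. */
#define FP_TO_INT_ROUND_UP(a) (((a) + (1 << (FP_SCALE - 1))) >> FP_SCALE)

/* Expand a 5-bit channel (0..31) to 16 bits; the ratio is 65535 / 31. */
static uint16_t expand_5_to_16(uint16_t c5)
{
	int32_t fp_c = INT_TO_FP((int32_t)c5);
	int32_t fp_ratio = (int32_t)LF_TO_FP(2114.032258065);

	return (uint16_t)FP_TO_INT_ROUND_UP(FP_MUL(fp_c, fp_ratio));
}

/* Expand a 6-bit channel (0..63) to 16 bits; the ratio is 65535 / 63. */
static uint16_t expand_6_to_16(uint16_t c6)
{
	int32_t fp_c = INT_TO_FP((int32_t)c6);
	int32_t fp_ratio = (int32_t)LF_TO_FP(1040.238095238);

	return (uint16_t)FP_TO_INT_ROUND_UP(FP_MUL(fp_c, fp_ratio));
}

/* Reduce a 16-bit channel back to 5 bits by dividing by the same ratio. */
static uint16_t reduce_16_to_5(uint16_t c16)
{
	int32_t fp_c = INT_TO_FP((int32_t)c16);
	int32_t fp_ratio = (int32_t)LF_TO_FP(2114.032258065);

	return (uint16_t)FP_TO_INT_ROUND_UP(FP_DIV(fp_c, fp_ratio));
}
```

In this sketch both endpoints map exactly: 0 stays 0, and the maximum 5-bit
or 6-bit value becomes 65535.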

Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
---
 drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++
 drivers/gpu/drm/vkms/vkms_formats.h   | 72 +++++++++++++++++++++++++++
 drivers/gpu/drm/vkms/vkms_plane.c     |  6 ++-
 drivers/gpu/drm/vkms/vkms_writeback.c |  3 +-
 4 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index f16fcfc88cea..57ec82839a89 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -166,6 +166,8 @@ static void ((*get_line_fmt_transform_function(u32 format))
 		return &get_ARGB16161616;
 	else if (format == DRM_FORMAT_XRGB16161616)
 		return &XRGB16161616_to_ARGB16161616;
+	else if (format == DRM_FORMAT_RGB565)
+		return &RGB565_to_ARGB16161616;
 	else
 		return &XRGB8888_to_ARGB16161616;
 }
@@ -179,6 +181,8 @@ static void ((*get_output_line_function(u32 format))
 		return &convert_to_ARGB16161616;
 	else if (format == DRM_FORMAT_XRGB16161616)
 		return &convert_to_XRGB16161616;
+	else if (format == DRM_FORMAT_RGB565)
+		return &convert_to_RGB565;
 	else
 		return &convert_to_XRGB8888;
 }
diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
index aa433edd00bd..1e2db1a844aa 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.h
+++ b/drivers/gpu/drm/vkms/vkms_formats.h
@@ -8,6 +8,26 @@
 #define pixel_offset(composer, x, y) \
 	((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
 
+/*
+ * FP stands for _Fixed Point_ and **not** _Floating Point_
+ * LF stands for Long Float (i.e. double)
+ * The following macros help doing fixed point arithmetic.
+ */
+/*
+ * With FP scale 15 we have 17 and 15 bits of integer and fractional parts
+ * respectively.
+ *  | 0000 0000 0000 0000 0.000 0000 0000 0000 |
+ * 31                                          0
+ */
+#define FP_SCALE 15
+
+#define LF_TO_FP(a) ((a) * (u64)(1 << FP_SCALE))
+#define INT_TO_FP(a) ((a) << FP_SCALE)
+#define FP_MUL(a, b) ((s32)(((s64)(a) * (b)) >> FP_SCALE))
+#define FP_DIV(a, b) ((s32)(((s64)(a) << FP_SCALE) / (b)))
+/* This macro converts a fixed point number to int, rounding half up */
+#define FP_TO_INT_ROUND_UP(a) (((a) + (1 << (FP_SCALE - 1))) >> FP_SCALE)
+
 /*
  * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
  *
@@ -102,6 +122,35 @@ static void XRGB16161616_to_ARGB16161616(void *pixels_addr, int length,
 	}
 }
 
+static void RGB565_to_ARGB16161616(void *pixels_addr, int length,
+				   u64 *line_buffer)
+{
+	__le16 *src_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		u16 rgb_565 = le16_to_cpu(*src_pixels);
+		int fp_r = INT_TO_FP((rgb_565 >> 11) & 0x1f);
+		int fp_g = INT_TO_FP((rgb_565 >> 5) & 0x3f);
+		int fp_b = INT_TO_FP(rgb_565 & 0x1f);
+
+		/*
+		 * The magic constants are the "conversion ratios", calculated by
+		 * dividing 65535 (2^16 - 1) by 31 (2^5 - 1) and 63 (2^6 - 1) respectively.
+		 */
+		int fp_rb_ratio = LF_TO_FP(2114.032258065);
+		int fp_g_ratio = LF_TO_FP(1040.238095238);
+
+		u64 r = FP_TO_INT_ROUND_UP(FP_MUL(fp_r, fp_rb_ratio));
+		u64 g = FP_TO_INT_ROUND_UP(FP_MUL(fp_g, fp_g_ratio));
+		u64 b = FP_TO_INT_ROUND_UP(FP_MUL(fp_b, fp_rb_ratio));
+
+		line_buffer[i] = 0xffffllu << 48 | r << 32 | g << 16 | b;
+
+		src_pixels++;
+	}
+}
+
 /*
  * The following functions are used as blend operations. But unlike the
  * `alpha_blend`, these functions take an ARGB16161616 pixel from the
@@ -177,4 +226,27 @@ static void convert_to_XRGB16161616(void *pixels_addr, int length,
 	}
 }
 
+static void convert_to_RGB565(void *pixels_addr, int length,
+			      u64 *line_buffer)
+{
+	__le16 *dst_pixels = pixels_addr;
+	int i;
+
+	for (i = 0; i < length; i++)  {
+		int fp_r = INT_TO_FP((line_buffer[i] >> 32) & 0xffff);
+		int fp_g = INT_TO_FP((line_buffer[i] >> 16) & 0xffff);
+		int fp_b = INT_TO_FP(line_buffer[i] & 0xffffllu);
+
+		int fp_rb_ratio = LF_TO_FP(2114.032258065);
+		int fp_g_ratio = LF_TO_FP(1040.238095238);
+
+		u16 r = FP_TO_INT_ROUND_UP(FP_DIV(fp_r, fp_rb_ratio));
+		u16 g = FP_TO_INT_ROUND_UP(FP_DIV(fp_g, fp_g_ratio));
+		u16 b = FP_TO_INT_ROUND_UP(FP_DIV(fp_b, fp_rb_ratio));
+
+		*dst_pixels = cpu_to_le16(r << 11 | g << 5 | b);
+		dst_pixels++;
+	}
+}
+
 #endif /* _VKMS_FORMATS_H_ */
diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
index 516e48b38806..de250808aa39 100644
--- a/drivers/gpu/drm/vkms/vkms_plane.c
+++ b/drivers/gpu/drm/vkms/vkms_plane.c
@@ -13,14 +13,16 @@
 
 static const u32 vkms_formats[] = {
 	DRM_FORMAT_XRGB8888,
-	DRM_FORMAT_XRGB16161616
+	DRM_FORMAT_XRGB16161616,
+	DRM_FORMAT_RGB565
 };
 
 static const u32 vkms_plane_formats[] = {
 	DRM_FORMAT_ARGB8888,
 	DRM_FORMAT_XRGB8888,
 	DRM_FORMAT_XRGB16161616,
-	DRM_FORMAT_ARGB16161616
+	DRM_FORMAT_ARGB16161616,
+	DRM_FORMAT_RGB565
 };
 
 static struct drm_plane_state *
diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
index 0f7bb77f981e..11eb1be5a0fc 100644
--- a/drivers/gpu/drm/vkms/vkms_writeback.c
+++ b/drivers/gpu/drm/vkms/vkms_writeback.c
@@ -15,7 +15,8 @@
 static const u32 vkms_wb_formats[] = {
 	DRM_FORMAT_XRGB8888,
 	DRM_FORMAT_XRGB16161616,
-	DRM_FORMAT_ARGB16161616
+	DRM_FORMAT_ARGB16161616,
+	DRM_FORMAT_RGB565
 };
 
 static const struct drm_connector_funcs vkms_wb_connector_funcs = {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-10-26 11:34 ` [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation Igor Torrente
@ 2021-10-28 21:38   ` Leandro Ribeiro
  2021-11-03 15:03     ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Leandro Ribeiro @ 2021-10-28 21:38 UTC (permalink / raw)
  To: Igor Torrente, rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: hamohammed.sa, daniel, airlied, contact, dri-devel

Hi,

On 10/26/21 08:34, Igor Torrente wrote:
> Add a helper function to validate the connector configuration received in
> the encoder atomic_check by the drivers.
> 
> So the drivers don't need to do these common validations themselves.
> 
> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> ---
> V2: Move the format verification to a new helper at the drm_atomic_helper.c
>     (Thomas Zimmermann).
> ---
>  drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
>  drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
>  include/drm/drm_atomic_helper.h       |  3 ++
>  3 files changed, 54 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 2c0c6ec92820..c2653b9824b5 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct drm_device *dev,
>  }
>  EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
>  
> +/**
> + * drm_atomic_helper_check_wb_encoder_state() - Check writeback encoder state
> + * @encoder: encoder state to check
> + * @conn_state: connector state to check
> + *
> + * Checks if the writeback connector state is valid, and returns an error if it
> + * isn't.
> + *
> + * RETURNS:
> + * Zero for success or -errno
> + */
> +int
> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
> +					 struct drm_connector_state *conn_state)
> +{
> +	struct drm_writeback_job *wb_job = conn_state->writeback_job;
> +	struct drm_property_blob *pixel_format_blob;
> +	bool format_supported = false;
> +	struct drm_framebuffer *fb;
> +	int i, n_formats;
> +	u32 *formats;
> +
> +	if (!wb_job || !wb_job->fb)
> +		return 0;

I think that this should be removed and that this function should
assume that (wb_job && wb_job->fb) == true.

Actually, it's weird to have conn_state as argument and only use it to
get the wb_job. Instead, this function could receive wb_job directly.

Of course, its name/description would have to change.

> +
> +	pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
> +	n_formats = pixel_format_blob->length / sizeof(u32);
> +	formats = pixel_format_blob->data;
> +	fb = wb_job->fb;
> +
> +	for (i = 0; i < n_formats; i++) {
> +		if (fb->format->format == formats[i]) {
> +			format_supported = true;
> +			break;
> +		}
> +	}
> +
> +	if (!format_supported) {
> +		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
> +			      &fb->format->format);
> +		return -EINVAL;
> +	}
> +
> +	return 0;

If you do this, you can get rid of the format_supported flag:

	for(...) {
		if (fb->format->format == formats[i])
			return 0;
	}


	DRM_DEBUG_KMS(...);
	return -EINVAL;
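
A minimal user-space sketch of that early-return shape, for illustration
only. It is not the kernel helper: wb_format_check() and its parameters are
hypothetical stand-ins for the helper and for the connector's format blob.

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/*
 * Hypothetical stand-in for the helper's lookup loop: return 0 as soon as
 * the framebuffer format is found in the list, -EINVAL otherwise.
 */
int wb_format_check(uint32_t fb_format, const uint32_t *formats,
		    size_t nformats)
{
	size_t i;

	for (i = 0; i < nformats; i++) {
		if (formats[i] == fb_format)
			return 0;	/* supported: no flag needed */
	}

	/* Falling out of the loop means no entry matched. */
	return -EINVAL;
}
```

With the early return, the only way to reach the debug message and the
-EINVAL is to exhaust the list, so the boolean flag becomes redundant.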

Thanks,
Leandro Ribeiro

> +}
> +EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
> +
>  /**
>   * drm_atomic_helper_check_plane_state() - Check plane state for validity
>   * @plane_state: plane state to check
> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
> index 32734cdbf6c2..42f3396c523a 100644
> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
> @@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
>  {
>  	struct drm_framebuffer *fb;
>  	const struct drm_display_mode *mode = &crtc_state->mode;
> +	int ret;
>  
>  	if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
>  		return 0;
> @@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
>  		return -EINVAL;
>  	}
>  
> -	if (fb->format->format != vkms_wb_formats[0]) {
> -		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
> -			      &fb->format->format);
> -		return -EINVAL;
> -	}
> +	ret = drm_atomic_helper_check_wb_encoder_state(encoder, conn_state);
> +	if (ret < 0)
> +		return ret;
>  
>  	return 0;
>  }
> diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h
> index 4045e2507e11..3fbf695da60f 100644
> --- a/include/drm/drm_atomic_helper.h
> +++ b/include/drm/drm_atomic_helper.h
> @@ -40,6 +40,9 @@ struct drm_private_state;
>  
>  int drm_atomic_helper_check_modeset(struct drm_device *dev,
>  				struct drm_atomic_state *state);
> +int
> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
> +					 struct drm_connector_state *conn_state);
>  int drm_atomic_helper_check_plane_state(struct drm_plane_state *plane_state,
>  					const struct drm_crtc_state *crtc_state,
>  					int min_scale,
> 


* Re: [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-10-28 21:38   ` Leandro Ribeiro
@ 2021-11-03 15:03     ` Igor Torrente
  2021-11-03 15:11       ` Leandro Ribeiro
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-11-03 15:03 UTC (permalink / raw)
  To: Leandro Ribeiro, rodrigosiqueiramelo, melissa.srw, ppaalanen,
	tzimmermann
  Cc: airlied, hamohammed.sa, dri-devel

Hi Leandro,

On 10/28/21 6:38 PM, Leandro Ribeiro wrote:
> Hi,
> 
> On 10/26/21 08:34, Igor Torrente wrote:
>> Add a helper function to validate the connector configuration receive in
>> the encoder atomic_check by the drivers.
>>
>> So the drivers don't need do these common validations themselves.
>>
>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>> ---
>> V2: Move the format verification to a new helper at the drm_atomic_helper.c
>>      (Thomas Zimmermann).
>> ---
>>   drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
>>   drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
>>   include/drm/drm_atomic_helper.h       |  3 ++
>>   3 files changed, 54 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
>> index 2c0c6ec92820..c2653b9824b5 100644
>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>> @@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct drm_device *dev,
>>   }
>>   EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
>>   
>> +/**
>> + * drm_atomic_helper_check_wb_connector_state() - Check writeback encoder state
>> + * @encoder: encoder state to check
>> + * @conn_state: connector state to check
>> + *
>> + * Checks if the wriback connector state is valid, and returns a erros if it
>> + * isn't.
>> + *
>> + * RETURNS:
>> + * Zero for success or -errno
>> + */
>> +int
>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>> +					 struct drm_connector_state *conn_state)
>> +{
>> +	struct drm_writeback_job *wb_job = conn_state->writeback_job;
>> +	struct drm_property_blob *pixel_format_blob;
>> +	bool format_supported = false;
>> +	struct drm_framebuffer *fb;
>> +	int i, n_formats;
>> +	u32 *formats;
>> +
>> +	if (!wb_job || !wb_job->fb)
>> +		return 0;
> 
> I think that this should be removed and that this functions should
> assume that (wb_job && wb_job->fb) == true.

Ok.

> 
> Actually, it's weird to have conn_state as argument and only use it to
> get the wb_job. Instead, this function could receive wb_job directly.

In Thomas's review of v1, he said that maybe other things could be
tested in this helper. I'm not sure what these additional checks could
be, so I tried to design the function signature expecting more things
to be added after his review.

As you can see, the helper is receiving the `drm_encoder` and doing
nothing with it.

If we, eventually, don't find anything else that this helper can do, I
will revert to something very similar (if not equal) to your proposal.
I just want to wait for Thomas's review first.

> 
> Of course, its name/description would have to change.
> 
>> +
>> +	pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
>> +	n_formats = pixel_format_blob->length / sizeof(u32);
>> +	formats = pixel_format_blob->data;
>> +	fb = wb_job->fb;
>> +
>> +	for (i = 0; i < n_formats; i++) {
>> +		if (fb->format->format == formats[i]) {
>> +			format_supported = true;
>> +			break;
>> +		}
>> +	}
>> +
>> +	if (!format_supported) {
>> +		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>> +			      &fb->format->format);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
> 
> If you do this, you can get rid of the format_supported flag:
> 
> 	for(...) {
> 		if (fb->format->format == formats[i])
> 			return 0;
> 	}
> 
> 
> 	DRM_DEBUG_KMS(...);
> 	return -EINVAL;
> 

Indeed. Thanks!

> Thanks,
> Leandro Ribeiro
> 
>> +}
>> +EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
>> +
>>   /**
>>    * drm_atomic_helper_check_plane_state() - Check plane state for validity
>>    * @plane_state: plane state to check
>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
>> index 32734cdbf6c2..42f3396c523a 100644
>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>> @@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
>>   {
>>   	struct drm_framebuffer *fb;
>>   	const struct drm_display_mode *mode = &crtc_state->mode;
>> +	int ret;
>>   
>>   	if (!conn_state->writeback_job || !conn_state->writeback_job->fb)
>>   		return 0;
>> @@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct drm_encoder *encoder,
>>   		return -EINVAL;
>>   	}
>>   
>> -	if (fb->format->format != vkms_wb_formats[0]) {
>> -		DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>> -			      &fb->format->format);
>> -		return -EINVAL;
>> -	}
>> +	ret = drm_atomic_helper_check_wb_encoder_state(encoder, conn_state);
>> +	if (ret < 0)
>> +		return ret;
>>   
>>   	return 0;
>>   }
>> diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h
>> index 4045e2507e11..3fbf695da60f 100644
>> --- a/include/drm/drm_atomic_helper.h
>> +++ b/include/drm/drm_atomic_helper.h
>> @@ -40,6 +40,9 @@ struct drm_private_state;
>>   
>>   int drm_atomic_helper_check_modeset(struct drm_device *dev,
>>   				struct drm_atomic_state *state);
>> +int
>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>> +					 struct drm_connector_state *conn_state);
>>   int drm_atomic_helper_check_plane_state(struct drm_plane_state *plane_state,
>>   					const struct drm_crtc_state *crtc_state,
>>   					int min_scale,
>>

Thanks,
---
Igor M. A. Torrente


* Re: [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-11-03 15:03     ` Igor Torrente
@ 2021-11-03 15:11       ` Leandro Ribeiro
  2021-11-03 15:37         ` Thomas Zimmermann
  0 siblings, 1 reply; 28+ messages in thread
From: Leandro Ribeiro @ 2021-11-03 15:11 UTC (permalink / raw)
  To: Igor Torrente, rodrigosiqueiramelo, melissa.srw, ppaalanen, tzimmermann
  Cc: airlied, hamohammed.sa, dri-devel

Hi,

On 11/3/21 12:03, Igor Torrente wrote:
> Hi Leandro,
> 
> On 10/28/21 6:38 PM, Leandro Ribeiro wrote:
>> Hi,
>>
>> On 10/26/21 08:34, Igor Torrente wrote:
>>> Add a helper function to validate the connector configuration receive in
>>> the encoder atomic_check by the drivers.
>>>
>>> So the drivers don't need do these common validations themselves.
>>>
>>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>>> ---
>>> V2: Move the format verification to a new helper at the
>>> drm_atomic_helper.c
>>>      (Thomas Zimmermann).
>>> ---
>>>   drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
>>>   drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
>>>   include/drm/drm_atomic_helper.h       |  3 ++
>>>   3 files changed, 54 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c
>>> b/drivers/gpu/drm/drm_atomic_helper.c
>>> index 2c0c6ec92820..c2653b9824b5 100644
>>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>>> @@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct
>>> drm_device *dev,
>>>   }
>>>   EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
>>>   +/**
>>> + * drm_atomic_helper_check_wb_connector_state() - Check writeback
>>> encoder state
>>> + * @encoder: encoder state to check
>>> + * @conn_state: connector state to check
>>> + *
>>> + * Checks if the wriback connector state is valid, and returns a
>>> erros if it
>>> + * isn't.
>>> + *
>>> + * RETURNS:
>>> + * Zero for success or -errno
>>> + */
>>> +int
>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>> +                     struct drm_connector_state *conn_state)
>>> +{
>>> +    struct drm_writeback_job *wb_job = conn_state->writeback_job;
>>> +    struct drm_property_blob *pixel_format_blob;
>>> +    bool format_supported = false;
>>> +    struct drm_framebuffer *fb;
>>> +    int i, n_formats;
>>> +    u32 *formats;
>>> +
>>> +    if (!wb_job || !wb_job->fb)
>>> +        return 0;
>>
>> I think that this should be removed and that this functions should
>> assume that (wb_job && wb_job->fb) == true.
> 
> Ok.
> 
>>
>> Actually, it's weird to have conn_state as argument and only use it to
>> get the wb_job. Instead, this function could receive wb_job directly.
> 
> In the Thomas review of v1, he said that maybe other things could be
> tested in this helper. I'm not sure what these additional checks could
> be, so I tried to design the function signature expecting more things
> to be added after his review.
> 
> As you can see, the helper is receiving the `drm_encoder` and doing
> nothing with it.
> 
> If we, eventually, don't find anything else that this helper can do, I
> will revert to something very similar (if not equal) to your proposal.
> I just want to wait for Thomas's review first.
>

Sure, that makes sense.

Thanks,
Leandro Ribeiro

>>
>> Of course, its name/description would have to change.
>>
>>> +
>>> +    pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
>>> +    n_formats = pixel_format_blob->length / sizeof(u32);
>>> +    formats = pixel_format_blob->data;
>>> +    fb = wb_job->fb;
>>> +
>>> +    for (i = 0; i < n_formats; i++) {
>>> +        if (fb->format->format == formats[i]) {
>>> +            format_supported = true;
>>> +            break;
>>> +        }
>>> +    }
>>> +
>>> +    if (!format_supported) {
>>> +        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>> +                  &fb->format->format);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    return 0;
>>
>> If you do this, you can get rid of the format_supported flag:
>>
>>     for(...) {
>>         if (fb->format->format == formats[i])
>>             return 0;
>>     }
>>
>>
>>     DRM_DEBUG_KMS(...);
>>     return -EINVAL;
>>
> 
> Indeed. Thanks!
> 
>> Thanks,
>> Leandro Ribeiro
>>
>>> +}
>>> +EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
>>> +
>>>   /**
>>>    * drm_atomic_helper_check_plane_state() - Check plane state for
>>> validity
>>>    * @plane_state: plane state to check
>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> index 32734cdbf6c2..42f3396c523a 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> @@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct
>>> drm_encoder *encoder,
>>>   {
>>>       struct drm_framebuffer *fb;
>>>       const struct drm_display_mode *mode = &crtc_state->mode;
>>> +    int ret;
>>>         if (!conn_state->writeback_job ||
>>> !conn_state->writeback_job->fb)
>>>           return 0;
>>> @@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct
>>> drm_encoder *encoder,
>>>           return -EINVAL;
>>>       }
>>>   -    if (fb->format->format != vkms_wb_formats[0]) {
>>> -        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>> -                  &fb->format->format);
>>> -        return -EINVAL;
>>> -    }
>>> +    ret = drm_atomic_helper_check_wb_encoder_state(encoder,
>>> conn_state);
>>> +    if (ret < 0)
>>> +        return ret;
>>>         return 0;
>>>   }
>>> diff --git a/include/drm/drm_atomic_helper.h
>>> b/include/drm/drm_atomic_helper.h
>>> index 4045e2507e11..3fbf695da60f 100644
>>> --- a/include/drm/drm_atomic_helper.h
>>> +++ b/include/drm/drm_atomic_helper.h
>>> @@ -40,6 +40,9 @@ struct drm_private_state;
>>>     int drm_atomic_helper_check_modeset(struct drm_device *dev,
>>>                   struct drm_atomic_state *state);
>>> +int
>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>> +                     struct drm_connector_state *conn_state);
>>>   int drm_atomic_helper_check_plane_state(struct drm_plane_state
>>> *plane_state,
>>>                       const struct drm_crtc_state *crtc_state,
>>>                       int min_scale,
>>>
> 
> Thanks,
> ---
> Igor M. A. Torrente


* Re: [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-11-03 15:11       ` Leandro Ribeiro
@ 2021-11-03 15:37         ` Thomas Zimmermann
  2021-11-03 18:41           ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Zimmermann @ 2021-11-03 15:37 UTC (permalink / raw)
  To: Leandro Ribeiro, Igor Torrente, rodrigosiqueiramelo, melissa.srw,
	ppaalanen
  Cc: airlied, hamohammed.sa, dri-devel



Hi

Am 03.11.21 um 16:11 schrieb Leandro Ribeiro:
> Hi,
> 
> On 11/3/21 12:03, Igor Torrente wrote:
>> Hi Leandro,
>>
>> On 10/28/21 6:38 PM, Leandro Ribeiro wrote:
>>> Hi,
>>>
>>> On 10/26/21 08:34, Igor Torrente wrote:
>>>> Add a helper function to validate the connector configuration receive in
>>>> the encoder atomic_check by the drivers.
>>>>
>>>> So the drivers don't need do these common validations themselves.
>>>>
>>>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>>>> ---
>>>> V2: Move the format verification to a new helper at the
>>>> drm_atomic_helper.c
>>>>       (Thomas Zimmermann).
>>>> ---
>>>>    drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
>>>>    drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
>>>>    include/drm/drm_atomic_helper.h       |  3 ++
>>>>    3 files changed, 54 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c
>>>> b/drivers/gpu/drm/drm_atomic_helper.c
>>>> index 2c0c6ec92820..c2653b9824b5 100644
>>>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>>>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>>>> @@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct
>>>> drm_device *dev,
>>>>    }
>>>>    EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
>>>>    +/**
>>>> + * drm_atomic_helper_check_wb_connector_state() - Check writeback
>>>> encoder state
>>>> + * @encoder: encoder state to check
>>>> + * @conn_state: connector state to check
>>>> + *
>>>> + * Checks if the wriback connector state is valid, and returns a

'writeback'

'an error'

>>>> erros if it

'error'

>>>> + * isn't.
>>>> + *
>>>> + * RETURNS:
>>>> + * Zero for success or -errno
>>>> + */
>>>> +int
>>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>>> +                     struct drm_connector_state *conn_state)
>>>> +{
>>>> +    struct drm_writeback_job *wb_job = conn_state->writeback_job;
>>>> +    struct drm_property_blob *pixel_format_blob;
>>>> +    bool format_supported = false;
>>>> +    struct drm_framebuffer *fb;
>>>> +    int i, n_formats;

Just 'nformats'.

Please make both variables 'size_t'.


>>>> +    u32 *formats;
>>>> +
>>>> +    if (!wb_job || !wb_job->fb)
>>>> +        return 0;
>>>
>>> I think that this should be removed and that this functions should
>>> assume that (wb_job && wb_job->fb) == true.
>>
>> Ok.

In regular atomic check for planes, there can be planes with no attached 
framebuffer. Helpers handle this situation. [1] I don't know if this is 
possible in writeback code, but for consistency, it would make sense to 
keep this test here. Not sure though.
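
A reduced sketch of what keeping the test would look like, assuming
hypothetical stand-in structures (fake_fb, fake_wb_job, fake_conn_state) in
place of the real DRM ones. A state with no job or no framebuffer is treated
as "nothing to validate" rather than as an error.

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/* Hypothetical, heavily reduced stand-ins for the DRM structures. */
struct fake_fb { uint32_t format; };
struct fake_wb_job { const struct fake_fb *fb; };
struct fake_conn_state { const struct fake_wb_job *writeback_job; };

int fake_check_wb_state(const struct fake_conn_state *conn_state,
			const uint32_t *formats, size_t nformats)
{
	const struct fake_wb_job *job = conn_state->writeback_job;
	size_t i;

	/* Nothing queued for writeback: nothing to validate, not an error. */
	if (!job || !job->fb)
		return 0;

	for (i = 0; i < nformats; i++) {
		if (formats[i] == job->fb->format)
			return 0;
	}

	return -EINVAL;
}
```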

>>
>>>
>>> Actually, it's weird to have conn_state as argument and only use it to
>>> get the wb_job. Instead, this function could receive wb_job directly.
>>
>> In the Thomas review of v1, he said that maybe other things could be
>> tested in this helper. I'm not sure what these additional checks could
>> be, so I tried to design the function signature expecting more things
>> to be added after his review.
>>
>> As you can see, the helper is receiving the `drm_encoder` and doing
>> nothing with it.
>>
>> If we, eventually, don't find anything else that this helper can do, I
>> will revert to something very similar (if not equal) to your proposal.
>> I just want to wait for Thomas's review first.
>>
> 
> Sure, that makes sense.

We had many helper functions for atomic modesetting that took various
arguments for whatever they required. Extending such a function with new
functionality/arguments required touching many drivers and made
the parameter list hard to read. At some point, Maxime went through most
of the code and unified it all to pass full state to the helpers.

So please keep the connector state. I think it's how we do things ATM.
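
To illustrate the point about passing full state, here is a small sketch
under loud assumptions: struct fake_state, its fields, and the width limit
are all made up for the example and do not correspond to real DRM members
or to any check proposed in this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/* Hypothetical, reduced stand-in for a connector state object. */
struct fake_state {
	uint32_t fb_format;	/* format of the attached writeback fb */
	uint32_t fb_width;	/* only read by the later, extended check */
	uint32_t max_width;	/* made-up limit, for illustration only */
	const uint32_t *formats;
	size_t nformats;
};

/*
 * Takes the whole state. A check added later (here: a made-up width limit)
 * reads another field without changing the prototype or any caller.
 */
int fake_check_state(const struct fake_state *state)
{
	size_t i;

	if (state->fb_width > state->max_width)
		return -EINVAL;

	for (i = 0; i < state->nformats; i++) {
		if (state->formats[i] == state->fb_format)
			return 0;
	}

	return -EINVAL;
}
```

Had the helper taken only the format and the list, adding the second check
would have meant a new parameter and a change in every caller.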

> 
> Thanks,
> Leandro Ribeiro
> 
>>>
>>> Of course, its name/description would have to change.
>>>
>>>> +
>>>> +    pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
>>>> +    n_formats = pixel_format_blob->length / sizeof(u32);
>>>> +    formats = pixel_format_blob->data;
>>>> +    fb = wb_job->fb;
>>>> +
>>>> +    for (i = 0; i < n_formats; i++) {
>>>> +        if (fb->format->format == formats[i]) {
>>>> +            format_supported = true;
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!format_supported) {
>>>> +        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>> +                  &fb->format->format);

Please use drm_dbg_kms() instead. There's a 100-character-per-line
limit. The comment probably fits onto a single line (?).

>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    return 0;
>>>
>>> If you do this, you can get rid of the format_supported flag:
>>>
>>>      for(...) {
>>>          if (fb->format->format == formats[i])
>>>              return 0;
>>>      }
>>>
>>>
>>>      DRM_DEBUG_KMS(...);
>>>      return -EINVAL;
>>>
>>
>> Indeed. Thanks!

Yes, that looks nicer.

>>
>>> Thanks,
>>> Leandro Ribeiro
>>>
>>>> +}
>>>> +EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
>>>> +
>>>>    /**
>>>>     * drm_atomic_helper_check_plane_state() - Check plane state for
>>>> validity
>>>>     * @plane_state: plane state to check
>>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> index 32734cdbf6c2..42f3396c523a 100644
>>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>> @@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct
>>>> drm_encoder *encoder,
>>>>    {
>>>>        struct drm_framebuffer *fb;
>>>>        const struct drm_display_mode *mode = &crtc_state->mode;
>>>> +    int ret;
>>>>          if (!conn_state->writeback_job ||
>>>> !conn_state->writeback_job->fb)
>>>>            return 0;
>>>> @@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct
>>>> drm_encoder *encoder,
>>>>            return -EINVAL;
>>>>        }
>>>>    -    if (fb->format->format != vkms_wb_formats[0]) {
>>>> -        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>> -                  &fb->format->format);
>>>> -        return -EINVAL;
>>>> -    }
>>>> +    ret = drm_atomic_helper_check_wb_encoder_state(encoder,
>>>> conn_state);
>>>> +    if (ret < 0)
>>>> +        return ret;

We usually use just 'if (ret)' for such tests. No need for a less-than.
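
A tiny sketch of why the plain test is enough, assuming a helper that
follows the usual convention of returning zero on success and a negative
errno on failure. fake_helper() and fake_atomic_check() are hypothetical
names used only for this example.

```c
#include <errno.h>

/* Hypothetical helper following the 0-or-negative-errno convention. */
int fake_helper(int format_ok)
{
	return format_ok ? 0 : -EINVAL;
}

/* Hypothetical driver-side caller. */
int fake_atomic_check(int format_ok)
{
	int ret;

	ret = fake_helper(format_ok);
	if (ret)	/* the helper never returns a positive value */
		return ret;

	return 0;
}
```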

Best regards
Thomas

[1] 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_atomic_helper.c#L809

>>>>          return 0;
>>>>    }
>>>> diff --git a/include/drm/drm_atomic_helper.h
>>>> b/include/drm/drm_atomic_helper.h
>>>> index 4045e2507e11..3fbf695da60f 100644
>>>> --- a/include/drm/drm_atomic_helper.h
>>>> +++ b/include/drm/drm_atomic_helper.h
>>>> @@ -40,6 +40,9 @@ struct drm_private_state;
>>>>      int drm_atomic_helper_check_modeset(struct drm_device *dev,
>>>>                    struct drm_atomic_state *state);
>>>> +int
>>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>>> +                     struct drm_connector_state *conn_state);
>>>>    int drm_atomic_helper_check_plane_state(struct drm_plane_state
>>>> *plane_state,
>>>>                        const struct drm_crtc_state *crtc_state,
>>>>                        int min_scale,
>>>>
>>
>> Thanks,
>> ---
>> Igor M. A. Torrente

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev



* Re: [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES
  2021-10-26 11:34 ` [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES Igor Torrente
@ 2021-11-03 15:40   ` Thomas Zimmermann
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Zimmermann @ 2021-11-03 15:40 UTC (permalink / raw)
  To: Igor Torrente, rodrigosiqueiramelo, melissa.srw, ppaalanen
  Cc: hamohammed.sa, airlied, dri-devel, leandro.ribeiro



Hi

Am 26.10.21 um 13:34 schrieb Igor Torrente:
> The `map` vector at `vkms_composer` uses a hardcoded value to define its
> size.
> 
> If someday the maximum number of planes increases, this hardcoded value
> can be a problem.
> 
> This value is being replaced with the DRM_FORMAT_MAX_PLANES macro.
> 
> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>

Acked-by: Thomas Zimmermann <tzimmermann@suse.de>

We can merge that immediately.

Best regards
Thomas

> ---
>   drivers/gpu/drm/vkms/vkms_drv.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index d48c23d40ce5..64e62993b06f 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -28,7 +28,7 @@ struct vkms_writeback_job {
>   struct vkms_composer {
>   	struct drm_framebuffer fb;
>   	struct drm_rect src, dst;
> -	struct dma_buf_map map[4];
> +	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>   	unsigned int offset;
>   	unsigned int pitch;
>   	unsigned int cpp;
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev



* Re: [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job`
  2021-10-26 11:34 ` [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job` Igor Torrente
@ 2021-11-03 15:45   ` Thomas Zimmermann
  2021-11-03 19:18     ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Zimmermann @ 2021-11-03 15:45 UTC (permalink / raw)
  To: Igor Torrente, rodrigosiqueiramelo, melissa.srw, ppaalanen
  Cc: hamohammed.sa, airlied, dri-devel, leandro.ribeiro



Hi

Am 26.10.21 um 13:34 schrieb Igor Torrente:
> This commit is the groundwork to introduce new formats to the planes and
> writeback buffer. As part of it, a new buffer metadata field is added to
> `vkms_writeback_job`, this metadata is represented by the `vkms_composer`
> struct.
> 
> This will allow us, in the future, to have different compositing and wb
> format types.
> 
> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> ---
> V2: Change the code to get the drm_framebuffer reference and not copy its
>      contents(Thomas Zimmermann).
> ---
>   drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++--
>   drivers/gpu/drm/vkms/vkms_drv.h       | 12 ++++++------
>   drivers/gpu/drm/vkms/vkms_plane.c     | 10 +++++-----
>   drivers/gpu/drm/vkms/vkms_writeback.c | 21 ++++++++++++++++++---
>   4 files changed, 31 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> index 82f79e508f81..383ca657ddf7 100644
> --- a/drivers/gpu/drm/vkms/vkms_composer.c
> +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> @@ -153,7 +153,7 @@ static void compose_plane(struct vkms_composer *primary_composer,
>   			  struct vkms_composer *plane_composer,
>   			  void *vaddr_out)
>   {
> -	struct drm_framebuffer *fb = &plane_composer->fb;
> +	struct drm_framebuffer *fb = plane_composer->fb;
>   	void *vaddr;
>   	void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
>   
> @@ -174,7 +174,7 @@ static int compose_active_planes(void **vaddr_out,
>   				 struct vkms_composer *primary_composer,
>   				 struct vkms_crtc_state *crtc_state)
>   {
> -	struct drm_framebuffer *fb = &primary_composer->fb;
> +	struct drm_framebuffer *fb = primary_composer->fb;
>   	struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
>   	const void *vaddr;
>   	int i;
> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
> index 64e62993b06f..9e4c1e95bbb1 100644
> --- a/drivers/gpu/drm/vkms/vkms_drv.h
> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
> @@ -20,13 +20,8 @@
>   #define XRES_MAX  8192
>   #define YRES_MAX  8192
>   
> -struct vkms_writeback_job {
> -	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
> -	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
> -};
> -
>   struct vkms_composer {
> -	struct drm_framebuffer fb;
> +	struct drm_framebuffer *fb;
>   	struct drm_rect src, dst;
>   	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>   	unsigned int offset;
> @@ -34,6 +29,11 @@ struct vkms_composer {
>   	unsigned int cpp;
>   };
>   
> +struct vkms_writeback_job {
> +	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
> +	struct vkms_composer composer;
> +};
> +
>   /**
>    * vkms_plane_state - Driver specific plane state
>    * @base: base plane state
> diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
> index 32409e15244b..0a28cb7a85e2 100644
> --- a/drivers/gpu/drm/vkms/vkms_plane.c
> +++ b/drivers/gpu/drm/vkms/vkms_plane.c
> @@ -50,12 +50,12 @@ static void vkms_plane_destroy_state(struct drm_plane *plane,
>   	struct vkms_plane_state *vkms_state = to_vkms_plane_state(old_state);
>   	struct drm_crtc *crtc = vkms_state->base.base.crtc;
>   
> -	if (crtc) {
> +	if (crtc && vkms_state->composer->fb) {
>   		/* dropping the reference we acquired in
>   		 * vkms_primary_plane_update()
>   		 */
> -		if (drm_framebuffer_read_refcount(&vkms_state->composer->fb))
> -			drm_framebuffer_put(&vkms_state->composer->fb);
> +		if (drm_framebuffer_read_refcount(vkms_state->composer->fb))
> +			drm_framebuffer_put(vkms_state->composer->fb);
>   	}
>   
>   	kfree(vkms_state->composer);
> @@ -110,9 +110,9 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
>   	composer = vkms_plane_state->composer;
>   	memcpy(&composer->src, &new_state->src, sizeof(struct drm_rect));
>   	memcpy(&composer->dst, &new_state->dst, sizeof(struct drm_rect));
> -	memcpy(&composer->fb, fb, sizeof(struct drm_framebuffer));
> +	composer->fb = fb;
>   	memcpy(&composer->map, &shadow_plane_state->data, sizeof(composer->map));
> -	drm_framebuffer_get(&composer->fb);
> +	drm_framebuffer_get(composer->fb);
>   	composer->offset = fb->offsets[0];
>   	composer->pitch = fb->pitches[0];
>   	composer->cpp = fb->format->cpp[0];
> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
> index 8694227f555f..32734cdbf6c2 100644
> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
> @@ -75,12 +75,15 @@ static int vkms_wb_prepare_job(struct drm_writeback_connector *wb_connector,
>   	if (!vkmsjob)
>   		return -ENOMEM;
>   
> -	ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
> +	ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, vkmsjob->data);
>   	if (ret) {
>   		DRM_ERROR("vmap failed: %d\n", ret);
>   		goto err_kfree;
>   	}
>   
> +	vkmsjob->composer.fb = job->fb;
> +	drm_framebuffer_get(vkmsjob->composer.fb);
> +
>   	job->priv = vkmsjob;
>   
>   	return 0;
> @@ -99,7 +102,10 @@ static void vkms_wb_cleanup_job(struct drm_writeback_connector *connector,
>   	if (!job->fb)
>   		return;
>   
> -	drm_gem_fb_vunmap(job->fb, vkmsjob->map);
> +	drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);
> +
> +	if (drm_framebuffer_read_refcount(vkmsjob->composer.fb))
> +		drm_framebuffer_put(vkmsjob->composer.fb);

Why is this protected by an if conditional?

Best regards
Thomas

>   
>   	vkmsdev = drm_device_to_vkms_device(job->fb->dev);
>   	vkms_set_composer(&vkmsdev->output, false);
> @@ -116,14 +122,23 @@ static void vkms_wb_atomic_commit(struct drm_connector *conn,
>   	struct drm_writeback_connector *wb_conn = &output->wb_connector;
>   	struct drm_connector_state *conn_state = wb_conn->base.state;
>   	struct vkms_crtc_state *crtc_state = output->composer_state;
> +	struct drm_framebuffer *fb = connector_state->writeback_job->fb;
> +	struct vkms_writeback_job *active_wb;
> +	struct vkms_composer *wb_composer;
>   
>   	if (!conn_state)
>   		return;
>   
>   	vkms_set_composer(&vkmsdev->output, true);
>   
> +	active_wb = conn_state->writeback_job->priv;
> +	wb_composer = &active_wb->composer;
> +
>   	spin_lock_irq(&output->composer_lock);
> -	crtc_state->active_writeback = conn_state->writeback_job->priv;
> +	crtc_state->active_writeback = active_wb;
> +	wb_composer->offset = fb->offsets[0];
> +	wb_composer->pitch = fb->pitches[0];
> +	wb_composer->cpp = fb->format->cpp[0];
>   	crtc_state->wb_pending = true;
>   	spin_unlock_irq(&output->composer_lock);
>   	drm_writeback_queue_job(wb_conn, connector_state);
> 

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation
  2021-11-03 15:37         ` Thomas Zimmermann
@ 2021-11-03 18:41           ` Igor Torrente
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-11-03 18:41 UTC (permalink / raw)
  To: Thomas Zimmermann, Leandro Ribeiro, rodrigosiqueiramelo,
	melissa.srw, ppaalanen
  Cc: airlied, hamohammed.sa, dri-devel

Hi Thomas,

On 11/3/21 12:37 PM, Thomas Zimmermann wrote:
> Hi
> 
> Am 03.11.21 um 16:11 schrieb Leandro Ribeiro:
>> Hi,
>>
>> On 11/3/21 12:03, Igor Torrente wrote:
>>> Hi Leandro,
>>>
>>> On 10/28/21 6:38 PM, Leandro Ribeiro wrote:
>>>> Hi,
>>>>
>>>> On 10/26/21 08:34, Igor Torrente wrote:
>>>>> Add a helper function to validate the connector configuration received
>>>>> in the encoder atomic_check by the drivers.
>>>>>
>>>>> So the drivers don't need to do these common validations themselves.
>>>>>
>>>>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>>>>> ---
>>>>> V2: Move the format verification to a new helper at the
>>>>> drm_atomic_helper.c
>>>>>        (Thomas Zimmermann).
>>>>> ---
>>>>>     drivers/gpu/drm/drm_atomic_helper.c   | 47 +++++++++++++++++++++++++++
>>>>>     drivers/gpu/drm/vkms/vkms_writeback.c |  9 +++--
>>>>>     include/drm/drm_atomic_helper.h       |  3 ++
>>>>>     3 files changed, 54 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c
>>>>> b/drivers/gpu/drm/drm_atomic_helper.c
>>>>> index 2c0c6ec92820..c2653b9824b5 100644
>>>>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>>>>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>>>>> @@ -766,6 +766,53 @@ drm_atomic_helper_check_modeset(struct
>>>>> drm_device *dev,
>>>>>     }
>>>>>     EXPORT_SYMBOL(drm_atomic_helper_check_modeset);
>>>>>     +/**
>>>>> + * drm_atomic_helper_check_wb_connector_state() - Check writeback
>>>>> encoder state
>>>>> + * @encoder: encoder state to check
>>>>> + * @conn_state: connector state to check
>>>>> + *
>>>>> + * Checks if the wriback connector state is valid, and returns a
> 
> 'writeback'
> 
> 'an error'
> 
>>>>> erros if it
> 
> 'error'
> 
>>>>> + * isn't.
>>>>> + *
>>>>> + * RETURNS:
>>>>> + * Zero for success or -errno
>>>>> + */
>>>>> +int
>>>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>>>> +                     struct drm_connector_state *conn_state)
>>>>> +{
>>>>> +    struct drm_writeback_job *wb_job = conn_state->writeback_job;
>>>>> +    struct drm_property_blob *pixel_format_blob;
>>>>> +    bool format_supported = false;
>>>>> +    struct drm_framebuffer *fb;
>>>>> +    int i, n_formats;
> 
> Just 'nformats'.
> 
> Please make both variables 'size_t'.

I will correct all these minor issues.

> 
> 
>>>>> +    u32 *formats;
>>>>> +
>>>>> +    if (!wb_job || !wb_job->fb)
>>>>> +        return 0;
>>>>
>>>> I think that this should be removed and that this function should
>>>> assume that (wb_job && wb_job->fb) == true.
>>>
>>> Ok.
> 
> In regular atomic check for planes, there can be planes with no attached
> framebuffer. Helpers handle this situation. [1] I don't know if this is
> possible in writeback code, but for consistency, it would make sense to
> keep this test here. Not sure though.

@Leandro, do you know if it is possible to have a wb_job without a fb
attached?

> 
>>>
>>>>
>>>> Actually, it's weird to have conn_state as argument and only use it to
>>>> get the wb_job. Instead, this function could receive wb_job directly.
>>>
>>> In the Thomas review of v1, he said that maybe other things could be
>>> tested in this helper. I'm not sure what these additional checks could
>>> be, so I tried to design the function signature expecting more things
>>> to be added after his review.
>>>
>>> As you can see, the helper is receiving the `drm_encoder` and doing
>>> nothing with it.
>>>
>>> If we, eventually, don't find anything else that this helper can do, I
>>> will revert to something very similar (if not equal) to your proposal.
>>> I just want to wait for Thomas's review first.
>>>
>>
>> Sure, that makes sense.
> 
> We had many helper functions for atomic modesetting that took various
> arguments for whatever they required. Extending such a function with new
> functionality/arguments required touching many drivers and made
> the parameter list hard to read. At some point, Maxime went through most
> of the code and unified it all to pass full state to the helpers.
> 
> So please keep the connector state. I think it's how we do things ATM.

OK, I will keep it then.

> 
>>
>> Thanks,
>> Leandro Ribeiro
>>
>>>>
>>>> Of course, its name/description would have to change.
>>>>
>>>>> +
>>>>> +    pixel_format_blob = wb_job->connector->pixel_formats_blob_ptr;
>>>>> +    n_formats = pixel_format_blob->length / sizeof(u32);
>>>>> +    formats = pixel_format_blob->data;
>>>>> +    fb = wb_job->fb;
>>>>> +
>>>>> +    for (i = 0; i < n_formats; i++) {
>>>>> +        if (fb->format->format == formats[i]) {
>>>>> +            format_supported = true;
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (!format_supported) {
>>>>> +        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>>> +                  &fb->format->format);
> 
> Please use drm_dbg_kms() instead. There's a 100-character-per-line
> limit. The comment probably fits onto a single line.(?)


I will improve that. This code came from vkms, which follows the
80-character limit, if I'm not mistaken.

> 
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>> +
>>>>> +    return 0;
>>>>
>>>> If you do this, you can get rid of the format_supported flag:
>>>>
>>>>       for(...) {
>>>>           if (fb->format->format == formats[i])
>>>>               return 0;
>>>>       }
>>>>
>>>>
>>>>       DRM_DEBUG_KMS(...);
>>>>       return -EINVAL;
>>>>
>>>
>>> Indeed. Thanks!
> 
> Yes, that looks nicer.
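
For reference, the early-return form suggested above can be sketched outside
the kernel roughly as follows. This is only a minimal illustration, not the
helper from the patch: `format_is_supported()`, `FOURCC()` and the parameter
names are hypothetical, and plain `uint32_t` stands in for `u32`.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four characters into a 32-bit code, least significant byte first. */
#define FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/*
 * Hypothetical stand-in for the format check: return 0 as soon as a match
 * is found, so no separate format_supported flag is needed.
 */
static int format_is_supported(uint32_t format, const uint32_t *formats,
			       size_t nformats)
{
	size_t i;

	for (i = 0; i < nformats; i++) {
		if (formats[i] == format)
			return 0;
	}

	/* Falling out of the loop means no entry matched. */
	return -EINVAL;
}
```

Using `size_t` for both the index and the count follows the review comment
above; an empty format list is rejected because the loop body never runs.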
> 
>>>
>>>> Thanks,
>>>> Leandro Ribeiro
>>>>
>>>>> +}
>>>>> +EXPORT_SYMBOL(drm_atomic_helper_check_wb_encoder_state);
>>>>> +
>>>>>     /**
>>>>>      * drm_atomic_helper_check_plane_state() - Check plane state for
>>>>> validity
>>>>>      * @plane_state: plane state to check
>>>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> index 32734cdbf6c2..42f3396c523a 100644
>>>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>>>> @@ -30,6 +30,7 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>> drm_encoder *encoder,
>>>>>     {
>>>>>         struct drm_framebuffer *fb;
>>>>>         const struct drm_display_mode *mode = &crtc_state->mode;
>>>>> +    int ret;
>>>>>           if (!conn_state->writeback_job ||
>>>>> !conn_state->writeback_job->fb)
>>>>>             return 0;
>>>>> @@ -41,11 +42,9 @@ static int vkms_wb_encoder_atomic_check(struct
>>>>> drm_encoder *encoder,
>>>>>             return -EINVAL;
>>>>>         }
>>>>>     -    if (fb->format->format != vkms_wb_formats[0]) {
>>>>> -        DRM_DEBUG_KMS("Invalid pixel format %p4cc\n",
>>>>> -                  &fb->format->format);
>>>>> -        return -EINVAL;
>>>>> -    }
>>>>> +    ret = drm_atomic_helper_check_wb_encoder_state(encoder,
>>>>> conn_state);
>>>>> +    if (ret < 0)
>>>>> +        return ret;
> 
> We usually use just 'if (ret)' for such a test. No need for a less-than.

I will change that.

> 
> Best regards
> Thomas
> 
> [1]
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_atomic_helper.c#L809
> 
>>>>>           return 0;
>>>>>     }
>>>>> diff --git a/include/drm/drm_atomic_helper.h
>>>>> b/include/drm/drm_atomic_helper.h
>>>>> index 4045e2507e11..3fbf695da60f 100644
>>>>> --- a/include/drm/drm_atomic_helper.h
>>>>> +++ b/include/drm/drm_atomic_helper.h
>>>>> @@ -40,6 +40,9 @@ struct drm_private_state;
>>>>>       int drm_atomic_helper_check_modeset(struct drm_device *dev,
>>>>>                     struct drm_atomic_state *state);
>>>>> +int
>>>>> +drm_atomic_helper_check_wb_encoder_state(struct drm_encoder *encoder,
>>>>> +                     struct drm_connector_state *conn_state);
>>>>>     int drm_atomic_helper_check_plane_state(struct drm_plane_state
>>>>> *plane_state,
>>>>>                         const struct drm_crtc_state *crtc_state,
>>>>>                         int min_scale,
>>>>>
>>>
>>> Thanks,
>>> ---
>>> Igor M. A. Torrente
> 

Thanks,
---
Igor M. A. Torrente

* Re: [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job`
  2021-11-03 15:45   ` Thomas Zimmermann
@ 2021-11-03 19:18     ` Igor Torrente
  2021-11-04  7:21       ` Thomas Zimmermann
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-11-03 19:18 UTC (permalink / raw)
  To: Thomas Zimmermann, rodrigosiqueiramelo, melissa.srw, ppaalanen
  Cc: hamohammed.sa, airlied, dri-devel, leandro.ribeiro

Hi Thomas,

On 11/3/21 12:45 PM, Thomas Zimmermann wrote:
> Hi
> 
> Am 26.10.21 um 13:34 schrieb Igor Torrente:
>> This commit is the groundwork to introduce new formats to the planes and
>> writeback buffer. As part of it, a new buffer metadata field is added to
>> `vkms_writeback_job`, this metadata is represented by the `vkms_composer`
>> struct.
>>
>> This will allow us, in the future, to have different compositing and wb
>> format types.
>>
>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>> ---
>> V2: Change the code to get the drm_framebuffer reference and not copy its
>>       contents(Thomas Zimmermann).
>> ---
>>    drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++--
>>    drivers/gpu/drm/vkms/vkms_drv.h       | 12 ++++++------
>>    drivers/gpu/drm/vkms/vkms_plane.c     | 10 +++++-----
>>    drivers/gpu/drm/vkms/vkms_writeback.c | 21 ++++++++++++++++++---
>>    4 files changed, 31 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
>> index 82f79e508f81..383ca657ddf7 100644
>> --- a/drivers/gpu/drm/vkms/vkms_composer.c
>> +++ b/drivers/gpu/drm/vkms/vkms_composer.c
>> @@ -153,7 +153,7 @@ static void compose_plane(struct vkms_composer *primary_composer,
>>    			  struct vkms_composer *plane_composer,
>>    			  void *vaddr_out)
>>    {
>> -	struct drm_framebuffer *fb = &plane_composer->fb;
>> +	struct drm_framebuffer *fb = plane_composer->fb;
>>    	void *vaddr;
>>    	void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
>>    
>> @@ -174,7 +174,7 @@ static int compose_active_planes(void **vaddr_out,
>>    				 struct vkms_composer *primary_composer,
>>    				 struct vkms_crtc_state *crtc_state)
>>    {
>> -	struct drm_framebuffer *fb = &primary_composer->fb;
>> +	struct drm_framebuffer *fb = primary_composer->fb;
>>    	struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
>>    	const void *vaddr;
>>    	int i;
>> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
>> index 64e62993b06f..9e4c1e95bbb1 100644
>> --- a/drivers/gpu/drm/vkms/vkms_drv.h
>> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
>> @@ -20,13 +20,8 @@
>>    #define XRES_MAX  8192
>>    #define YRES_MAX  8192
>>    
>> -struct vkms_writeback_job {
>> -	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>> -	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
>> -};
>> -
>>    struct vkms_composer {
>> -	struct drm_framebuffer fb;
>> +	struct drm_framebuffer *fb;
>>    	struct drm_rect src, dst;
>>    	struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>>    	unsigned int offset;
>> @@ -34,6 +29,11 @@ struct vkms_composer {
>>    	unsigned int cpp;
>>    };
>>    
>> +struct vkms_writeback_job {
>> +	struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
>> +	struct vkms_composer composer;
>> +};
>> +
>>    /**
>>     * vkms_plane_state - Driver specific plane state
>>     * @base: base plane state
>> diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
>> index 32409e15244b..0a28cb7a85e2 100644
>> --- a/drivers/gpu/drm/vkms/vkms_plane.c
>> +++ b/drivers/gpu/drm/vkms/vkms_plane.c
>> @@ -50,12 +50,12 @@ static void vkms_plane_destroy_state(struct drm_plane *plane,
>>    	struct vkms_plane_state *vkms_state = to_vkms_plane_state(old_state);
>>    	struct drm_crtc *crtc = vkms_state->base.base.crtc;
>>    
>> -	if (crtc) {
>> +	if (crtc && vkms_state->composer->fb) {
>>    		/* dropping the reference we acquired in
>>    		 * vkms_primary_plane_update()
>>    		 */
>> -		if (drm_framebuffer_read_refcount(&vkms_state->composer->fb))
>> -			drm_framebuffer_put(&vkms_state->composer->fb);
>> +		if (drm_framebuffer_read_refcount(vkms_state->composer->fb))
>> +			drm_framebuffer_put(vkms_state->composer->fb);
>>    	}
>>    
>>    	kfree(vkms_state->composer);
>> @@ -110,9 +110,9 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
>>    	composer = vkms_plane_state->composer;
>>    	memcpy(&composer->src, &new_state->src, sizeof(struct drm_rect));
>>    	memcpy(&composer->dst, &new_state->dst, sizeof(struct drm_rect));
>> -	memcpy(&composer->fb, fb, sizeof(struct drm_framebuffer));
>> +	composer->fb = fb;
>>    	memcpy(&composer->map, &shadow_plane_state->data, sizeof(composer->map));
>> -	drm_framebuffer_get(&composer->fb);
>> +	drm_framebuffer_get(composer->fb);
>>    	composer->offset = fb->offsets[0];
>>    	composer->pitch = fb->pitches[0];
>>    	composer->cpp = fb->format->cpp[0];
>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c b/drivers/gpu/drm/vkms/vkms_writeback.c
>> index 8694227f555f..32734cdbf6c2 100644
>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>> @@ -75,12 +75,15 @@ static int vkms_wb_prepare_job(struct drm_writeback_connector *wb_connector,
>>    	if (!vkmsjob)
>>    		return -ENOMEM;
>>    
>> -	ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
>> +	ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, vkmsjob->data);
>>    	if (ret) {
>>    		DRM_ERROR("vmap failed: %d\n", ret);
>>    		goto err_kfree;
>>    	}
>>    
>> +	vkmsjob->composer.fb = job->fb;
>> +	drm_framebuffer_get(vkmsjob->composer.fb);
>> +
>>    	job->priv = vkmsjob;
>>    
>>    	return 0;
>> @@ -99,7 +102,10 @@ static void vkms_wb_cleanup_job(struct drm_writeback_connector *connector,
>>    	if (!job->fb)
>>    		return;
>>    
>> -	drm_gem_fb_vunmap(job->fb, vkmsjob->map);
>> +	drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);
>> +
>> +	if (drm_framebuffer_read_refcount(vkmsjob->composer.fb))
>> +		drm_framebuffer_put(vkmsjob->composer.fb);
> 
> Why is this protected by an if conditional?

Here, I followed what was done in the vkms_plane code, just adapting it
to the writeback callbacks.

I added this `if` because I wasn't 100% sure that each
`vkms_wb_prepare_job` would be followed by exactly one
`vkms_wb_cleanup_job`.

That is what happened in my testing, but I can't guarantee that it will
always happen.

> 
> Best regards
> Thomas
> 
>>    
>>    	vkmsdev = drm_device_to_vkms_device(job->fb->dev);
>>    	vkms_set_composer(&vkmsdev->output, false);
>> @@ -116,14 +122,23 @@ static void vkms_wb_atomic_commit(struct drm_connector *conn,
>>    	struct drm_writeback_connector *wb_conn = &output->wb_connector;
>>    	struct drm_connector_state *conn_state = wb_conn->base.state;
>>    	struct vkms_crtc_state *crtc_state = output->composer_state;
>> +	struct drm_framebuffer *fb = connector_state->writeback_job->fb;
>> +	struct vkms_writeback_job *active_wb;
>> +	struct vkms_composer *wb_composer;
>>    
>>    	if (!conn_state)
>>    		return;
>>    
>>    	vkms_set_composer(&vkmsdev->output, true);
>>    
>> +	active_wb = conn_state->writeback_job->priv;
>> +	wb_composer = &active_wb->composer;
>> +
>>    	spin_lock_irq(&output->composer_lock);
>> -	crtc_state->active_writeback = conn_state->writeback_job->priv;
>> +	crtc_state->active_writeback = active_wb;
>> +	wb_composer->offset = fb->offsets[0];
>> +	wb_composer->pitch = fb->pitches[0];
>> +	wb_composer->cpp = fb->format->cpp[0];
>>    	crtc_state->wb_pending = true;
>>    	spin_unlock_irq(&output->composer_lock);
>>    	drm_writeback_queue_job(wb_conn, connector_state);
>>
> 

Thanks,
---
Igor M. A. Torrente

* Re: [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job`
  2021-11-03 19:18     ` Igor Torrente
@ 2021-11-04  7:21       ` Thomas Zimmermann
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Zimmermann @ 2021-11-04  7:21 UTC (permalink / raw)
  To: Igor Torrente, rodrigosiqueiramelo, melissa.srw, ppaalanen
  Cc: hamohammed.sa, airlied, dri-devel, leandro.ribeiro


Hi

Am 03.11.21 um 20:18 schrieb Igor Torrente:
> Hi Thomas,
> 
> On 11/3/21 12:45 PM, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 26.10.21 um 13:34 schrieb Igor Torrente:
>>> This commit is the groundwork to introduce new formats to the planes and
>>> writeback buffer. As part of it, a new buffer metadata field is added to
>>> `vkms_writeback_job`, this metadata is represented by the 
>>> `vkms_composer`
>>> struct.
>>>
>>> This will allow us, in the future, to have different compositing and wb
>>> format types.
>>>
>>> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
>>> ---
>>> V2: Change the code to get the drm_framebuffer reference and not copy 
>>> its
>>>       contents(Thomas Zimmermann).
>>> ---
>>>    drivers/gpu/drm/vkms/vkms_composer.c  |  4 ++--
>>>    drivers/gpu/drm/vkms/vkms_drv.h       | 12 ++++++------
>>>    drivers/gpu/drm/vkms/vkms_plane.c     | 10 +++++-----
>>>    drivers/gpu/drm/vkms/vkms_writeback.c | 21 ++++++++++++++++++---
>>>    4 files changed, 31 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/vkms/vkms_composer.c 
>>> b/drivers/gpu/drm/vkms/vkms_composer.c
>>> index 82f79e508f81..383ca657ddf7 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_composer.c
>>> +++ b/drivers/gpu/drm/vkms/vkms_composer.c
>>> @@ -153,7 +153,7 @@ static void compose_plane(struct vkms_composer 
>>> *primary_composer,
>>>                  struct vkms_composer *plane_composer,
>>>                  void *vaddr_out)
>>>    {
>>> -    struct drm_framebuffer *fb = &plane_composer->fb;
>>> +    struct drm_framebuffer *fb = plane_composer->fb;
>>>        void *vaddr;
>>>        void (*pixel_blend)(const u8 *p_src, u8 *p_dst);
>>> @@ -174,7 +174,7 @@ static int compose_active_planes(void **vaddr_out,
>>>                     struct vkms_composer *primary_composer,
>>>                     struct vkms_crtc_state *crtc_state)
>>>    {
>>> -    struct drm_framebuffer *fb = &primary_composer->fb;
>>> +    struct drm_framebuffer *fb = primary_composer->fb;
>>>        struct drm_gem_object *gem_obj = drm_gem_fb_get_obj(fb, 0);
>>>        const void *vaddr;
>>>        int i;
>>> diff --git a/drivers/gpu/drm/vkms/vkms_drv.h 
>>> b/drivers/gpu/drm/vkms/vkms_drv.h
>>> index 64e62993b06f..9e4c1e95bbb1 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_drv.h
>>> +++ b/drivers/gpu/drm/vkms/vkms_drv.h
>>> @@ -20,13 +20,8 @@
>>>    #define XRES_MAX  8192
>>>    #define YRES_MAX  8192
>>> -struct vkms_writeback_job {
>>> -    struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>>> -    struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
>>> -};
>>> -
>>>    struct vkms_composer {
>>> -    struct drm_framebuffer fb;
>>> +    struct drm_framebuffer *fb;
>>>        struct drm_rect src, dst;
>>>        struct dma_buf_map map[DRM_FORMAT_MAX_PLANES];
>>>        unsigned int offset;
>>> @@ -34,6 +29,11 @@ struct vkms_composer {
>>>        unsigned int cpp;
>>>    };
>>> +struct vkms_writeback_job {
>>> +    struct dma_buf_map data[DRM_FORMAT_MAX_PLANES];
>>> +    struct vkms_composer composer;
>>> +};
>>> +
>>>    /**
>>>     * vkms_plane_state - Driver specific plane state
>>>     * @base: base plane state
>>> diff --git a/drivers/gpu/drm/vkms/vkms_plane.c 
>>> b/drivers/gpu/drm/vkms/vkms_plane.c
>>> index 32409e15244b..0a28cb7a85e2 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_plane.c
>>> +++ b/drivers/gpu/drm/vkms/vkms_plane.c
>>> @@ -50,12 +50,12 @@ static void vkms_plane_destroy_state(struct 
>>> drm_plane *plane,
>>>        struct vkms_plane_state *vkms_state = 
>>> to_vkms_plane_state(old_state);
>>>        struct drm_crtc *crtc = vkms_state->base.base.crtc;
>>> -    if (crtc) {
>>> +    if (crtc && vkms_state->composer->fb) {
>>>            /* dropping the reference we acquired in
>>>             * vkms_primary_plane_update()
>>>             */
>>> -        if (drm_framebuffer_read_refcount(&vkms_state->composer->fb))
>>> -            drm_framebuffer_put(&vkms_state->composer->fb);
>>> +        if (drm_framebuffer_read_refcount(vkms_state->composer->fb))
>>> +            drm_framebuffer_put(vkms_state->composer->fb);
>>>        }
>>>        kfree(vkms_state->composer);
>>> @@ -110,9 +110,9 @@ static void vkms_plane_atomic_update(struct 
>>> drm_plane *plane,
>>>        composer = vkms_plane_state->composer;
>>>        memcpy(&composer->src, &new_state->src, sizeof(struct drm_rect));
>>>        memcpy(&composer->dst, &new_state->dst, sizeof(struct drm_rect));
>>> -    memcpy(&composer->fb, fb, sizeof(struct drm_framebuffer));
>>> +    composer->fb = fb;
>>>        memcpy(&composer->map, &shadow_plane_state->data, 
>>> sizeof(composer->map));
>>> -    drm_framebuffer_get(&composer->fb);
>>> +    drm_framebuffer_get(composer->fb);
>>>        composer->offset = fb->offsets[0];
>>>        composer->pitch = fb->pitches[0];
>>>        composer->cpp = fb->format->cpp[0];
>>> diff --git a/drivers/gpu/drm/vkms/vkms_writeback.c 
>>> b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> index 8694227f555f..32734cdbf6c2 100644
>>> --- a/drivers/gpu/drm/vkms/vkms_writeback.c
>>> +++ b/drivers/gpu/drm/vkms/vkms_writeback.c
>>> @@ -75,12 +75,15 @@ static int vkms_wb_prepare_job(struct 
>>> drm_writeback_connector *wb_connector,
>>>        if (!vkmsjob)
>>>            return -ENOMEM;
>>> -    ret = drm_gem_fb_vmap(job->fb, vkmsjob->map, vkmsjob->data);
>>> +    ret = drm_gem_fb_vmap(job->fb, vkmsjob->composer.map, 
>>> vkmsjob->data);
>>>        if (ret) {
>>>            DRM_ERROR("vmap failed: %d\n", ret);
>>>            goto err_kfree;
>>>        }
>>> +    vkmsjob->composer.fb = job->fb;
>>> +    drm_framebuffer_get(vkmsjob->composer.fb);
>>> +
>>>        job->priv = vkmsjob;
>>>        return 0;
>>> @@ -99,7 +102,10 @@ static void vkms_wb_cleanup_job(struct 
>>> drm_writeback_connector *connector,
>>>        if (!job->fb)
>>>            return;
>>> -    drm_gem_fb_vunmap(job->fb, vkmsjob->map);
>>> +    drm_gem_fb_vunmap(job->fb, vkmsjob->composer.map);
>>> +
>>> +    if (drm_framebuffer_read_refcount(vkmsjob->composer.fb))
>>> +        drm_framebuffer_put(vkmsjob->composer.fb);
>>
>> Why is this protected by an if conditional?
> 
> Here, I followed what was done in the vkms_plane code, just adapting it
> to the writeback callbacks.
> 
> I added this `if` because I wasn't 100% sure that each
> `vkms_wb_prepare_job` would be followed by exactly one
> `vkms_wb_cleanup_job`.
> 
> That is what happened in my testing, but I can't guarantee that it will
> always happen.

It would be strange if it wasn't like that. Sounds like a bug to me. The
docs say that the cleanup is called for committed and aborted changes.
TBH, I'd leave out the condition.

Maybe a vkms maintainer can comment on this?
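
To illustrate the pairing argument: if every prepare takes exactly one
reference and every cleanup drops exactly one, the put needs no refcount
guard. The following is a toy sketch in plain C, assuming nothing about the
real DRM types; all `toy_*` names are hypothetical stand-ins.

```c
#include <stddef.h>

/* Toy refcounted object; a stand-in for a framebuffer, not the real struct. */
struct toy_fb {
	int refcount;
	int released;
};

struct toy_job {
	struct toy_fb *fb;
};

static void toy_fb_get(struct toy_fb *fb)
{
	fb->refcount++;
}

static void toy_fb_put(struct toy_fb *fb)
{
	/* The object is released when the last reference goes away. */
	if (--fb->refcount == 0)
		fb->released = 1;
}

/* Prepare takes exactly one reference on behalf of the job. */
static void toy_prepare_job(struct toy_job *job, struct toy_fb *fb)
{
	job->fb = fb;
	toy_fb_get(fb);
}

/* Cleanup drops that one reference unconditionally. */
static void toy_cleanup_job(struct toy_job *job)
{
	if (!job->fb)
		return;

	toy_fb_put(job->fb);
	job->fb = NULL;
}
```

Clearing `job->fb` after the put is a choice made for this sketch only; it
makes a second cleanup call harmless without having to read the refcount.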

Best regards
Thomas

> 
>>
>> Best regards
>> Thomas
>>
>>>        vkmsdev = drm_device_to_vkms_device(job->fb->dev);
>>>        vkms_set_composer(&vkmsdev->output, false);
>>> @@ -116,14 +122,23 @@ static void vkms_wb_atomic_commit(struct 
>>> drm_connector *conn,
>>>        struct drm_writeback_connector *wb_conn = &output->wb_connector;
>>>        struct drm_connector_state *conn_state = wb_conn->base.state;
>>>        struct vkms_crtc_state *crtc_state = output->composer_state;
>>> +    struct drm_framebuffer *fb = connector_state->writeback_job->fb;
>>> +    struct vkms_writeback_job *active_wb;
>>> +    struct vkms_composer *wb_composer;
>>>        if (!conn_state)
>>>            return;
>>>        vkms_set_composer(&vkmsdev->output, true);
>>> +    active_wb = conn_state->writeback_job->priv;
>>> +    wb_composer = &active_wb->composer;
>>> +
>>>        spin_lock_irq(&output->composer_lock);
>>> -    crtc_state->active_writeback = conn_state->writeback_job->priv;
>>> +    crtc_state->active_writeback = active_wb;
>>> +    wb_composer->offset = fb->offsets[0];
>>> +    wb_composer->pitch = fb->pitches[0];
>>> +    wb_composer->cpp = fb->format->cpp[0];
>>>        crtc_state->wb_pending = true;
>>>        spin_unlock_irq(&output->composer_lock);
>>>        drm_writeback_queue_job(wb_conn, connector_state);
>>>
>>
> 
> Thanks,
> ---
> Igor M. A. Torrente

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev

* Re: [PATCH v2 0/8] Add new formats support to vkms
  2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
                   ` (8 preceding siblings ...)
  2021-10-26 11:34 ` [PATCH v2 8/8] drm: vkms: Add support to " Igor Torrente
@ 2021-11-09  9:32 ` Pekka Paalanen
  2021-11-10 17:32   ` Igor Torrente
  9 siblings, 1 reply; 28+ messages in thread
From: Pekka Paalanen @ 2021-11-09  9:32 UTC (permalink / raw)
  To: Igor Torrente
  Cc: hamohammed.sa, tzimmermann, rodrigosiqueiramelo, airlied,
	leandro.ribeiro, melissa.srw, dri-devel

On Tue, 26 Oct 2021 08:34:00 -0300
Igor Torrente <igormtorrente@gmail.com> wrote:

> Summary
> =======
> This series of patches refactor some vkms components in order to introduce
> new formats to the planes and writeback connector.
> 
> Now in the blend function, the plane's pixels are converted to ARGB16161616
> and then blended together.
> 
> The CRC is calculated based on the ARGB16161616 buffer. And if required,
> this buffer is copied/converted to the writeback buffer format.
> 
> And to handle the pixel conversion, new functions were added to convert
> from a specific format to ARGB16161616 (the reciprocal is also true).
> 
> Tests
> =====
> This patch series was tested using the following igt tests:
> -t ".*kms_plane.*"
> -t ".*kms_writeback.*"
> -t ".*kms_cursor_crc*"
> -t ".*kms_flip.*"
> 
> New tests passing
> -------------------
> - pipe-A-cursor-size-change
> - pipe-A-cursor-alpha-transparent
> 
> Performance
> -----------
> Following some optimization proposed by Pekka Paalanen, now the code
> runs way faster than V1 and slightly faster than the current implementation.
> 
> |                          Frametime                          |
> |:---------------:|:---------:|:--------------:|:------------:|
> | implementation  |  Current  |  Per-pixel(V1) | Per-line(V2) |
> | frametime range |  8~22 ms  |    32~56 ms    |    6~19 ms   |
> |     Average     |  10.0 ms  |     35.8 ms    |    8.6 ms    |

Wow, that's much better than I expected.

What is your benchmark? That is, what program do you use and what
operations does it trigger to produce these measurements? What are the
sizes of all the planes/buffers involved? What kind of CPU was this run
on?


Thanks,
pq

> 
> Writeback test
> --------------
> During the development of this patch series, I discovered that the
> writeback-check-output test wasn't filling the plane correctly.
> 
> So, currently, this patch series is failing in this test. But I sent a
> patch to igt to fix it[1].
> 
> XRGB to ARGB behavior
> =====================
> During the development, I decided to always fill the alpha channel of
> the output pixel whenever the conversion from a format without an alpha
> channel to ARGB16161616 is necessary. Therefore, I ignore the value
> received from the XRGB and overwrite the value with 0xFFFF.
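
The alpha handling described above can be sketched as a standalone
conversion routine. This is an illustration under stated assumptions, not
the code from the series: `struct pixel_argb16` and
`xrgb8888_line_to_argb16()` are hypothetical names, the source is assumed to
be little-endian XRGB8888 (bytes B, G, R, X in memory), and the 8-bit to
16-bit widening by a factor of 257 is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* One 16-bit-per-channel pixel (hypothetical layout for this sketch). */
struct pixel_argb16 {
	uint16_t a, r, g, b;
};

/*
 * Convert `length` XRGB8888 pixels into 16-bit-per-channel pixels.
 * The X byte is ignored and alpha is always set to 0xFFFF (opaque).
 * Multiplying by 257 maps 0x00..0xFF onto 0x0000..0xFFFF exactly.
 */
static void xrgb8888_line_to_argb16(const uint8_t *src, size_t length,
				    struct pixel_argb16 *line_buffer)
{
	size_t i;

	for (i = 0; i < length; i++, src += 4) {
		line_buffer[i].a = 0xFFFF;
		line_buffer[i].r = (uint16_t)(src[2] * 257);
		line_buffer[i].g = (uint16_t)(src[1] * 257);
		line_buffer[i].b = (uint16_t)(src[0] * 257);
	}
}
```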
> 
> ---
> Igor Torrente (8):
>   drm: vkms: Replace the deprecated drm_mode_config_init
>   drm: vkms: Alloc the compose frame using vzalloc
>   drm: vkms: Replace hardcoded value of `vkms_composer.map` to
>     DRM_FORMAT_MAX_PLANES
>   drm: vkms: Add fb information to `vkms_writeback_job`
>   drm: drm_atomic_helper: Add a new helper to deal with the writeback
>     connector validation
>   drm: vkms: Refactor the plane composer to accept new formats
>   drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats
>   drm: vkms: Add support to the RGB565 format
> 
>  drivers/gpu/drm/drm_atomic_helper.c   |  47 ++++
>  drivers/gpu/drm/vkms/vkms_composer.c  | 329 +++++++++++++++-----------
>  drivers/gpu/drm/vkms/vkms_drv.c       |   6 +-
>  drivers/gpu/drm/vkms/vkms_drv.h       |  14 +-
>  drivers/gpu/drm/vkms/vkms_formats.h   | 252 ++++++++++++++++++++
>  drivers/gpu/drm/vkms/vkms_plane.c     |  17 +-
>  drivers/gpu/drm/vkms/vkms_writeback.c |  33 ++-
>  include/drm/drm_atomic_helper.h       |   3 +
>  8 files changed, 545 insertions(+), 156 deletions(-)
>  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> 


* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-10-26 11:34 ` [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats Igor Torrente
@ 2021-11-09 11:40   ` Pekka Paalanen
  2021-11-10 16:56     ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Pekka Paalanen @ 2021-11-09 11:40 UTC (permalink / raw)
  To: Igor Torrente
  Cc: hamohammed.sa, tzimmermann, rodrigosiqueiramelo, airlied,
	leandro.ribeiro, melissa.srw, dri-devel, kernel test robot

Hi Igor,

again, that is a really nice speed-up. Unfortunately, I find the code
rather messy and hard to follow. I hope my comments below help with
re-designing it to be easier to understand.


On Tue, 26 Oct 2021 08:34:06 -0300
Igor Torrente <igormtorrente@gmail.com> wrote:

> Currently the blend function only accepts XRGB_8888 and ARGB_8888
> as a color input.
> 
> This patch refactors all the functions related to the plane composition
> to overcome this limitation.
> 
> Now the blend function receives a struct `vkms_pixel_composition_functions`
> containing two handlers.
> 
> One will generate a buffer of each line of the frame with the pixels
> converted to ARGB16161616. And the other will take this line buffer,
> do some computation on it, and store the pixels in the destination.
> 
> Both the handlers have the same signature. They receive a pointer to
> the pixels that will be processed(`pixels_addr`), the number of pixels
> that will be treated(`length`), and the intermediate buffer of the size
> of a frame line (`line_buffer`).
> 
> The first function has been totally described previously.

What does this sentence mean?

> 
> The second is more interesting, as it has to perform two roles depending
> on where it is called in the code.
> 
> The first is to convert(if necessary) the data received in the
> `line_buffer` and write in the memory pointed by `pixels_addr`.
> 
> The second role is to perform the `alpha_blend`. So, it takes the pixels
> in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> the result back to the `pixels_addr`.
> 
> The per-line implementation was chosen for performance reasons.
> The per-pixel functions were having performance issues due to indirect
> function call overhead.
> 
> The per-line code trades off memory for execution time. The `line_buffer`
> allows us to diminish the number of function calls.
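
The trade-off described above can be made concrete with a small sketch that
counts indirect calls. It is not taken from the patch: the function names,
the single 8-bit channel, and the widening by 257 are all assumptions chosen
to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*pixel_fn)(uint8_t px);
typedef void (*line_fn)(const uint8_t *src, size_t length,
			uint16_t *line_buffer);

/* Per-pixel handler: widens one 8-bit value to 16 bits. */
static uint16_t widen_pixel(uint8_t px)
{
	return (uint16_t)(px * 257);
}

/* Per-line handler: the same conversion, looping inside the callee. */
static void widen_line(const uint8_t *src, size_t length,
		       uint16_t *line_buffer)
{
	size_t x;

	for (x = 0; x < length; x++)
		line_buffer[x] = (uint16_t)(src[x] * 257);
}

/* One indirect call per pixel; returns how many calls were made. */
static size_t convert_per_pixel(pixel_fn fn, const uint8_t *src,
				size_t width, size_t height, uint16_t *dst)
{
	size_t calls = 0, i;

	for (i = 0; i < width * height; i++) {
		dst[i] = fn(src[i]);
		calls++;
	}

	return calls;
}

/* One indirect call per line; returns how many calls were made. */
static size_t convert_per_line(line_fn fn, const uint8_t *src,
			       size_t width, size_t height, uint16_t *dst)
{
	size_t calls = 0, y;

	for (y = 0; y < height; y++) {
		fn(src + y * width, width, dst + y * width);
		calls++;
	}

	return calls;
}
```

Both variants produce the same output; the per-line one makes `height`
indirect calls instead of `width * height`, at the cost of needing a buffer
that holds a whole line.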
> 
> Results in the IGT test `kms_cursor_crc`:
> 
> |                     Frametime                       |
> |:---------------:|:---------:|:----------:|:--------:|
> | implementation  |  Current  |  Per-pixel | Per-line |
> | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> ---
> V2: Improves the performance drastically, by performing the operations
>     per-line and not per-pixel(Pekka Paalanen).
>     Minor improvements(Pekka Paalanen).
> ---
>  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
>  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
>  2 files changed, 342 insertions(+), 134 deletions(-)
>  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> 
> diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> index 383ca657ddf7..69fe3a89bdc9 100644
> --- a/drivers/gpu/drm/vkms/vkms_composer.c
> +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> @@ -9,18 +9,26 @@
>  #include <drm/drm_vblank.h>
>  
>  #include "vkms_drv.h"
> -
> -static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> -				 const struct vkms_composer *composer)
> -{
> -	u32 pixel;
> -	int src_offset = composer->offset + (y * composer->pitch)
> -				      + (x * composer->cpp);
> -
> -	pixel = *(u32 *)&buffer[src_offset];
> -
> -	return pixel;
> -}
> +#include "vkms_formats.h"
> +
> +#define get_output_vkms_composer(buffer_pointer, composer)		\
> +	((struct vkms_composer) {					\
> +		.fb = &(struct drm_framebuffer) {			\
> +			.format = &(struct drm_format_info) {		\
> +				.format = DRM_FORMAT_ARGB16161616,	\
> +			},						\

Is that really how one can initialize a drm_format_info? Does that
struct not have a lot more fields? Shouldn't you call a function to
look up the proper struct with all fields populated?

> +		},							\
> +		.map[0].vaddr = (buffer_pointer),			\
> +		.src = (composer)->src,					\
> +		.dst = (composer)->dst,					\
> +		.cpp = sizeof(u64),					\
> +		.pitch = drm_rect_width(&(composer)->dst) * sizeof(u64)	\
> +	})

Why is this a macro rather than a function?

> +
> +struct vkms_pixel_composition_functions {
> +	void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> +	void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);

I would be a little more comfortable if instead of u64 *line_buffer you
would have something like

struct line_buffer {
	u16 *row;
	size_t nelem;
}

so that the functions to be plugged into these function pointers could
assert that you do not accidentally overflow the array (which would
imply a code bug in kernel).

One could perhaps go even for:

struct line_pixel {
	u16 r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

Because as I mention further down, there is no need for the line buffer
to use an existing DRM pixel format at all.
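
To illustrate the bounds check this makes possible, here is a minimal
userspace sketch. The loader name load_xrgb8888_row is hypothetical, the
fixed-width typedefs stand in for the kernel's u8/u16, and assert() stands
in for whatever WARN_ON-style check the driver would actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u8;
typedef uint16_t u16;

struct line_pixel {
	u16 r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

/*
 * Hypothetical loader: expand one XRGB8888 row (bytes B, G, R, X in memory)
 * into the internal line buffer. The X byte is ignored and alpha is opaque.
 */
static void load_xrgb8888_row(const u8 *src, size_t length,
			      struct line_buffer *lb)
{
	size_t i;

	/* Userspace stand-in for a kernel-side overflow check. */
	assert(length <= lb->npixels);

	for (i = 0; i < length; i++) {
		/* 257 = 0xffff / 0xff maps 8-bit values onto the 16-bit range. */
		lb->row[i].b = src[4 * i + 0] * 257;
		lb->row[i].g = src[4 * i + 1] * 257;
		lb->row[i].r = src[4 * i + 2] * 257;
		lb->row[i].a = 0xffff;
	}
}
```

Because the buffer carries its own length, a loader handed a row that is
too long trips the check instead of silently writing past the array.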

> +};
>  
>  /**
>   * compute_crc - Compute CRC value on output frame
> @@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
>   * returns CRC value computed using crc32 on the visible portion of
>   * the final framebuffer at vaddr_out
>   */
> -static uint32_t compute_crc(const u8 *vaddr,
> +static uint32_t compute_crc(const __le64 *vaddr,
>  			    const struct vkms_composer *composer)
>  {
> -	int x, y;
> -	u32 crc = 0, pixel = 0;
> -	int x_src = composer->src.x1 >> 16;
> -	int y_src = composer->src.y1 >> 16;
> -	int h_src = drm_rect_height(&composer->src) >> 16;
> -	int w_src = drm_rect_width(&composer->src) >> 16;
> -
> -	for (y = y_src; y < y_src + h_src; ++y) {
> -		for (x = x_src; x < x_src + w_src; ++x) {
> -			pixel = get_pixel_from_buffer(x, y, vaddr, composer);
> -			crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
> -		}
> -	}
> +	int h = drm_rect_height(&composer->dst);
> +	int w = drm_rect_width(&composer->dst);
>  
> -	return crc;
> +	return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
>  }
>  
> -static u8 blend_channel(u8 src, u8 dst, u8 alpha)
> +static __le16 blend_channel(u16 src, u16 dst, u16 alpha)

This function is doing the OVER operation (Porter-Duff classification)
assuming pre-multiplied alpha. I think the function name should reflect
that. At the very least it should somehow note pre-multiplied alpha,
because KMS property "pixel blend mode" can change that.

'alpha' should be named 'src_alpha'.

>  {
> -	u32 pre_blend;
> -	u8 new_color;
> +	u64 pre_blend;

I'm not quite sure if u32 would suffice... max value for src is
0xffff * src_alpha / 0xffff = src_alpha. Max value for dst is 0xffff.

So we have at max

src_alpha * 0xffff + 0xffff * (0xffff - src_alpha)

Each multiplication independently will fit in u32.

Rearranging we get

src_alpha * 0xffff + 0xffff * 0xffff - 0xffff * src_alpha

which equals

0xffff * 0xffff

which fits in u32 and does not depend on src_alpha.

So unless I made a mistake, looks like u32 should be enough. On 32-bit
CPUs it should have speed benefits compared to u64.
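
The same bound can be checked by brute force. The sketch below is plain
userspace C, not the patch's code: blend_channel_premult and max_blend_sum
are names made up for illustration, and the assert encodes the
pre-multiplied assumption (src <= src_alpha) that the bound depends on.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* Pre-multiplied OVER for one 16-bit channel, using only u32 arithmetic. */
static u16 blend_channel_premult(u16 src, u16 dst, u16 src_alpha)
{
	u32 sum;

	/* Pre-multiplied input: a colour channel never exceeds its alpha. */
	assert(src <= src_alpha);

	/* The casts keep the products in unsigned 32-bit, not promoted int. */
	sum = (u32)src * 0xffff + (u32)dst * (0xffff - src_alpha);

	/* Divide by 0xffff, rounding up. */
	return (u16)((sum + 0xffff - 1) / 0xffff);
}

/* Largest intermediate sum over every alpha, with src = alpha, dst = 0xffff. */
static u32 max_blend_sum(void)
{
	u32 alpha, sum, max = 0;

	for (alpha = 0; alpha <= 0xffff; alpha++) {
		sum = alpha * 0xffffu + 0xffffu * (0xffffu - alpha);
		if (sum > max)
			max = sum;
	}
	return max;
}
```

The maximum comes out as 0xffff * 0xffff = 0xfffe0001 for every alpha, and
adding the 0xfffe rounding term still stays below 2^32.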

> +	u16 new_color;
>  
> -	pre_blend = (src * 255 + dst * (255 - alpha));
> +	pre_blend = (src * 0xffff + dst * (0xffff - alpha));

'pre_blend' means "before blending", but the blending is already done at
this point, so a better name would help here.

>  
> -	/* Faster div by 255 */
> -	new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
> +	new_color = DIV_ROUND_UP(pre_blend, 0xffff);
>  
> -	return new_color;
> +	return cpu_to_le16(new_color);

What's the thing with cpu_to_le16 here?

I think the temporary line buffers could just be using the cpu-native
u16 type. There is no DRM format code for that, but we don't need one
either. This format is not for interoperation with anything else, it's
just internal here, and the main goals with it are precision and speed.

As such, the temporary line buffers could be simply u16 arrays, so you
don't need to consider the channel packing into a u64.

>  }
>  



From here on, I will be removing the diff minus lines from the quoted
code, because these functions are completely new.

>  /**
>   * alpha_blend - alpha blending equation

This is specifically the pre-multiplied alpha blending, so reflect that
in the function name.

> + * @src_composer: source framebuffer's metadata
> + * @dst_composer: destination framebuffer's metadata
> + * @y: The y coodinate(heigth) of the line that will be processed
> + * @line_buffer: The line with the pixels from src_compositor
>   *
>   * blend pixels using premultiplied blend formula. The current DRM assumption
>   * is that pixel color values have been already pre-multiplied with the alpha
>   * channel values. See more drm_plane_create_blend_mode_property(). Also, this
>   * formula assumes a completely opaque background.
> + *
> + * For performance reasons this function also fetches the pixels from the
> + * destination of the frame line y.
> + * We use the information that one of the source pixels are in the output
> + * buffer to fetch it here instead of separate function. And because the
> + * output format is ARGB16161616, we know that they don't need to be
> + * converted.
> + * This save us a indirect function call for each line.

I think this paragraph should be obvious from the type of 'line_buffer'
parameter and that you are blending src into dst.

>   */
> +static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer)
>  {
> +	__le16 *output_pixel = pixels_addr;

Aren't you supposed to be writing into line_buffer, not into src?

There is something very strange with the logic here.

In fact, the function signature of the blending function is unexpected.
A blending function should operate on two line_buffers, not what looks
like arbitrary buffer pixels.

I think you should forget the old code and design these from scratch.
You would have three different kinds of functions:

- loading: fetch a row from an image and convert into a line buffer
- blending: take two line buffers and blend them into one of the line
  buffers
- storing: convert a line buffer and write it into an image row

I would not coerce these three different operations into less than
three function pointer types.

To actually run a blending operation between source and destination
images, you would need four function pointers:
- loader for source (by pixel format)
- loader for destination (by pixel format)
- blender (by chosen blending operation)
- storing for destination (by pixel format)

Function parameter types should make it obvious whether something is an
image or row in arbitrary format, or a line buffer in the special
internal format.

Then the algorithm would work roughly like this:

for each plane:
	for each row:
		load source into lb1
		load destination into lb2
		blend lb1 into lb2
		store lb2 into destination

This is not optimal, you see how destination is repeatedly loaded and
stored for each plane. So you could swap the loops:

allocate lb1, lb2 with destination width
for each destination row:
	load destination into lb2

	for each plane:
		load source into lb1
		blend lb1 into lb2

	store lb2 into destination

Inside the loop over plane, you need to check if the plane overlaps the
current destination row at all. If not, continue on the next plane. If
yes, load source into lb1 and compute the offset into lb2 where it
needs to be blended.

Since we don't support scaling yet, lb1 length will never exceed
destination width, because there is no need to load plane buffer pixels
we would not be writing out.

Also "load destination into lb2" could be replaced with just "clear
lb2" if the old destination contents are to be discarded. Then you also
don't need the function pointer for "loader for destination".
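
A rough userspace sketch of the swapped loops follows. Everything in it is
an assumption made for illustration: the struct plane type, the function
names, planes that are already in the internal format and lie fully inside
the destination, and the "clear lb2" variant instead of loading the
destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t u16;
typedef uint32_t u32;

struct line_pixel {
	u16 r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

/* Hypothetical plane: pixels in the internal format, placed at (x, y). */
struct plane {
	const struct line_pixel *pixels;
	int x, y, w, h;
};

/* Pre-multiplied OVER for one channel (assumes src <= src_alpha). */
static u16 over(u16 src, u16 dst, u16 src_alpha)
{
	u32 sum = (u32)src * 0xffff + (u32)dst * (0xffff - src_alpha);

	return (u16)((sum + 0xfffe) / 0xffff);
}

/* Loading: fetch one plane row into a line buffer. */
static void load_row(const struct plane *p, int row, struct line_buffer *lb)
{
	assert((size_t)p->w <= lb->npixels);
	memcpy(lb->row, p->pixels + (size_t)row * p->w,
	       (size_t)p->w * sizeof(*lb->row));
}

/* Blending: blend n pixels of src into dst, starting at dst pixel offset. */
static void blend_row(const struct line_buffer *src, size_t n,
		      struct line_buffer *dst, size_t offset)
{
	size_t i;

	assert(n <= src->npixels && offset + n <= dst->npixels);
	for (i = 0; i < n; i++) {
		const struct line_pixel *s = &src->row[i];
		struct line_pixel *d = &dst->row[offset + i];

		d->r = over(s->r, d->r, s->a);
		d->g = over(s->g, d->g, s->a);
		d->b = over(s->b, d->b, s->a);
		d->a = 0xffff;
	}
}

/* Compose every plane into out, a dst_w x dst_h image in internal format. */
static void compose(const struct plane *planes, size_t nplanes,
		    struct line_buffer *lb1, struct line_buffer *lb2,
		    struct line_pixel *out, int dst_w, int dst_h)
{
	size_t i;
	int y;

	assert((size_t)dst_w <= lb2->npixels);
	for (y = 0; y < dst_h; y++) {
		/* "clear lb2": the old destination contents are discarded. */
		memset(lb2->row, 0, (size_t)dst_w * sizeof(*lb2->row));

		for (i = 0; i < nplanes; i++) {
			const struct plane *p = &planes[i];

			/* Skip planes that do not overlap this destination row. */
			if (y < p->y || y >= p->y + p->h)
				continue;

			load_row(p, y - p->y, lb1);
			blend_row(lb1, (size_t)p->w, lb2, (size_t)p->x);
		}

		/* Storing: a plain copy here, since out uses the same layout. */
		memcpy(out + (size_t)y * dst_w, lb2->row,
		       (size_t)dst_w * sizeof(*out));
	}
}
```

Each destination row is written exactly once, and only two line buffers of
destination width are needed regardless of the number of planes.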

I think you already had all these ideas, just the execution in code got
really messy somehow.

> +	int i;
>  
> +	for (i = 0; i < length; i++) {
> +		u16 src1_a = line_buffer[i] >> 48;
> +		u16 src1_r = (line_buffer[i] >> 32) & 0xffff;
> +		u16 src1_g = (line_buffer[i] >> 16) & 0xffff;
> +		u16 src1_b = line_buffer[i] & 0xffff;

If you used native u16 array for line buffers, all this arithmetic
would be unnecessary.

>  
> +		u16 src2_r = le16_to_cpu(output_pixel[2]);
> +		u16 src2_g = le16_to_cpu(output_pixel[1]);
> +		u16 src2_b = le16_to_cpu(output_pixel[0]);
> +
> +		output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
> +		output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
> +		output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
> +		output_pixel[3] = 0xffff;
> +
> +		output_pixel += 4;
> +	}
>  }
>  
>  /**
>   * @src_composer: source framebuffer's metadata
> + * @dst_composer: destiny framebuffer's metadata
> + * @funcs: A struct containing all the composition functions(get_src_line,
> + *         and set_output_pixel)
> + * @line_buffer: The line with the pixels from src_compositor
>   *
> + * Using the pixel_blend function passed as parameter, this function blends
> + * all pixels from src plane into a output buffer (with a blend function
> + * passed as parameter).
> + * Information of the output buffer is in the dst_composer parameter
> + * and the source plane in the src_composer.
> + * The get_src_line will use the src_composer to get the respective line,
> + * convert, and return it as ARGB_16161616.
> + * And finally, the blend function will receive the dst_composer, dst_composer,
> + * the line y coodinate, and the line buffer. Blend all pixels, and store the
> + * result in the output.
>   *
>   * TODO: completely clear the primary plane (a = 0xff) before starting to blend
>   * pixel color values
>   */
> +static void blend(struct vkms_composer *src_composer,
>  		  struct vkms_composer *dst_composer,
> +		  struct vkms_pixel_composition_functions *funcs,
> +		  u64 *line_buffer)
>  {
> +	int i, i_dst;
>  
>  	int x_src = src_composer->src.x1 >> 16;
>  	int y_src = src_composer->src.y1 >> 16;
>  
>  	int x_dst = src_composer->dst.x1;
>  	int y_dst = src_composer->dst.y1;
> +
>  	int h_dst = drm_rect_height(&src_composer->dst);
> +	int length = drm_rect_width(&src_composer->dst);
>  
>  	int y_limit = y_src + h_dst;
> +
> +	u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src);
> +	u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst);
> +
> +	int src_next_line_offset = src_composer->pitch;
> +	int dst_next_line_offset = dst_composer->pitch;
> +
> +	for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
> +		funcs->get_src_line(src_pixels, length, line_buffer);
> +		funcs->set_output_line(dst_pixels, length, line_buffer);
> +		src_pixels += src_next_line_offset;
> +		dst_pixels += dst_next_line_offset;
>  	}
>  }
>  
> +static void ((*get_line_fmt_transform_function(u32 format))
> +	    (void *pixels_addr, int length, u64 *line_buffer))
>  {
> +	if (format == DRM_FORMAT_ARGB8888)
> +		return &ARGB8888_to_ARGB16161616;
> +	else if (format == DRM_FORMAT_ARGB16161616)
> +		return &get_ARGB16161616;
> +	else
> +		return &XRGB8888_to_ARGB16161616;
> +}
>  
> +static void ((*get_output_line_function(u32 format))
> +	     (void *pixels_addr, int length, u64 *line_buffer))
> +{
> +	if (format == DRM_FORMAT_ARGB8888)
> +		return &convert_to_ARGB8888;
> +	else if (format == DRM_FORMAT_ARGB16161616)
> +		return &convert_to_ARGB16161616;
> +	else
> +		return &convert_to_XRGB8888;
> +}
>  
> +static void compose_plane(struct vkms_composer *src_composer,
> +			  struct vkms_composer *dst_composer,

I'm confused by the vkms_composer concept. If there is a separate thing
for source and destination and they are used together, then I don't
think that thing is a "composer" but some kind of... image structure?
"Composer" is what compose_active_planes() does.

> +			  struct vkms_pixel_composition_functions *funcs,
> +			  u64 *line_buffer)
> +{
> +	u32 src_format = src_composer->fb->format->format;
>  
> +	funcs->get_src_line = get_line_fmt_transform_function(src_format);
>  
> +	blend(src_composer, dst_composer, funcs, line_buffer);

This function is confusing. You get 'funcs' as argument, but you
overwrite one field and then trust that the other field was already set
by the caller. The policy of how 'funcs' argument here works is too
complicated to me.

If you need just one function pointer as argument, then do exactly
that, and construct the vfunc struct inside this function.

>  }
>  
> +static __le64 *compose_active_planes(struct vkms_composer *primary_composer,
> +				     struct vkms_crtc_state *crtc_state,
> +				     u64 *line_buffer)
>  {
> +	struct vkms_plane_state **active_planes = crtc_state->active_planes;
> +	int h = drm_rect_height(&primary_composer->dst);
> +	int w = drm_rect_width(&primary_composer->dst);
> +	struct vkms_pixel_composition_functions funcs;
> +	struct vkms_composer dst_composer;
> +	__le64 *vaddr_out;
>  	int i;
>  
>  	if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
> +		return NULL;
>  
> +	vaddr_out = kvzalloc(w * h * sizeof(__le64), GFP_KERNEL);

Why allocate a full size image here in the compositing function?

You should be able to do with just few line buffers instead.

> +	if (!vaddr_out) {
> +		DRM_ERROR("Cannot allocate memory for output frame.");
> +		return NULL;
> +	}
>  
> +	dst_composer = get_output_vkms_composer(vaddr_out, primary_composer);
> +	funcs.set_output_line = get_output_line_function(DRM_FORMAT_ARGB16161616);
> +	compose_plane(active_planes[0]->composer, &dst_composer,
> +		      &funcs, line_buffer);
>  
>  	/* If there are other planes besides primary, we consider the active
>  	 * planes should be in z-order and compose them associatively:
>  	 * ((primary <- overlay) <- cursor)
>  	 */
> +	funcs.set_output_line = alpha_blend;
>  	for (i = 1; i < crtc_state->num_active_planes; i++)
> +		compose_plane(active_planes[i]->composer, &dst_composer,
> +			      &funcs, line_buffer);
>  
> +	return vaddr_out;
> +}
> +
> +static void write_wb_buffer(struct vkms_writeback_job *active_wb,
> +			    struct vkms_composer *primary_composer,
> +			    __le64 *vaddr_out, u64 *line_buffer)
> +{
> +	u32 dst_fb_format = active_wb->composer.fb->format->format;
> +	struct vkms_pixel_composition_functions funcs;
> +	struct vkms_composer src_composer;
> +
> +	src_composer = get_output_vkms_composer(vaddr_out, primary_composer);
> +	funcs.set_output_line = get_output_line_function(dst_fb_format);
> +	active_wb->composer.src = primary_composer->src;
> +	active_wb->composer.dst = primary_composer->dst;
> +
> +	compose_plane(&src_composer, &active_wb->composer, &funcs, line_buffer);
> +}
> +
> +u64 *alloc_line_buffer(struct vkms_composer *primary_composer)
> +{
> +	int line_width = drm_rect_width(&primary_composer->dst);
> +	u64 *line_buffer;
> +
> +	line_buffer = kvmalloc(line_width * sizeof(u64), GFP_KERNEL);
> +	if (!line_buffer)
> +		DRM_ERROR("Cannot allocate memory for intermediate line buffer");
> +
> +	return line_buffer;
>  }
>  
>  /**
> @@ -221,14 +272,14 @@ void vkms_composer_worker(struct work_struct *work)
>  						struct vkms_crtc_state,
>  						composer_work);
>  	struct drm_crtc *crtc = crtc_state->base.crtc;
> +	struct vkms_writeback_job *active_wb = crtc_state->active_writeback;
>  	struct vkms_output *out = drm_crtc_to_vkms_output(crtc);
>  	struct vkms_composer *primary_composer = NULL;
>  	struct vkms_plane_state *act_plane = NULL;
> +	u64 frame_start, frame_end, *line_buffer;
>  	bool crc_pending, wb_pending;
> -	void *vaddr_out = NULL;
> +	__le64 *vaddr_out = NULL;
>  	u32 crc32 = 0;
> -	u64 frame_start, frame_end;
> -	int ret;
>  
>  	spin_lock_irq(&out->composer_lock);
>  	frame_start = crtc_state->frame_start;
> @@ -256,28 +307,30 @@ void vkms_composer_worker(struct work_struct *work)
>  	if (!primary_composer)
>  		return;
>  
> -	if (wb_pending)
> -		vaddr_out = crtc_state->active_writeback->data[0].vaddr;
> +	line_buffer = alloc_line_buffer(primary_composer);
> +	if (!line_buffer)
> +		return;
>  
> -	ret = compose_active_planes(&vaddr_out, primary_composer,
> -				    crtc_state);
> -	if (ret) {
> -		if (ret == -EINVAL && !wb_pending)
> -			kvfree(vaddr_out);
> +	vaddr_out = compose_active_planes(primary_composer, crtc_state,
> +					  line_buffer);
> +	if (!vaddr_out) {
> +		kvfree(line_buffer);
>  		return;
>  	}
>  
> -	crc32 = compute_crc(vaddr_out, primary_composer);
> -
>  	if (wb_pending) {
> +		write_wb_buffer(active_wb, primary_composer,
> +				vaddr_out, line_buffer);
>  		drm_writeback_signal_completion(&out->wb_connector, 0);
>  		spin_lock_irq(&out->composer_lock);
>  		crtc_state->wb_pending = false;
>  		spin_unlock_irq(&out->composer_lock);
> -	} else {
> -		kvfree(vaddr_out);
>  	}
>  
> +	kvfree(line_buffer);
> +	crc32 = compute_crc(vaddr_out, primary_composer);
> +	kvfree(vaddr_out);
> +
>  	/*
>  	 * The worker can fall behind the vblank hrtimer, make sure we catch up.
>  	 */
> diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
> new file mode 100644
> index 000000000000..5b850fce69f3
> --- /dev/null
> +++ b/drivers/gpu/drm/vkms/vkms_formats.h
> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef _VKMS_FORMATS_H_
> +#define _VKMS_FORMATS_H_
> +
> +#include <drm/drm_rect.h>
> +
> +#define pixel_offset(composer, x, y) \
> +	((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))

Why macro instead of a static inline function?
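
For comparison, a static inline version could look like the sketch below.
struct buffer_layout is a hypothetical stand-in carrying only the three
fields the macro reads, so the snippet compiles on its own outside the
kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of vkms_composer used by the macro. */
struct buffer_layout {
	unsigned int offset;	/* byte offset of the first pixel */
	unsigned int pitch;	/* bytes per row */
	unsigned int cpp;	/* bytes per pixel */
};

/* Byte offset of pixel (x, y); arguments are type-checked and evaluated once. */
static inline size_t pixel_offset(const struct buffer_layout *layout,
				  int x, int y)
{
	assert(x >= 0 && y >= 0);

	return layout->offset + (size_t)y * layout->pitch +
	       (size_t)x * layout->cpp;
}
```

Unlike the macro, the function gives the compiler real parameter types to
check and cannot evaluate an argument with side effects more than once.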

> +
> +/*
> + * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
> + *
> + * @composer: Buffer metadata
> + * @x: The x(width) coordinate of the 2D buffer
> + * @y: The y(Heigth) coordinate of the 2D buffer
> + *
> + * Takes the information stored in the composer, a pair of coordinates, and
> + * returns the address of the first color channel.
> + * This function assumes the channels are packed together, i.e. a color channel
> + * comes immediately after another. And therefore, this function doesn't work
> + * for YUV with chroma subsampling (e.g. YUV420 and NV21).
> + */
> +static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)

Is it normal in the kernel to have non-inline functions in headers?

Actually this file does not look like a header at all, it should
probably be a .c file and not #included.

> +{
> +	int offset = pixel_offset(composer, x, y);
> +
> +	return (u8 *)composer->map[0].vaddr + offset;
> +}
> +
> +static void ARGB8888_to_ARGB16161616(void *pixels_addr, int length,
> +				     u64 *line_buffer)
> +{
> +	u8 *src_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +		/*
> +		 * Organizes the channels in their respective positions and converts
> +		 * the 8 bits channel to 16.
> +		 * The 257 is the "conversion ratio". This number is obtained by the
> +		 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
> +		 * the best color value in a pixel format with more possibilities.
> +		 * And a similar idea applies to others RGB color conversions.
> +		 */
> +		line_buffer[i] = ((u64)src_pixels[3] * 257) << 48 |
> +				 ((u64)src_pixels[2] * 257) << 32 |
> +				 ((u64)src_pixels[1] * 257) << 16 |
> +				 ((u64)src_pixels[0] * 257);
> +
> +		src_pixels += 4;
> +	}
> +}
> +
> +static void XRGB8888_to_ARGB16161616(void *pixels_addr, int length,
> +				     u64 *line_buffer)
> +{
> +	u8 *src_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +		/*
> +		 * The same as the ARGB8888 but with the alpha channel as the
> +		 * maximum value as possible.
> +		 */
> +		line_buffer[i] = 0xffffllu << 48 |
> +				 ((u64)src_pixels[2] * 257) << 32 |
> +				 ((u64)src_pixels[1] * 257) << 16 |
> +				 ((u64)src_pixels[0] * 257);
> +
> +		src_pixels += 4;
> +	}
> +}
> +
> +static void get_ARGB16161616(void *pixels_addr, int length, u64 *line_buffer)
> +{
> +	__le64 *src_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +		/*
> +		 * Because the format byte order is in little-endian and this code
> +		 * needs to run on big-endian machines too, we need modify
> +		 * the byte order from little-endian to the CPU native byte order.
> +		 */
> +		line_buffer[i] = le64_to_cpu(*src_pixels);
> +
> +		src_pixels++;
> +	}
> +}
> +
> +/*
> + * The following functions are used as blend operations. But unlike the
> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
> + * source, convert it to a specific format, and store it in the destination.

This is a surprising trick I don't like. Blending operation and storing
operation are fundamentally different. Once you have more obvious
function signatures, this trick is not possible anymore.


Thanks,
pq

> + *
> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
> + * copy and convert one line of the frame from/to the output buffer to/from
> + * another buffer (e.g. writeback buffer, primary plane buffer).
> + */
> +
> +static void convert_to_ARGB8888(void *pixels_addr, int length, u64 *line_buffer)
> +{
> +	u8 *dst_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +		/*
> +		 * This sequence below is important because the format's byte order is
> +		 * in little-endian. In the case of the ARGB8888 the memory is
> +		 * organized this way:
> +		 *
> +		 * | Addr     | = blue channel
> +		 * | Addr + 1 | = green channel
> +		 * | Addr + 2 | = Red channel
> +		 * | Addr + 3 | = Alpha channel
> +		 */
> +		dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
> +		dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
> +		dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
> +		dst_pixels[3] = DIV_ROUND_UP(line_buffer[i] >> 48, 257);
> +
> +		dst_pixels += 4;
> +	}
> +}
> +
> +static void convert_to_XRGB8888(void *pixels_addr, int length, u64 *line_buffer)
> +{
> +	u8 *dst_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +		dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
> +		dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
> +		dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
> +		dst_pixels[3] = 0xff;
> +
> +		dst_pixels += 4;
> +	}
> +}
> +
> +static void convert_to_ARGB16161616(void *pixels_addr, int length,
> +				    u64 *line_buffer)
> +{
> +	__le64 *dst_pixels = pixels_addr;
> +	int i;
> +
> +	for (i = 0; i < length; i++) {
> +
> +		*dst_pixels = cpu_to_le64(line_buffer[i]);
> +		dst_pixels++;
> +	}
> +}
> +
> +#endif /* _VKMS_FORMATS_H_ */



* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-11-09 11:40   ` Pekka Paalanen
@ 2021-11-10 16:56     ` Igor Torrente
  2021-11-11  9:33       ` Pekka Paalanen
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-11-10 16:56 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel, kernel test robot

On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> Hi Igor,
>
> again, that is a really nice speed-up. Unfortunately, I find the code
> rather messy and hard to follow. I hope my comments below help with
> re-designing it to be easier to understand.
>
>
> On Tue, 26 Oct 2021 08:34:06 -0300
> Igor Torrente <igormtorrente@gmail.com> wrote:
>
> > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > as a color input.
> >
> > This patch refactors all the functions related to the plane composition
> > to overcome this limitation.
> >
> > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > containing two handlers.
> >
> > One will generate a buffer of each line of the frame with the pixels
> > converted to ARGB16161616. And the other will take this line buffer,
> > do some computation on it, and store the pixels in the destination.
> >
> > Both the handlers have the same signature. They receive a pointer to
> > the pixels that will be processed(`pixels_addr`), the number of pixels
> > that will be treated(`length`), and the intermediate buffer of the size
> > of a frame line (`line_buffer`).
> >
> > The first function has been totally described previously.
>
> What does this sentence mean?

The sentence starting with "One will generate..." gives an overview of the
two types of handlers, and that overview already describes the first
handler's behavior in full.

But it isn't clear enough as written; I will improve it.

>
> >
> > The second is more interesting, as it has to perform two roles depending
> > on where it is called in the code.
> >
> > The first is to convert(if necessary) the data received in the
> > `line_buffer` and write in the memory pointed by `pixels_addr`.
> >
> > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > the result back to the `pixels_addr`.
> >
> > The per-line implementation was chosen for performance reasons.
> > The per-pixel functions were having performance issues due to indirect
> > function call overhead.
> >
> > The per-line code trades off memory for execution time. The `line_buffer`
> > allows us to diminish the number of function calls.
> >
> > Results in the IGT test `kms_cursor_crc`:
> >
> > |                     Frametime                       |
> > |:---------------:|:---------:|:----------:|:--------:|
> > | implementation  |  Current  |  Per-pixel | Per-line |
> > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> >
> > Reported-by: kernel test robot <lkp@intel.com>
> > Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> > ---
> > V2: Improves the performance drastically, by performing the operations
> >     per-line and not per-pixel(Pekka Paalanen).
> >     Minor improvements(Pekka Paalanen).
> > ---
> >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> >  2 files changed, 342 insertions(+), 134 deletions(-)
> >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> >
> > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > index 383ca657ddf7..69fe3a89bdc9 100644
> > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> > @@ -9,18 +9,26 @@
> >  #include <drm/drm_vblank.h>
> >
> >  #include "vkms_drv.h"
> > -
> > -static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> > -                              const struct vkms_composer *composer)
> > -{
> > -     u32 pixel;
> > -     int src_offset = composer->offset + (y * composer->pitch)
> > -                                   + (x * composer->cpp);
> > -
> > -     pixel = *(u32 *)&buffer[src_offset];
> > -
> > -     return pixel;
> > -}
> > +#include "vkms_formats.h"
> > +
> > +#define get_output_vkms_composer(buffer_pointer, composer)           \
> > +     ((struct vkms_composer) {                                       \
> > +             .fb = &(struct drm_framebuffer) {                       \
> > +                     .format = &(struct drm_format_info) {           \
> > +                             .format = DRM_FORMAT_ARGB16161616,      \
> > +                     },                                              \
>
> Is that really how one can initialize a drm_format_info? Does that
> struct not have a lot more fields? Shouldn't you call a function to
> look up the proper struct with all fields populated?

I wrote this macro to fill only the necessary fields, adding more of them
as needed.

I was implementing something very similar to the algorithm that
you described below. So this macro will not exist in the next version.

>
> > +             },                                                      \
> > +             .map[0].vaddr = (buffer_pointer),                       \
> > +             .src = (composer)->src,                                 \
> > +             .dst = (composer)->dst,                                 \
> > +             .cpp = sizeof(u64),                                     \
> > +             .pitch = drm_rect_width(&(composer)->dst) * sizeof(u64) \
> > +     })
>
> Why is this a macro rather than a function?

I don't have a good answer for that. I'm just more used to doing this kind
of initialization with a macro instead of a function.

>
> > +
> > +struct vkms_pixel_composition_functions {
> > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);
>
> I would be a little more comfortable if instead of u64 *line_buffer you
> would have something like
>
> struct line_buffer {
>         u16 *row;
>         size_t nelem;
> }
>
> so that the functions to be plugged into these function pointers could
> assert that you do not accidentally overflow the array (which would
> imply a code bug in kernel).
>
> One could perhaps go even for:
>
> struct line_pixel {
>         u16 r, g, b, a;
> };
>
> struct line_buffer {
>         struct line_pixel *row;
>         size_t npixels;
> };

If we decide to follow this representation, would it be possible
to calculate the CRC in a similar way to what is being done currently?

Something like this:

crc = crc32_le(crc, line_buffer.row, w * sizeof(struct line_pixel));

I mean, if the compiler decides to insert padding somewhere, it
would mess with the CRC value, right?
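
One way to turn the padding worry into a build-time check is sketched
below, in userspace C. The crc32_le_bitwise helper is only a stand-in
written for this example (reflected polynomial 0xedb88320, no pre or post
inversion), not the kernel's implementation, and crc_rows is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

struct line_pixel {
	u16 r, g, b, a;
};

/* The build fails if the compiler inserted padding into the struct. */
static_assert(sizeof(struct line_pixel) == 4 * sizeof(u16),
	      "struct line_pixel must not contain padding");

/* Stand-in bitwise CRC-32, written only for this example. */
static u32 crc32_le_bitwise(u32 crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	int bit;

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
	}
	return crc;
}

/* Feed an image to the CRC one row at a time, chaining the running value. */
static u32 crc_rows(const struct line_pixel *pixels, size_t w, size_t h)
{
	u32 crc = 0;
	size_t y;

	for (y = 0; y < h; y++)
		crc = crc32_le_bitwise(crc, pixels + y * w,
				       w * sizeof(*pixels));
	return crc;
}
```

Chaining the running value row by row gives the same result as one call
over the whole contiguous image. Note that the bytes fed to the CRC are in
CPU-native order here, so the value would differ between little-endian and
big-endian hosts.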

>
> Because as I mention further down, there is no need for the line buffer
> to use an existing DRM pixel format at all.
>

All this is fine for me. I will change that in the next patch version.

> > +};
> >
> >  /**
> >   * compute_crc - Compute CRC value on output frame
> > @@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> >   * returns CRC value computed using crc32 on the visible portion of
> >   * the final framebuffer at vaddr_out
> >   */
> > -static uint32_t compute_crc(const u8 *vaddr,
> > +static uint32_t compute_crc(const __le64 *vaddr,
> >                           const struct vkms_composer *composer)
> >  {
> > -     int x, y;
> > -     u32 crc = 0, pixel = 0;
> > -     int x_src = composer->src.x1 >> 16;
> > -     int y_src = composer->src.y1 >> 16;
> > -     int h_src = drm_rect_height(&composer->src) >> 16;
> > -     int w_src = drm_rect_width(&composer->src) >> 16;
> > -
> > -     for (y = y_src; y < y_src + h_src; ++y) {
> > -             for (x = x_src; x < x_src + w_src; ++x) {
> > -                     pixel = get_pixel_from_buffer(x, y, vaddr, composer);
> > -                     crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
> > -             }
> > -     }
> > +     int h = drm_rect_height(&composer->dst);
> > +     int w = drm_rect_width(&composer->dst);
> >
> > -     return crc;
> > +     return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
> >  }
> >
> > -static u8 blend_channel(u8 src, u8 dst, u8 alpha)
> > +static __le16 blend_channel(u16 src, u16 dst, u16 alpha)
>
> This function is doing the OVER operation (Porter-Duff classification)
> assuming pre-multiplied alpha. I think the function name should reflect
> that. At the very least it should somehow note pre-multiplied alpha,
> because KMS property "pixel blend mode" can change that.

The closest thing it has is a comment in the alpha_blend function.

But, aside from that, does `pre_mul_channel_blend` look good to you?

>
> 'alpha' should be named 'src_alpha'.
>
> >  {
> > -     u32 pre_blend;
> > -     u8 new_color;
> > +     u64 pre_blend;
>
> I'm not quite sure if u32 would suffice... max value for src is
> 0xffff * src_alpha / 0xffff = src_alpha. Max value for dst is 0xffff.

I didn't understand this division. What does the second 0xffff represent?

>
> So we have at max
>
> src_alpha * 0xffff + 0xffff * (0xffff - src_alpha)
>
> Each multiplication independently will fit in u32.
>
> Rearranging we get
>
> src_alpha * 0xffff + 0xffff * 0xffff - 0xffff * src_alpha
>
> which equals
>
> 0xffff * 0xffff
>
> which fits in u32 and does not depend on src_alpha.
>
> So unless I made a mistake, looks like u32 should be enough. On 32-bit
> CPUs it should have speed benefits compared to u64.
>
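
Leaving the division aside, the bound itself can be checked with a small
userspace sketch of the u32 variant. The name `pre_mul_blend_channel`, the
stdint types, the local DIV_ROUND_UP_SKETCH macro and the assert are my
assumptions for illustration, not code from the patch. The bound only holds
for pre-multiplied input, i.e. src <= src_alpha:

```c
#include <assert.h>
#include <stdint.h>

/* Round-up division, mirroring the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP_SKETCH(n, d) (((n) + (d) - 1) / (d))

static uint16_t pre_mul_blend_channel(uint16_t src, uint16_t dst,
				      uint16_t src_alpha)
{
	uint32_t blended;

	/* Pre-multiplied input: a channel can never exceed its alpha. */
	assert(src <= src_alpha);

	/*
	 * Worst case is src_alpha * 0xffff + 0xffff * (0xffff - src_alpha)
	 * = 0xffff * 0xffff = 0xfffe0001, which fits in 32 bits even after
	 * the 0xfffe added by the round-up division.
	 */
	blended = (uint32_t)src * 0xffffu +
		  (uint32_t)dst * (0xffffu - src_alpha);

	return (uint16_t)DIV_ROUND_UP_SKETCH(blended, 0xffffu);
}
```

With src_alpha = 0xffff the destination term vanishes and the result is src;
with src = 0 and src_alpha = 0 the result is dst.
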
> > +     u16 new_color;
> >
> > -     pre_blend = (src * 255 + dst * (255 - alpha));
> > +     pre_blend = (src * 0xffff + dst * (0xffff - alpha));
>
> 'pre_blend' means "before blending" so maybe a better name here as the
> blending is already done.
>

I don't have a good name right now, but I will think of something.

> >
> > -     /* Faster div by 255 */
> > -     new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
> > +     new_color = DIV_ROUND_UP(pre_blend, 0xffff);
> >
> > -     return new_color;
> > +     return cpu_to_le16(new_color);
>
> What's the thing with cpu_to_le16 here?
>
> I think the temporary line buffers could just be using the cpu-native
> u16 type. There is no DRM format code for that, but we don't need one
> either. This format is not for interoperation with anything else, it's
> just internal here, and the main goals with it are precision and speed.
>
> As such, the temporary line buffers could be simply u16 arrays, so you
> don't need to consider the channel packing into a u64.
>

Wouldn't this cause a problem when calculating the CRC on BE machines?
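
What I have in mind, as a rough userspace sketch: if the line buffer holds
CPU-native u16 values, the bytes handed to the CRC differ between LE and BE
hosts unless they are first put into a fixed order. The helpers below
(`put_le16`, `row_to_le_bytes`) are hypothetical names; in the kernel I assume
cpu_to_le16() would do the per-channel conversion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one native u16 channel as two little-endian bytes. */
static void put_le16(uint8_t *out, uint16_t v)
{
	out[0] = (uint8_t)(v & 0xff);
	out[1] = (uint8_t)(v >> 8);
}

/*
 * Serialize a row of native u16 channels into little-endian bytes, so the
 * byte stream is the same on every host. out_len is the size of out in
 * bytes and is only used to catch overflows.
 */
static void row_to_le_bytes(const uint16_t *row, size_t nelem,
			    uint8_t *out, size_t out_len)
{
	assert(out_len >= 2 * nelem);

	for (size_t i = 0; i < nelem; i++)
		put_le16(out + 2 * i, row[i]);
}
```

The cost is one extra pass over the row before the CRC, which is why I'm
asking whether it matters in practice.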

> >  }
> >
>
>
>
> From here on, I will be removing the diff minus lines from the quoted
> code, because these functions are completely new.
>
> >  /**
> >   * alpha_blend - alpha blending equation
>
> This is specifically the pre-multiplied alpha blending, so reflect that
> in the function name.
>

OK, I will use `pre_mul_alpha_blend`. Or something similar.

> > + * @src_composer: source framebuffer's metadata
> > + * @dst_composer: destination framebuffer's metadata
> > + * @y: The y coordinate (height) of the line that will be processed
> > + * @line_buffer: The line with the pixels from src_compositor
> >   *
> >   * blend pixels using premultiplied blend formula. The current DRM assumption
> >   * is that pixel color values have been already pre-multiplied with the alpha
> >   * channel values. See more drm_plane_create_blend_mode_property(). Also, this
> >   * formula assumes a completely opaque background.
> > + *
> > + * For performance reasons this function also fetches the pixels from the
> > + * destination of the frame line y.
> > + * We use the information that one of the source pixels are in the output
> > + * buffer to fetch it here instead of separate function. And because the
> > + * output format is ARGB16161616, we know that they don't need to be
> > + * converted.
> > + * This saves us an indirect function call for each line.
>
> I think this paragraph should be obvious from the type of 'line_buffer'
> parameter and that you are blending src into dst.
>
> >   */
> > +static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer)
> >  {
> > +     __le16 *output_pixel = pixels_addr;
>
> Aren't you supposed to be writing into line_buffer, not into src?
>
> There is something very strange with the logic here.
>
> In fact, the function signature of the blending function is unexpected.
> A blending function should operate on two line_buffers, not what looks
> like arbitrary buffer pixels.
>
> I think you should forget the old code and design these from scratch.
> You would have three different kinds of functions:
>
> - loading: fetch a row from an image and convert into a line buffer
> - blending: take two line buffers and blend them into one of the line
>   buffers
> - storing: convert a line buffer and write it into an image row
>
> I would not coerce these three different operations into less than
> three function pointer types.
>
> To actually run a blending operation between source and destination
> images, you would need four function pointers:
> - loader for source (by pixel format)
> - loader for destination (by pixel format)
> - blender (by chosen blending operation)
> - storing for destination (by pixel format)
>
> Function parameter types should make it obvious whether something is an
> image or row in arbitrary format, or a line buffer in the special
> internal format.
>
> Then the algorithm would work roughly like this:
>
> for each plane:
>         for each row:
>                 load source into lb1
>                 load destination into lb2
>                 blend lb1 into lb2
>                 store lb2 into destination
>
> This is not optimal, you see how destination is repeatedly loaded and
> stored for each plane. So you could swap the loops:
>
> allocate lb1, lb2 with destination width
> for each destination row:
>         load destination into lb2
>
>         for each plane:
>                 load source into lb1
>                 blend lb1 into lb2
>
>         store lb2 into destination

I'm doing something very similar right now, based on comments from the
previous emails. It looks very similar to your pseudocode.

And this solves several weirdnesses in my code that you commented on
throughout this review.

But I made one decision that I would like to hear your thoughts on.

Using your variables, instead of storing lb2 in the destination,
I'm using it to calculate the CRC in the middle of the compositing loop
and, if necessary, converting and storing lb2 into the wb buffer.

So the pseudocode looks like this:

allocate lb1, lb2 with destination width
for each destination row:
        load destination into lb2

        for each plane:
                load source into lb1
                blend lb1 into lb2

        compute crc of lb2

        if wb pending
                convert and store lb2 to wb buffer

return crc

With that we avoid the allocation of the full image buffer.
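
The pseudocode above can be sketched in userspace C as follows. This is only
an illustration under simplifying assumptions, not the vkms code: every plane
has the destination size and is already in the internal format, "load
destination" is replaced by clearing to opaque black, writeback is a plain
copy, and `crc32_sketch()` is a hypothetical bitwise stand-in for crc32_le().
All names are made up for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct line_pixel {
	uint16_t r, g, b, a;
};

/* Hypothetical plane: same size as the destination, pre-multiplied alpha. */
struct plane {
	const struct line_pixel *pixels;	/* width * height, row-major */
};

/* Bitwise CRC-32 (reflected polynomial), standing in for crc32_le(). */
static uint32_t crc32_sketch(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0);
	}
	return crc;
}

/* Pre-multiplied OVER for one channel; assumes src <= src_alpha. */
static uint16_t blend_ch(uint16_t src, uint16_t dst, uint16_t src_alpha)
{
	uint32_t sum = (uint32_t)src * 0xffffu +
		       (uint32_t)dst * (0xffffu - src_alpha);

	return (uint16_t)((sum + 0xfffeu) / 0xffffu);
}

/*
 * lb1 and lb2 hold `width` pixels each. wb may be NULL when no writeback
 * is pending; otherwise it receives width * height pixels.
 */
static uint32_t compose_rows(const struct plane *planes, size_t nplanes,
			     size_t width, size_t height,
			     struct line_pixel *lb1, struct line_pixel *lb2,
			     struct line_pixel *wb)
{
	const size_t row_bytes = width * sizeof(*lb2);
	uint32_t crc = 0;

	for (size_t y = 0; y < height; y++) {
		/* Stand-in for "load destination into lb2": opaque black. */
		for (size_t x = 0; x < width; x++)
			lb2[x] = (struct line_pixel){ 0, 0, 0, 0xffff };

		for (size_t p = 0; p < nplanes; p++) {
			/* "load source into lb1"; a real loader would convert. */
			memcpy(lb1, planes[p].pixels + y * width, row_bytes);

			/* "blend lb1 into lb2" */
			for (size_t x = 0; x < width; x++) {
				uint16_t a = lb1[x].a;

				lb2[x].r = blend_ch(lb1[x].r, lb2[x].r, a);
				lb2[x].g = blend_ch(lb1[x].g, lb2[x].g, a);
				lb2[x].b = blend_ch(lb1[x].b, lb2[x].b, a);
			}
		}

		crc = crc32_sketch(crc, lb2, row_bytes);

		if (wb)	/* "convert and store lb2 to wb buffer" */
			memcpy(wb + y * width, lb2, row_bytes);
	}

	return crc;
}
```

Because the CRC is chained from row to row, the result is the same as one CRC
over the whole composed image, but only two line buffers are ever allocated.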

>
> Inside the loop over plane, you need to check if the plane overlaps the
> current destination row at all. If not, continue on the next plane. If
> yes, load source into lb1 and compute the offset into lb2 where it
> needs to be blended.

Thanks for this tip. This is an optimization that my code currently doesn't
have.
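
A minimal userspace sketch of how I understand the check, assuming no scaling.
`struct plane_rect` and `plane_row_span()` are hypothetical names; the
rectangle uses exclusive x2/y2 like struct drm_rect:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical plane placement in destination coordinates (no scaling). */
struct plane_rect {
	int x1, y1, x2, y2;	/* x2 and y2 are exclusive */
};

/*
 * For destination row y, report whether the plane touches it. If it does,
 * *dst_x is the offset into lb2 where blending starts, *src_x and *src_y
 * are the first plane pixel to load into lb1, and *len is the number of
 * pixels, all after clipping to a destination of dst_width pixels.
 */
static bool plane_row_span(const struct plane_rect *r, int y, int dst_width,
			   int *dst_x, int *src_x, int *src_y, int *len)
{
	int start, end;

	assert(dst_width > 0);

	/* The plane does not cover this row at all: skip it. */
	if (y < r->y1 || y >= r->y2)
		return false;

	start = r->x1 < 0 ? 0 : r->x1;
	end = r->x2 > dst_width ? dst_width : r->x2;

	/* The plane is entirely left or right of the destination. */
	if (start >= end)
		return false;

	*dst_x = start;
	*src_x = start - r->x1;
	*src_y = y - r->y1;
	*len = end - start;
	return true;
}
```

For a cursor plane, most destination rows would return false and the inner
loop would skip straight to the next plane.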

>
> Since we don't support scaling yet, lb1 length will never exceed
> destination width, because there is no need to load plane buffer pixels
> we would not be writing out.
>
> Also "load destination into lb2" could be replaced with just "clear
> lb2" if the old destination contents are to be discarded. Then you also
> don't need the function pointer for "loader for destination".
>
> I think you already had all these ideas, just the execution in code got
> really messy somehow.
>
> > +     int i;
> >
> > +     for (i = 0; i < length; i++) {
> > +             u16 src1_a = line_buffer[i] >> 48;
> > +             u16 src1_r = (line_buffer[i] >> 32) & 0xffff;
> > +             u16 src1_g = (line_buffer[i] >> 16) & 0xffff;
> > +             u16 src1_b = line_buffer[i] & 0xffff;
>
> If you used native u16 array for line buffers, all this arithmetic
> would be unnecessary.
>
> >
> > +             u16 src2_r = le16_to_cpu(output_pixel[2]);
> > +             u16 src2_g = le16_to_cpu(output_pixel[1]);
> > +             u16 src2_b = le16_to_cpu(output_pixel[0]);
> > +
> > +             output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
> > +             output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
> > +             output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
> > +             output_pixel[3] = 0xffff;
> > +
> > +             output_pixel += 4;
> > +     }
> >  }
> >
> >  /**
> >   * @src_composer: source framebuffer's metadata
> > + * @dst_composer: destination framebuffer's metadata
> > + * @funcs: A struct containing all the composition functions (get_src_line
> > + *         and set_output_line)
> > + * @line_buffer: The line with the pixels from src_compositor
> >   *
> > + * Using the pixel_blend function passed as parameter, this function blends
> > + * all pixels from src plane into a output buffer (with a blend function
> > + * passed as parameter).
> > + * Information of the output buffer is in the dst_composer parameter
> > + * and the source plane in the src_composer.
> > + * The get_src_line will use the src_composer to get the respective line,
> > + * convert, and return it as ARGB_16161616.
> > + * And finally, the blend function will receive the src_composer, dst_composer,
> > + * the line y coordinate, and the line buffer. Blend all pixels, and store the
> > + * result in the output.
> >   *
> >   * TODO: completely clear the primary plane (a = 0xff) before starting to blend
> >   * pixel color values
> >   */
> > +static void blend(struct vkms_composer *src_composer,
> >                 struct vkms_composer *dst_composer,
> > +               struct vkms_pixel_composition_functions *funcs,
> > +               u64 *line_buffer)
> >  {
> > +     int i, i_dst;
> >
> >       int x_src = src_composer->src.x1 >> 16;
> >       int y_src = src_composer->src.y1 >> 16;
> >
> >       int x_dst = src_composer->dst.x1;
> >       int y_dst = src_composer->dst.y1;
> > +
> >       int h_dst = drm_rect_height(&src_composer->dst);
> > +     int length = drm_rect_width(&src_composer->dst);
> >
> >       int y_limit = y_src + h_dst;
> > +
> > +     u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src);
> > +     u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst);
> > +
> > +     int src_next_line_offset = src_composer->pitch;
> > +     int dst_next_line_offset = dst_composer->pitch;
> > +
> > +     for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
> > +             funcs->get_src_line(src_pixels, length, line_buffer);
> > +             funcs->set_output_line(dst_pixels, length, line_buffer);
> > +             src_pixels += src_next_line_offset;
> > +             dst_pixels += dst_next_line_offset;
> >       }
> >  }
> >
> > +static void ((*get_line_fmt_transform_function(u32 format))
> > +         (void *pixels_addr, int length, u64 *line_buffer))
> >  {
> > +     if (format == DRM_FORMAT_ARGB8888)
> > +             return &ARGB8888_to_ARGB16161616;
> > +     else if (format == DRM_FORMAT_ARGB16161616)
> > +             return &get_ARGB16161616;
> > +     else
> > +             return &XRGB8888_to_ARGB16161616;
> > +}
> >
> > +static void ((*get_output_line_function(u32 format))
> > +          (void *pixels_addr, int length, u64 *line_buffer))
> > +{
> > +     if (format == DRM_FORMAT_ARGB8888)
> > +             return &convert_to_ARGB8888;
> > +     else if (format == DRM_FORMAT_ARGB16161616)
> > +             return &convert_to_ARGB16161616;
> > +     else
> > +             return &convert_to_XRGB8888;
> > +}
> >
> > +static void compose_plane(struct vkms_composer *src_composer,
> > +                       struct vkms_composer *dst_composer,
>
> I'm confused by the vkms_composer concept. If there is a separate thing
> for source and destination and they are used together, then I don't
> think that thing is a "composer" but some kind of... image structure?

I didn't create this struct, but I think this is exactly what it represents.

> "Composer" is what compose_active_planes() does.

Do you think this struct needs a rename?

>
> > +                       struct vkms_pixel_composition_functions *funcs,
> > +                       u64 *line_buffer)
> > +{
> > +     u32 src_format = src_composer->fb->format->format;
> >
> > +     funcs->get_src_line = get_line_fmt_transform_function(src_format);
> >
> > +     blend(src_composer, dst_composer, funcs, line_buffer);
>
> This function is confusing. You get 'funcs' as argument, but you
> overwrite one field and then trust that the other field was already set
> by the caller. The policy of how 'funcs' argument here works is too
> complicated to me.
>
> If you need just one function pointer as argument, then do exactly
> that, and construct the vfunc struct inside this function.

I think this will be totally solved with the code redesign.

>
> >  }
> >
> > +static __le64 *compose_active_planes(struct vkms_composer *primary_composer,
> > +                                  struct vkms_crtc_state *crtc_state,
> > +                                  u64 *line_buffer)
> >  {
> > +     struct vkms_plane_state **active_planes = crtc_state->active_planes;
> > +     int h = drm_rect_height(&primary_composer->dst);
> > +     int w = drm_rect_width(&primary_composer->dst);
> > +     struct vkms_pixel_composition_functions funcs;
> > +     struct vkms_composer dst_composer;
> > +     __le64 *vaddr_out;
> >       int i;
> >
> >       if (WARN_ON(dma_buf_map_is_null(&primary_composer->map[0])))
> > +             return NULL;
> >
> > +     vaddr_out = kvzalloc(w * h * sizeof(__le64), GFP_KERNEL);
>
> Why allocate a full size image here in the compositing function?
>
> You should be able to do with just few line buffers instead.

Yes, indeed. I'm working on it :)

>
> > +     if (!vaddr_out) {
> > +             DRM_ERROR("Cannot allocate memory for output frame.");
> > +             return NULL;
> > +     }
> >
> > +     dst_composer = get_output_vkms_composer(vaddr_out, primary_composer);
> > +     funcs.set_output_line = get_output_line_function(DRM_FORMAT_ARGB16161616);
> > +     compose_plane(active_planes[0]->composer, &dst_composer,
> > +                   &funcs, line_buffer);
> >
> >       /* If there are other planes besides primary, we consider the active
> >        * planes should be in z-order and compose them associatively:
> >        * ((primary <- overlay) <- cursor)
> >        */
> > +     funcs.set_output_line = alpha_blend;
> >       for (i = 1; i < crtc_state->num_active_planes; i++)
> > +             compose_plane(active_planes[i]->composer, &dst_composer,
> > +                           &funcs, line_buffer);
> >
> > +     return vaddr_out;
> > +}
> > +
> > +static void write_wb_buffer(struct vkms_writeback_job *active_wb,
> > +                         struct vkms_composer *primary_composer,
> > +                         __le64 *vaddr_out, u64 *line_buffer)
> > +{
> > +     u32 dst_fb_format = active_wb->composer.fb->format->format;
> > +     struct vkms_pixel_composition_functions funcs;
> > +     struct vkms_composer src_composer;
> > +
> > +     src_composer = get_output_vkms_composer(vaddr_out, primary_composer);
> > +     funcs.set_output_line = get_output_line_function(dst_fb_format);
> > +     active_wb->composer.src = primary_composer->src;
> > +     active_wb->composer.dst = primary_composer->dst;
> > +
> > +     compose_plane(&src_composer, &active_wb->composer, &funcs, line_buffer);
> > +}
> > +
> > +u64 *alloc_line_buffer(struct vkms_composer *primary_composer)
> > +{
> > +     int line_width = drm_rect_width(&primary_composer->dst);
> > +     u64 *line_buffer;
> > +
> > +     line_buffer = kvmalloc(line_width * sizeof(u64), GFP_KERNEL);
> > +     if (!line_buffer)
> > +             DRM_ERROR("Cannot allocate memory for intermediate line buffer");
> > +
> > +     return line_buffer;
> >  }
> >
> >  /**
> > @@ -221,14 +272,14 @@ void vkms_composer_worker(struct work_struct *work)
> >                                               struct vkms_crtc_state,
> >                                               composer_work);
> >       struct drm_crtc *crtc = crtc_state->base.crtc;
> > +     struct vkms_writeback_job *active_wb = crtc_state->active_writeback;
> >       struct vkms_output *out = drm_crtc_to_vkms_output(crtc);
> >       struct vkms_composer *primary_composer = NULL;
> >       struct vkms_plane_state *act_plane = NULL;
> > +     u64 frame_start, frame_end, *line_buffer;
> >       bool crc_pending, wb_pending;
> > -     void *vaddr_out = NULL;
> > +     __le64 *vaddr_out = NULL;
> >       u32 crc32 = 0;
> > -     u64 frame_start, frame_end;
> > -     int ret;
> >
> >       spin_lock_irq(&out->composer_lock);
> >       frame_start = crtc_state->frame_start;
> > @@ -256,28 +307,30 @@ void vkms_composer_worker(struct work_struct *work)
> >       if (!primary_composer)
> >               return;
> >
> > -     if (wb_pending)
> > -             vaddr_out = crtc_state->active_writeback->data[0].vaddr;
> > +     line_buffer = alloc_line_buffer(primary_composer);
> > +     if (!line_buffer)
> > +             return;
> >
> > -     ret = compose_active_planes(&vaddr_out, primary_composer,
> > -                                 crtc_state);
> > -     if (ret) {
> > -             if (ret == -EINVAL && !wb_pending)
> > -                     kvfree(vaddr_out);
> > +     vaddr_out = compose_active_planes(primary_composer, crtc_state,
> > +                                       line_buffer);
> > +     if (!vaddr_out) {
> > +             kvfree(line_buffer);
> >               return;
> >       }
> >
> > -     crc32 = compute_crc(vaddr_out, primary_composer);
> > -
> >       if (wb_pending) {
> > +             write_wb_buffer(active_wb, primary_composer,
> > +                             vaddr_out, line_buffer);
> >               drm_writeback_signal_completion(&out->wb_connector, 0);
> >               spin_lock_irq(&out->composer_lock);
> >               crtc_state->wb_pending = false;
> >               spin_unlock_irq(&out->composer_lock);
> > -     } else {
> > -             kvfree(vaddr_out);
> >       }
> >
> > +     kvfree(line_buffer);
> > +     crc32 = compute_crc(vaddr_out, primary_composer);
> > +     kvfree(vaddr_out);
> > +
> >       /*
> >        * The worker can fall behind the vblank hrtimer, make sure we catch up.
> >        */
> > diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
> > new file mode 100644
> > index 000000000000..5b850fce69f3
> > --- /dev/null
> > +++ b/drivers/gpu/drm/vkms/vkms_formats.h
> > @@ -0,0 +1,155 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ */
> > +
> > +#ifndef _VKMS_FORMATS_H_
> > +#define _VKMS_FORMATS_H_
> > +
> > +#include <drm/drm_rect.h>
> > +
> > +#define pixel_offset(composer, x, y) \
> > +     ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
>
> Why macro instead of a static inline function?

Again, I don't have a good answer for that :(

>
> > +
> > +/*
> > + * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
> > + *
> > + * @composer: Buffer metadata
> > + * @x: The x(width) coordinate of the 2D buffer
> > + * @y: The y (height) coordinate of the 2D buffer
> > + *
> > + * Takes the information stored in the composer, a pair of coordinates, and
> > + * returns the address of the first color channel.
> > + * This function assumes the channels are packed together, i.e. a color channel
> > + * comes immediately after another. And therefore, this function doesn't work
> > + * for YUV with chroma subsampling (e.g. YUV420 and NV21).
> > + */
> > +static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
>
> Is it normal in the kernel to have non-inline functions in headers?
>
> Actually this file does not look like a header at all, it should
> probably be a .c file and not #included.

Oops. This should not be that way. I will fix it.

>
> > +{
> > +     int offset = pixel_offset(composer, x, y);
> > +
> > +     return (u8 *)composer->map[0].vaddr + offset;
> > +}
> > +
> > +static void ARGB8888_to_ARGB16161616(void *pixels_addr, int length,
> > +                                  u64 *line_buffer)
> > +{
> > +     u8 *src_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +             /*
> > +              * Organizes the channels in their respective positions and converts
> > +              * the 8 bits channel to 16.
> > +              * The 257 is the "conversion ratio". This number is obtained by the
> > +              * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
> > +              * the best color value in a pixel format with more possibilities.
> > +              * And a similar idea applies to others RGB color conversions.
> > +              */
> > +             line_buffer[i] = ((u64)src_pixels[3] * 257) << 48 |
> > +                              ((u64)src_pixels[2] * 257) << 32 |
> > +                              ((u64)src_pixels[1] * 257) << 16 |
> > +                              ((u64)src_pixels[0] * 257);
> > +
> > +             src_pixels += 4;
> > +     }
> > +}
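
For reference, the 257 ratio from the comment above can be checked with a
small userspace sketch. The helper names are hypothetical and the stdint types
replace the kernel ones; the 16-to-8 direction rounds up the same way the
DIV_ROUND_UP(..., 257) calls later in this file do:

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit to 16-bit: 0xff * 257 == 0xffff, so full scale maps to full scale. */
static uint16_t expand_8_to_16(uint8_t v)
{
	return (uint16_t)(v * 257);
}

/* 16-bit to 8-bit, rounding up: (v + 257 - 1) / 257. */
static uint8_t reduce_16_to_8(uint16_t v)
{
	return (uint8_t)((v + 256) / 257);
}

/* Every 8-bit value should survive the trip to 16 bits and back. */
static void check_round_trip(void)
{
	for (int v = 0; v <= 0xff; v++)
		assert(reduce_16_to_8(expand_8_to_16((uint8_t)v)) == v);
}
```

Multiplying by 257 is the same as repeating the byte in both halves of the
16-bit value, e.g. 0x80 becomes 0x8080.
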
> > +
> > +static void XRGB8888_to_ARGB16161616(void *pixels_addr, int length,
> > +                                  u64 *line_buffer)
> > +{
> > +     u8 *src_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +             /*
> > +              * The same as the ARGB8888 but with the alpha channel as the
> > +              * maximum value as possible.
> > +              */
> > +             line_buffer[i] = 0xffffllu << 48 |
> > +                              ((u64)src_pixels[2] * 257) << 32 |
> > +                              ((u64)src_pixels[1] * 257) << 16 |
> > +                              ((u64)src_pixels[0] * 257);
> > +
> > +             src_pixels += 4;
> > +     }
> > +}
> > +
> > +static void get_ARGB16161616(void *pixels_addr, int length, u64 *line_buffer)
> > +{
> > +     __le64 *src_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +             /*
> > +              * Because the format byte order is in little-endian and this code
> > +              * needs to run on big-endian machines too, we need modify
> > +              * the byte order from little-endian to the CPU native byte order.
> > +              */
> > +             line_buffer[i] = le64_to_cpu(*src_pixels);
> > +
> > +             src_pixels++;
> > +     }
> > +}
> > +
> > +/*
> > + * The following functions are used as blend operations. But unlike the
> > + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
> > + * source, convert it to a specific format, and store it in the destination.
>
> This is a surprising trick I don't like. Blending operation and storing
> operation are fundamentally different. Once you have more obvious
> function signatures, this trick is not possible anymore.

This is another thing that will be improved with the redesign from scratch.

>
>
> Thanks,
> pq
>
> > + *
> > + * They are used in the `compose_active_planes` and `write_wb_buffer` to
> > + * copy and convert one line of the frame from/to the output buffer to/from
> > + * another buffer (e.g. writeback buffer, primary plane buffer).
> > + */
> > +
> > +static void convert_to_ARGB8888(void *pixels_addr, int length, u64 *line_buffer)
> > +{
> > +     u8 *dst_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +             /*
> > +              * This sequence below is important because the format's byte order is
> > +              * in little-endian. In the case of the ARGB8888 the memory is
> > +              * organized this way:
> > +              *
> > +              * | Addr     | = blue channel
> > +              * | Addr + 1 | = green channel
> > +              * | Addr + 2 | = Red channel
> > +              * | Addr + 3 | = Alpha channel
> > +              */
> > +             dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
> > +             dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
> > +             dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
> > +             dst_pixels[3] = DIV_ROUND_UP(line_buffer[i] >> 48, 257);
> > +
> > +             dst_pixels += 4;
> > +     }
> > +}
> > +
> > +static void convert_to_XRGB8888(void *pixels_addr, int length, u64 *line_buffer)
> > +{
> > +     u8 *dst_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +             dst_pixels[0] = DIV_ROUND_UP(line_buffer[i] & 0xffff, 257);
> > +             dst_pixels[1] = DIV_ROUND_UP((line_buffer[i] >> 16) & 0xffff, 257);
> > +             dst_pixels[2] = DIV_ROUND_UP((line_buffer[i] >> 32) & 0xffff, 257);
> > +             dst_pixels[3] = 0xff;
> > +
> > +             dst_pixels += 4;
> > +     }
> > +}
> > +
> > +static void convert_to_ARGB16161616(void *pixels_addr, int length,
> > +                                 u64 *line_buffer)
> > +{
> > +     __le64 *dst_pixels = pixels_addr;
> > +     int i;
> > +
> > +     for (i = 0; i < length; i++) {
> > +
> > +             *dst_pixels = cpu_to_le64(line_buffer[i]);
> > +             dst_pixels++;
> > +     }
> > +}
> > +
> > +#endif /* _VKMS_FORMATS_H_ */
>


* Re: [PATCH v2 0/8] Add new formats support to vkms
  2021-11-09  9:32 ` [PATCH v2 0/8] Add new formats support to vkms Pekka Paalanen
@ 2021-11-10 17:32   ` Igor Torrente
  2021-11-11  8:32     ` Pekka Paalanen
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-11-10 17:32 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel

Hi Pekka,

On Tue, Nov 9, 2021 at 6:32 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Tue, 26 Oct 2021 08:34:00 -0300
> Igor Torrente <igormtorrente@gmail.com> wrote:
>
> > Summary
> > =======
> > This series of patches refactor some vkms components in order to introduce
> > new formats to the planes and writeback connector.
> >
> > Now in the blend function, the plane's pixels are converted to ARGB16161616
> > and then blended together.
> >
> > > The CRC is calculated based on the ARGB16161616 buffer. And if required,
> > this buffer is copied/converted to the writeback buffer format.
> >
> > And to handle the pixel conversion, new functions were added to convert
> > from a specific format to ARGB16161616 (the reciprocal is also true).
> >
> > Tests
> > =====
> > This patch series was tested using the following igt tests:
> > -t ".*kms_plane.*"
> > -t ".*kms_writeback.*"
> > -t ".*kms_cursor_crc*"
> > -t ".*kms_flip.*"
> >
> > New tests passing
> > -------------------
> > - pipe-A-cursor-size-change
> > - pipe-A-cursor-alpha-transparent
> >
> > Performance
> > -----------
> > Following some optimization proposed by Pekka Paalanen, now the code
> > runs way faster than V1 and slightly faster than the current implementation.
> >
> > |                          Frametime                          |
> > |:---------------:|:---------:|:--------------:|:------------:|
> > > | implementation  |  Current  |  Per-pixel(V1) | Per-line(V2) |
> > | frametime range |  8~22 ms  |    32~56 ms    |    6~19 ms   |
> > |     Average     |  10.0 ms  |     35.8 ms    |    8.6 ms    |
>
> Wow, that's much better than I expected.
>
> What is your benchmark? That is, what program do you use and what
> operations does it trigger to produce these measurements? What are the
> > sizes of all the planes/buffers involved? What kind of CPU was this run
> > on?

1 and 2) I just measured the frametime of the IGT test ".*kms_cursor_crc*"
using jiffies. I collected all the frametimes, put them into a
spreadsheet, calculated some values, and drew some histograms.

I mean, it is not the best benchmark, but it at least gives an idea of what
is happening.

3) The primary plane was 1024x768, but the cursor plane size
varied between the tests. All XRGB_8888, if I'm not mistaken.

4) I tested it on a QEMU VM running on an Intel Core i5 4440, ~3.3GHz.

>
>
> Thanks,
> pq
>
> >
> > Writeback test
> > --------------
> > During the development of this patch series, I discovered that the
> > writeback-check-output test wasn't filling the plane correctly.
> >
> > So, currently, this patch series is failing in this test. But I sent a
> > patch to igt to fix it[1].
> >
> > XRGB to ARGB behavior
> > =====================
> > During the development, I decided to always fill the alpha channel of
> > the output pixel whenever the conversion from a format without an alpha
> > channel to ARGB16161616 is necessary. Therefore, I ignore the value
> > received from the XRGB and overwrite the value with 0xFFFF.
> >
> > ---
> > Igor Torrente (8):
> >   drm: vkms: Replace the deprecated drm_mode_config_init
> >   drm: vkms: Alloc the compose frame using vzalloc
> >   drm: vkms: Replace hardcoded value of `vkms_composer.map` to
> >     DRM_FORMAT_MAX_PLANES
> >   drm: vkms: Add fb information to `vkms_writeback_job`
> >   drm: drm_atomic_helper: Add a new helper to deal with the writeback
> >     connector validation
> >   drm: vkms: Refactor the plane composer to accept new formats
> >   drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats
> >   drm: vkms: Add support to the RGB565 format
> >
> >  drivers/gpu/drm/drm_atomic_helper.c   |  47 ++++
> >  drivers/gpu/drm/vkms/vkms_composer.c  | 329 +++++++++++++++-----------
> >  drivers/gpu/drm/vkms/vkms_drv.c       |   6 +-
> >  drivers/gpu/drm/vkms/vkms_drv.h       |  14 +-
> >  drivers/gpu/drm/vkms/vkms_formats.h   | 252 ++++++++++++++++++++
> >  drivers/gpu/drm/vkms/vkms_plane.c     |  17 +-
> >  drivers/gpu/drm/vkms/vkms_writeback.c |  33 ++-
> >  include/drm/drm_atomic_helper.h       |   3 +
> >  8 files changed, 545 insertions(+), 156 deletions(-)
> >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> >
>


* Re: [PATCH v2 0/8] Add new formats support to vkms
  2021-11-10 17:32   ` Igor Torrente
@ 2021-11-11  8:32     ` Pekka Paalanen
  0 siblings, 0 replies; 28+ messages in thread
From: Pekka Paalanen @ 2021-11-11  8:32 UTC (permalink / raw)
  To: Igor Torrente
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel


On Wed, 10 Nov 2021 14:32:26 -0300
Igor Torrente <igormtorrente@gmail.com> wrote:

> Hi Pekka,
> 
> On Tue, Nov 9, 2021 at 6:32 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Tue, 26 Oct 2021 08:34:00 -0300
> > Igor Torrente <igormtorrente@gmail.com> wrote:
> >  
> > > Summary
> > > =======
> > > This series of patches refactor some vkms components in order to introduce
> > > new formats to the planes and writeback connector.
> > >
> > > Now in the blend function, the plane's pixels are converted to ARGB16161616
> > > and then blended together.
> > >
> > > The CRC is calculated based on the ARGB16161616 buffer. And if required,
> > > this buffer is copied/converted to the writeback buffer format.
> > >
> > > And to handle the pixel conversion, new functions were added to convert
> > > from a specific format to ARGB16161616 (the reciprocal is also true).
> > >
> > > Tests
> > > =====
> > > This patch series was tested using the following igt tests:
> > > -t ".*kms_plane.*"
> > > -t ".*kms_writeback.*"
> > > -t ".*kms_cursor_crc*"
> > > -t ".*kms_flip.*"
> > >
> > > New tests passing
> > > -------------------
> > > - pipe-A-cursor-size-change
> > > - pipe-A-cursor-alpha-transparent
> > >
> > > Performance
> > > -----------
> > > Following some optimization proposed by Pekka Paalanen, now the code
> > > runs way faster than V1 and slightly faster than the current implementation.
> > >
> > > |                          Frametime                          |
> > > |:---------------:|:---------:|:--------------:|:------------:|
> > > | implementation  |  Current  |  Per-pixel(V1) | Per-line(V2) |
> > > | frametime range |  8~22 ms  |    32~56 ms    |    6~19 ms   |
> > > |     Average     |  10.0 ms  |     35.8 ms    |    8.6 ms    |  
> >
> > Wow, that's much better than I expected.
> >
> > What is your benchmark? That is, what program do you use and what
> > operations does it trigger to produce these measurements? What are the
> > sizes of all the planes/buffers involved? What kind of CPU was this run
> > on?
> 
> 1 and 2) I just measured the frametime of the IGT test ".*kms_cursor_crc*"
> using jiffies. I collected all the frametimes, put them into a
> spreadsheet, calculated some values and drew some histograms.
> 
> I mean, it is not the best benchmark, but it at least gives an idea of what
> is happening.
> 
> 3) The primary plane was 1024x768, but the cursor plane
> varies between the tests. All XRGB_8888, if I'm not mistaken.
> 
> 4) I tested it on a Qemu VM running on the Intel core i5 4440. ~3.3GHz

Hi Igor,

alright, that analysis sounds fine, even though the varying cursor plane
size adds some ambiguity to the results.

If you want to dig deeper into measuring this, I would suggest some
scenarios if at all possible:

- large primary plane and large cursor plane with 100% overlap, to
  measure the raw pixel throughput

- large primary plane and small cursor plane with 100% overlap, to
  measure the efficiency of skipping pixels that do not need blending

- large primary plane and large cursor plane with only a little
  overlap (cursor largely off-screen), to measure the efficiency of
  skipping pixels that do not contribute to the end result at all

But that's only curiosity, I think your existing benchmarks sound
perfectly fine as the difference is so big.


Thanks,
pq


* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-11-10 16:56     ` Igor Torrente
@ 2021-11-11  9:33       ` Pekka Paalanen
  2021-11-11 14:07         ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Pekka Paalanen @ 2021-11-11  9:33 UTC (permalink / raw)
  To: Igor Torrente
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel, kernel test robot

On Wed, 10 Nov 2021 13:56:54 -0300
Igor Torrente <igormtorrente@gmail.com> wrote:

> On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > Hi Igor,
> >
> > again, that is a really nice speed-up. Unfortunately, I find the code
> > rather messy and hard to follow. I hope my comments below help with
> > re-designing it to be easier to understand.
> >
> >
> > On Tue, 26 Oct 2021 08:34:06 -0300
> > Igor Torrente <igormtorrente@gmail.com> wrote:
> >  
> > > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > > as a color input.
> > >
> > > This patch refactors all the functions related to the plane composition
> > > to overcome this limitation.
> > >
> > > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > > containing two handlers.
> > >
> > > One will generate a buffer of each line of the frame with the pixels
> > > converted to ARGB16161616. And the other will take this line buffer,
> > > do some computation on it, and store the pixels in the destination.
> > >
> > > Both the handlers have the same signature. They receive a pointer to
> > > the pixels that will be processed(`pixels_addr`), the number of pixels
> > > that will be treated(`length`), and the intermediate buffer of the size
> > > of a frame line (`line_buffer`).
> > >
> > > The first function has been totally described previously.  
> >
> > What does this sentence mean?  
> 
> In the sentence "One will generate...", I give an overview of the two types of
> handlers. And the overview of the first handler describes the full behavior of
> it.
> 
> But it doesn't look clear enough, I will improve it in the future.
> 
> >  
> > >
> > > The second is more interesting, as it has to perform two roles depending
> > > on where it is called in the code.
> > >
> > > The first is to convert(if necessary) the data received in the
> > > `line_buffer` and write in the memory pointed by `pixels_addr`.
> > >
> > > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > > the result back to the `pixels_addr`.
> > >
> > > The per-line implementation was chosen for performance reasons.
> > > The per-pixel functions were having performance issues due to indirect
> > > function call overhead.
> > >
> > > The per-line code trades off memory for execution time. The `line_buffer`
> > > allows us to diminish the number of function calls.
> > >
> > > Results in the IGT test `kms_cursor_crc`:
> > >
> > > |                     Frametime                       |
> > > |:---------------:|:---------:|:----------:|:--------:|
> > > | implementation  |  Current  |  Per-pixel | Per-line |
> > > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> > >
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> > > ---
> > > V2: Improves the performance drastically, by performing the operations
> > >     per-line and not per-pixel(Pekka Paalanen).
> > >     Minor improvements(Pekka Paalanen).
> > > ---
> > >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> > >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> > >  2 files changed, 342 insertions(+), 134 deletions(-)
> > >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> > >
> > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > > index 383ca657ddf7..69fe3a89bdc9 100644
> > > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> > > @@ -9,18 +9,26 @@
> > >  #include <drm/drm_vblank.h>
> > >
> > >  #include "vkms_drv.h"
> > > -
> > > -static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> > > -                              const struct vkms_composer *composer)
> > > -{
> > > -     u32 pixel;
> > > -     int src_offset = composer->offset + (y * composer->pitch)
> > > -                                   + (x * composer->cpp);
> > > -
> > > -     pixel = *(u32 *)&buffer[src_offset];
> > > -
> > > -     return pixel;
> > > -}
> > > +#include "vkms_formats.h"
> > > +
> > > +#define get_output_vkms_composer(buffer_pointer, composer)           \
> > > +     ((struct vkms_composer) {                                       \
> > > +             .fb = &(struct drm_framebuffer) {                       \
> > > +                     .format = &(struct drm_format_info) {           \
> > > +                             .format = DRM_FORMAT_ARGB16161616,      \
> > > +                     },                                              \  
> >
> > Is that really how one can initialize a drm_format_info? Does that
> > struct not have a lot more fields? Shouldn't you call a function to
> > look up the proper struct with all fields populated?  
> 
> I did this macro to just fill the necessary fields, and add more of them
> as necessary.
> 
> I was implementing something very similar to the algorithm that
> you described below. So this macro will not exist in the next version.
> 
> >  
> > > +             },                                                      \
> > > +             .map[0].vaddr = (buffer_pointer),                       \
> > > +             .src = (composer)->src,                                 \
> > > +             .dst = (composer)->dst,                                 \
> > > +             .cpp = sizeof(u64),                                     \
> > > +             .pitch = drm_rect_width(&(composer)->dst) * sizeof(u64) \
> > > +     })  
> >
> > Why is this a macro rather than a function?  
> 
> I don't have a good answer for that. I'm just more used to these kinds of
> initializations using macro instead of function.
> 
> >  
> > > +
> > > +struct vkms_pixel_composition_functions {
> > > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);  
> >
> > I would be a little more comfortable if instead of u64 *line_buffer you
> > would have something like
> >
> > struct line_buffer {
> >         u16 *row;
> >         size_t nelem;
> > };
> >
> > so that the functions to be plugged into these function pointers could
> > assert that you do not accidentally overflow the array (which would
> > imply a code bug in kernel).
> >
> > One could perhaps go even for:
> >
> > struct line_pixel {
> >         u16 r, g, b, a;
> > };
> >
> > struct line_buffer {
> >         struct line_pixel *row;
> >         size_t npixels;
> > };  
> 
> If we decide to follow this representation, would it be possible
> to calculate the crc in the similar way that is being done currently?
> 
> Something like that:
> 
> crc = crc32_le(crc, line_buffer.row, w * sizeof(struct line_pixel));

Hi Igor,

yes. I think the CRC calculated does not need to be reproducible in
userspace, so you can very well compute it from the internal
intermediate representation. It also does not need to be portable
between architectures, AFAIU.

> I mean, If the compiler can decide to put a padding somewhere, it
> would mess with the crc value. Right?

Padding could mess it up, yes. However, I think in kernel it is a
convention to define structs (especially UAPI structs but this is not
one) such that there is no implicit padding. So there must be some
recommended practices on how to achieve and ensure that.

The size of struct line_pixel as defined above is 8 bytes which is a
"very round" number, and every field has the same type, so there won't
be gaps between fields either. So I think the struct should already be
fine and have no padding, but how to make sure it is, I'm not sure what
you would do in kernel land.

In userspace I would put a static assert to ensure that
sizeof(struct line_pixel) == 8. That would be enough, because sizeof
counts not just internal implicit padding but also the needed size
extension for alignment in an array of those. The accumulated size of
the fields individually is 8 bytes, so if the struct size is 8, there
cannot be padding.
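
A minimal sketch of that check, assuming plain C11 in userspace. The
struct follows the line_pixel layout suggested earlier in this thread;
uint16_t stands in for the kernel's u16. In kernel code the
static_assert() macro from <linux/build_bug.h> should be usable the
same way, but treat that as an assumption to verify rather than a
tested recipe.

```c
#include <assert.h>	/* provides the static_assert macro in C11 */
#include <stdint.h>

/* Internal intermediate pixel: four 16-bit channels, no DRM format code. */
struct line_pixel {
	uint16_t r, g, b, a;
};

/*
 * sizeof() includes any padding between fields and any tail padding
 * needed for arrays, so equality with the summed field sizes proves
 * there is none.
 */
static_assert(sizeof(struct line_pixel) == 4 * sizeof(uint16_t),
	      "struct line_pixel must not contain padding");
```

If the assertion ever fails, the build stops, so a CRC computed over an
array of these structs cannot silently pick up padding bytes.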

> >
> > Because as I mention further down, there is no need for the line buffer
> > to use an existing DRM pixel format at all.
> >  
> 
> All this is fine for me. I will change that to the next patch version.
> 
> > > +};
> > >
> > >  /**
> > >   * compute_crc - Compute CRC value on output frame
> > > @@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> > >   * returns CRC value computed using crc32 on the visible portion of
> > >   * the final framebuffer at vaddr_out
> > >   */
> > > -static uint32_t compute_crc(const u8 *vaddr,
> > > +static uint32_t compute_crc(const __le64 *vaddr,
> > >                           const struct vkms_composer *composer)
> > >  {
> > > -     int x, y;
> > > -     u32 crc = 0, pixel = 0;
> > > -     int x_src = composer->src.x1 >> 16;
> > > -     int y_src = composer->src.y1 >> 16;
> > > -     int h_src = drm_rect_height(&composer->src) >> 16;
> > > -     int w_src = drm_rect_width(&composer->src) >> 16;
> > > -
> > > -     for (y = y_src; y < y_src + h_src; ++y) {
> > > -             for (x = x_src; x < x_src + w_src; ++x) {
> > > -                     pixel = get_pixel_from_buffer(x, y, vaddr, composer);
> > > -                     crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
> > > -             }
> > > -     }
> > > +     int h = drm_rect_height(&composer->dst);
> > > +     int w = drm_rect_width(&composer->dst);
> > >
> > > -     return crc;
> > > +     return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
> > >  }
> > >
> > > -static u8 blend_channel(u8 src, u8 dst, u8 alpha)
> > > +static __le16 blend_channel(u16 src, u16 dst, u16 alpha)  
> >
> > This function is doing the OVER operation (Porter-Duff classification)
> > assuming pre-multiplied alpha. I think the function name should reflect
> > that. At the very least it should somehow note pre-multiplied alpha,
> > because KMS property "pixel blend mode" can change that.  
> 
> The closest that it has is a comment in the alpha_blend function.
> 
> But, aside from that, does `pre_mul_channel_blend` look good to you?

That would be fine, or just 'blend_premult'.

Later it could get two siblings, blend_none and blend_coverage, to
match "pixel blend mode" property.

> >
> > 'alpha' should be named 'src_alpha'.
> >  
> > >  {
> > > -     u32 pre_blend;
> > > -     u8 new_color;
> > > +     u64 pre_blend;  
> >
> > I'm not quite sure if u32 would suffice... max value for src is
> > 0xffff * src_alpha / 0xffff = src_alpha. Max value for dst is 0xffff.  
> 
> I didn't understand this division. What does the second 0xffff represent?

src_alpha is u16, so the divisor is the normalising factor.

Channel value and src_alpha are u16 which means they are essentially
0.16 fixed point format. If you multiply the two together as u16, the
result would be a 0.32 fixed point format in u32. To get back to 0.16
format, you divide by 0xffff.

Actually, this should be obvious, I was just overcomplicating it.

Since src is pre-multiplied, it follows that src <= src_alpha. If you
think in real numbers [0.0, 1.0], it should be easy to see. If
src > src_alpha, then it would mean that the original straight color value
was out of range (greater than 1.0).
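
The bound can be sketched in code as follows. This is an illustrative
userspace version, not the patch's function: the name blend_premult is
the one proposed above, uint16_t/uint32_t stand in for u16/u32, and the
rounding mirrors DIV_ROUND_UP from the patch. It relies on the
pre-multiplied invariant src <= src_alpha, which the assert makes
explicit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pre-multiplied OVER for a single 16-bit channel.
 *
 * With src <= src_alpha the sum is at most
 *   src_alpha * 0xffff + 0xffff * (0xffff - src_alpha) = 0xffff * 0xffff
 * which is 0xfffe0001, so it fits in 32 bits even after adding the
 * rounding term 0xfffe.
 */
static uint16_t blend_premult(uint16_t src, uint16_t dst, uint16_t src_alpha)
{
	uint32_t sum;

	assert(src <= src_alpha);	/* invariant of pre-multiplied input */

	sum = (uint32_t)src * 0xffffu + (uint32_t)dst * (0xffffu - src_alpha);

	/* Divide by the normalising factor, rounding up. */
	return (uint16_t)((sum + 0xfffeu) / 0xffffu);
}
```

A fully opaque source replaces the destination and a fully transparent
source leaves it unchanged, which is a quick sanity check for the
formula.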

> 
> >
> > So we have at max
> >
> > src_alpha * 0xffff + 0xffff * (0xffff - src_alpha)
> >
> > Each multiplication independently will fit in u32.
> >
> > Rearranging we get
> >
> > src_alpha * 0xffff + 0xffff * 0xffff - 0xffff * src_alpha
> >
> > which equals
> >
> > 0xffff * 0xffff
> >
> > which fits in u32 and does not depend on src_alpha.
> >
> > So unless I made a mistake, looks like u32 should be enough. On 32-bit
> > CPUs it should have speed benefits compared to u64.
> >  
> > > +     u16 new_color;
> > >
> > > -     pre_blend = (src * 255 + dst * (255 - alpha));
> > > +     pre_blend = (src * 0xffff + dst * (0xffff - alpha));  
> >
> > 'pre_blend' means "before blending" so maybe a better name here as the
> > blending is already done.
> >  
> 
> I don't have a good name right now, but I will think of something.
> 
> > >
> > > -     /* Faster div by 255 */
> > > -     new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
> > > +     new_color = DIV_ROUND_UP(pre_blend, 0xffff);
> > >
> > > -     return new_color;
> > > +     return cpu_to_le16(new_color);  
> >
> > What's the thing with cpu_to_le16 here?
> >
> > I think the temporary line buffers could just be using the cpu-native
> > u16 type. There is no DRM format code for that, but we don't need one
> > either. This format is not for interoperation with anything else, it's
> > just internal here, and the main goals with it are precision and speed.
> >
> > As such, the temporary line buffers could be simply u16 arrays, so you
> > don't need to consider the channel packing into a u64.
> >  
> 
> Wouldn't this cause a problem when calculating the CRC on BE machines?

I don't think so, because userspace cannot expect CRC values to be
portable between machines, drivers or display chips.

https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#display-crc-support

Thanks to Simon Ser for finding that piece of doc.

> > >  }
> > >  
> >
> >
> >
> > From here on, I will be removing the diff minus lines from the quoted
> > code, because these functions are completely new.
> >  
> > >  /**
> > >   * alpha_blend - alpha blending equation  
> >
> > This is specifically the pre-multiplied alpha blending, so reflect that
> > in the function name.
> >  
> 
> OK, I will use `pre_mul_alpha_blend`. Or something similar.
> 
> > > + * @src_composer: source framebuffer's metadata
> > > + * @dst_composer: destination framebuffer's metadata
> > > + * @y: The y coordinate (height) of the line that will be processed
> > > + * @line_buffer: The line with the pixels from src_compositor
> > >   *
> > >   * blend pixels using premultiplied blend formula. The current DRM assumption
> > >   * is that pixel color values have been already pre-multiplied with the alpha
> > >   * channel values. See more drm_plane_create_blend_mode_property(). Also, this
> > >   * formula assumes a completely opaque background.
> > > + *
> > > + * For performance reasons this function also fetches the pixels from the
> > > + * destination of the frame line y.
> > > + * We use the information that one of the source pixels are in the output
> > > + * buffer to fetch it here instead of separate function. And because the
> > > + * output format is ARGB16161616, we know that they don't need to be
> > > + * converted.
> > > + * This saves us an indirect function call for each line.
> >
> > I think this paragraph should be obvious from the type of 'line_buffer'
> > parameter and that you are blending src into dst.
> >  
> > >   */
> > > +static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer)
> > >  {
> > > +     __le16 *output_pixel = pixels_addr;  
> >
> > Aren't you supposed to be writing into line_buffer, not into src?
> >
> > There is something very strange with the logic here.
> >
> > In fact, the function signature of the blending function is unexpected.
> > A blending function should operate on two line_buffers, not what looks
> > like arbitrary buffer pixels.
> >
> > I think you should forget the old code and design these from scratch.
> > You would have three different kinds of functions:
> >
> > - loading: fetch a row from an image and convert into a line buffer
> > - blending: take two line buffers and blend them into one of the line
> >   buffers
> > - storing: convert a line buffer and write it into an image row
> >
> > I would not coerce these three different operations into less than
> > three function pointer types.
> >
> > To actually run a blending operation between source and destination
> > images, you would need four function pointers:
> > - loader for source (by pixel format)
> > - loader for destination (by pixel format)
> > - blender (by chosen blending operation)
> > - storing for destination (by pixel format)
> >
> > Function parameter types should make it obvious whether something is an
> > image or row in arbitrary format, or a line buffer in the special
> > internal format.
> >
> > Then the algorithm would work roughly like this:
> >
> > for each plane:
> >         for each row:
> >                 load source into lb1
> >                 load destination into lb2
> >                 blend lb1 into lb2
> >                 store lb2 into destination
> >
> > This is not optimal, you see how destination is repeatedly loaded and
> > stored for each plane. So you could swap the loops:
> >
> > allocate lb1, lb2 with destination width
> > for each destination row:
> >         load destination into lb2
> >
> >         for each plane:
> >                 load source into lb1
> >                 blend lb1 into lb2
> >
> >         store lb2 into destination  
> 
> I'm doing something very similar right now, based on comments from the
> previous emails. It looks very similar to your pseudocode.
> 
> And this solves several weirdnesses of my code that you commented
> throughout this review.
> 
> But I made a decision that I would like to hear your thoughts about it.
> 
> Using your variables, instead of storing the lb2 in the destination,
> I'm using it to calculate the CRC in the middle of the compositing loop.
> And if necessary, storing/converting the lb2 into the wb buffer.
> 
> So the pseudocode looks like that:
> 
> allocate lb1, lb2 with destination width
> for each destination row:
>         load destination into lb2
> 
>         for each plane:
>                 load source into lb1
>                 blend lb1 into lb2
> 
>         compute crc of lb2
> 
>         if wb pending
>                 convert and store lb2 to wb buffer
> 
> return crc
> 
> With that we avoid the allocation of the full image buffer.

Yes, exactly. Sounds good.
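
To make the agreed loop concrete, here is a rough userspace sketch of
it. Everything in it is an assumption made for illustration: struct
plane is a hypothetical stand-in for the per-plane metadata with pixels
already in the internal format, crc32_update() is a simple stand-in for
the kernel's crc32_le(), and lb2 is cleared to opaque black instead of
being loaded from a destination image. It also includes the per-row
overlap check discussed further down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct line_pixel {
	uint16_t r, g, b, a;
};

/* Hypothetical plane: pre-multiplied pixels in the internal format. */
struct plane {
	const struct line_pixel *pixels;
	int x, y;		/* top-left corner on the destination */
	int width, height;
};

/* Stand-in for the kernel's crc32_le(): a plain bitwise CRC-32 update. */
static uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xedb88320u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Pre-multiplied OVER for one channel; requires src <= src_alpha. */
static uint16_t blend_premult(uint16_t src, uint16_t dst, uint16_t src_alpha)
{
	uint32_t sum = (uint32_t)src * 0xffffu +
		       (uint32_t)dst * (0xffffu - src_alpha);

	return (uint16_t)((sum + 0xfffeu) / 0xffffu);
}

/* lb1 and lb2 must each hold dst_width pixels. Returns the frame CRC. */
static uint32_t compose(const struct plane *planes, size_t nplanes,
			struct line_pixel *lb1, struct line_pixel *lb2,
			int dst_width, int dst_height)
{
	uint32_t crc = 0;

	assert(lb1 && lb2 && dst_width > 0);

	for (int y = 0; y < dst_height; y++) {
		/* "clear lb2": old destination contents are discarded */
		for (int x = 0; x < dst_width; x++)
			lb2[x] = (struct line_pixel){ .a = 0xffff };

		for (size_t p = 0; p < nplanes; p++) {
			const struct plane *pl = &planes[p];
			int row = y - pl->y;
			int x0 = pl->x < 0 ? 0 : pl->x;
			int x1 = pl->x + pl->width;

			if (x1 > dst_width)
				x1 = dst_width;
			if (row < 0 || row >= pl->height || x0 >= x1)
				continue;	/* plane does not touch this row */

			/* "load source into lb1": only the visible span */
			memcpy(lb1,
			       pl->pixels + (size_t)row * pl->width + (x0 - pl->x),
			       (size_t)(x1 - x0) * sizeof(*lb1));

			/* "blend lb1 into lb2", starting at offset x0 */
			for (int i = 0; i < x1 - x0; i++) {
				struct line_pixel *d = &lb2[x0 + i];
				const struct line_pixel *s = &lb1[i];

				d->r = blend_premult(s->r, d->r, s->a);
				d->g = blend_premult(s->g, d->g, s->a);
				d->b = blend_premult(s->b, d->b, s->a);
				d->a = 0xffff;
			}
		}

		crc = crc32_update(crc, lb2, (size_t)dst_width * sizeof(*lb2));
		/* a pending writeback job would convert and store lb2 here */
	}
	return crc;
}
```

Only two rows of pixels are ever allocated, which is the point of
computing the CRC inside the row loop rather than from a full frame.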


> >
> > Inside the loop over plane, you need to check if the plane overlaps the
> > current destination row at all. If not, continue on the next plane. If
> > yes, load source into lb1 and compute the offset into lb2 where it
> > needs to be blended.  
> 
> Thanks for this tip, this is an optimization that, currently, my code doesn't
> have.
> 
> >
> > Since we don't support scaling yet, lb1 length will never exceed
> > destination width, because there is no need to load plane buffer pixels
> > we would not be writing out.
> >
> > Also "load destination into lb2" could be replaced with just "clear
> > lb2" if the old destination contents are to be discarded. Then you also
> > don't need the function pointer for "loader for destination".
> >
> > I think you already had all these ideas, just the execution in code got
> > really messy somehow.
> >  
> > > +     int i;
> > >
> > > +     for (i = 0; i < length; i++) {
> > > +             u16 src1_a = line_buffer[i] >> 48;
> > > +             u16 src1_r = (line_buffer[i] >> 32) & 0xffff;
> > > +             u16 src1_g = (line_buffer[i] >> 16) & 0xffff;
> > > +             u16 src1_b = line_buffer[i] & 0xffff;  
> >
> > If you used native u16 array for line buffers, all this arithmetic
> > would be unnecessary.
> >  
> > >
> > > +             u16 src2_r = le16_to_cpu(output_pixel[2]);
> > > +             u16 src2_g = le16_to_cpu(output_pixel[1]);
> > > +             u16 src2_b = le16_to_cpu(output_pixel[0]);
> > > +
> > > +             output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
> > > +             output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
> > > +             output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
> > > +             output_pixel[3] = 0xffff;
> > > +
> > > +             output_pixel += 4;
> > > +     }
> > >  }
> > >
> > >  /**
> > >   * @src_composer: source framebuffer's metadata
> > > + * @dst_composer: destination framebuffer's metadata
> > > + * @funcs: A struct containing all the composition functions(get_src_line,
> > > + *         and set_output_pixel)
> > > + * @line_buffer: The line with the pixels from src_compositor
> > >   *
> > > + * Using the pixel_blend function passed as parameter, this function blends
> > > + * all pixels from src plane into a output buffer (with a blend function
> > > + * passed as parameter).
> > > + * Information of the output buffer is in the dst_composer parameter
> > > + * and the source plane in the src_composer.
> > > + * The get_src_line will use the src_composer to get the respective line,
> > > + * convert, and return it as ARGB_16161616.
> > > + * And finally, the blend function will receive the dst_composer, dst_composer,
> > > + * the line y coordinate, and the line buffer. Blend all pixels, and store the
> > > + * result in the output.
> > >   *
> > >   * TODO: completely clear the primary plane (a = 0xff) before starting to blend
> > >   * pixel color values
> > >   */
> > > +static void blend(struct vkms_composer *src_composer,
> > >                 struct vkms_composer *dst_composer,
> > > +               struct vkms_pixel_composition_functions *funcs,
> > > +               u64 *line_buffer)
> > >  {
> > > +     int i, i_dst;
> > >
> > >       int x_src = src_composer->src.x1 >> 16;
> > >       int y_src = src_composer->src.y1 >> 16;
> > >
> > >       int x_dst = src_composer->dst.x1;
> > >       int y_dst = src_composer->dst.y1;
> > > +
> > >       int h_dst = drm_rect_height(&src_composer->dst);
> > > +     int length = drm_rect_width(&src_composer->dst);
> > >
> > >       int y_limit = y_src + h_dst;
> > > +
> > > +     u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src);
> > > +     u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst);
> > > +
> > > +     int src_next_line_offset = src_composer->pitch;
> > > +     int dst_next_line_offset = dst_composer->pitch;
> > > +
> > > +     for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
> > > +             funcs->get_src_line(src_pixels, length, line_buffer);
> > > +             funcs->set_output_line(dst_pixels, length, line_buffer);
> > > +             src_pixels += src_next_line_offset;
> > > +             dst_pixels += dst_next_line_offset;
> > >       }
> > >  }
> > >
> > > +static void ((*get_line_fmt_transform_function(u32 format))
> > > +         (void *pixels_addr, int length, u64 *line_buffer))
> > >  {
> > > +     if (format == DRM_FORMAT_ARGB8888)
> > > +             return &ARGB8888_to_ARGB16161616;
> > > +     else if (format == DRM_FORMAT_ARGB16161616)
> > > +             return &get_ARGB16161616;
> > > +     else
> > > +             return &XRGB8888_to_ARGB16161616;
> > > +}
> > >
> > > +static void ((*get_output_line_function(u32 format))
> > > +          (void *pixels_addr, int length, u64 *line_buffer))
> > > +{
> > > +     if (format == DRM_FORMAT_ARGB8888)
> > > +             return &convert_to_ARGB8888;
> > > +     else if (format == DRM_FORMAT_ARGB16161616)
> > > +             return &convert_to_ARGB16161616;
> > > +     else
> > > +             return &convert_to_XRGB8888;
> > > +}
> > >
> > > +static void compose_plane(struct vkms_composer *src_composer,
> > > +                       struct vkms_composer *dst_composer,  
> >
> > I'm confused by the vkms_composer concept. If there is a separate thing
> > for source and destination and they are used together, then I don't
> > think that thing is a "composer" but some kind of... image structure?  
> 
> I didn't create this struct, but I think this is exactly what it represents.
> 
> > "Composer" is what compose_active_planes() does.  
> 
> Do you think this struct needs a rename?

In the long run, yes.

> >  
> > > +                       struct vkms_pixel_composition_functions *funcs,
> > > +                       u64 *line_buffer)
> > > +{
> > > +     u32 src_format = src_composer->fb->format->format;
> > >
> > > +     funcs->get_src_line = get_line_fmt_transform_function(src_format);
> > >
> > > +     blend(src_composer, dst_composer, funcs, line_buffer);  
> >
> > This function is confusing. You get 'funcs' as argument, but you
> > overwrite one field and then trust that the other field was already set
> > by the caller. The policy of how 'funcs' argument here works is too
> > complicated to me.
> >
> > If you need just one function pointer as argument, then do exactly
> > that, and construct the vfunc struct inside this function.  
> 
> I think this will be totally solved with the code redesign.

I think so too.

...

> > > diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
> > > new file mode 100644
> > > index 000000000000..5b850fce69f3
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/vkms/vkms_formats.h
> > > @@ -0,0 +1,155 @@
> > > +/* SPDX-License-Identifier: GPL-2.0+ */
> > > +
> > > +#ifndef _VKMS_FORMATS_H_
> > > +#define _VKMS_FORMATS_H_
> > > +
> > > +#include <drm/drm_rect.h>
> > > +
> > > +#define pixel_offset(composer, x, y) \
> > > +     ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))  
> >
> > Why macro instead of a static inline function?  
> 
> Again, I don't have a good answer for that :(

I would recommend always using a static inline function when possible,
and macros only when an inline function cannot work. The reason is that
an inline function has types in its signature so you get some type
safety, and it cannot accidentally mess up other variables in the call
sites. A function also cannot "secretly" use variables from the call
site like a macro can, so the reader can be sure that the function call
will not access anything not listed in the parameters.
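
As an illustration, the pixel_offset() macro quoted above could become
an inline function along these lines. This is a sketch only:
struct buffer_layout is a hypothetical stand-in that carries just the
three fields the macro reads, not the real vkms_composer.

```c
#include <stddef.h>

/* Hypothetical subset of the buffer metadata used by the macro. */
struct buffer_layout {
	size_t offset;	/* byte offset of the first pixel */
	size_t pitch;	/* bytes per row */
	size_t cpp;	/* bytes per pixel */
};

/*
 * Typed replacement for the pixel_offset() macro: the arguments are
 * evaluated exactly once, and nothing outside the parameter list can
 * be touched.
 */
static inline size_t pixel_offset(const struct buffer_layout *layout,
				  size_t x, size_t y)
{
	return layout->offset + y * layout->pitch + x * layout->cpp;
}
```

Passing a pointer of the wrong type or a negative coordinate now has a
chance of producing a compiler diagnostic, which the macro could not
offer.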


> > > +
> > > +/*
> > > + * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
> > > + *
> > > + * @composer: Buffer metadata
> > > + * @x: The x(width) coordinate of the 2D buffer
> > > + * @y: The y(Height) coordinate of the 2D buffer
> > > + *
> > > + * Takes the information stored in the composer, a pair of coordinates, and
> > > + * returns the address of the first color channel.
> > > + * This function assumes the channels are packed together, i.e. a color channel
> > > + * comes immediately after another. And therefore, this function doesn't work
> > > + * for YUV with chroma subsampling (e.g. YUV420 and NV21).
> > > + */
> > > +static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)  
> >
> > Is it normal in the kernel to have non-inline functions in headers?
> >
> > Actually this file does not look like a header at all, it should
> > probably be a .c file and not #included.  
> 
> Oops. This should not be that way. I will fix it.

While you do that, I wonder if it makes sense to put the functions like
get_line_fmt_transform_function() in this file as well, so you only
need to expose the getters, and the implementations can remain static
functions.
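
A rough sketch of that layout, written as a single userspace file.
The names here are assumptions for illustration: enum example_format
stands in for the DRM fourcc codes, line_loader_t and get_line_loader()
are hypothetical, and the x257 widening from 8 to 16 bits is one common
choice rather than something taken from the patch. The line_buffer
struct is the one suggested earlier, so the loaders can assert on its
size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct line_pixel {
	uint16_t r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

/* One type for all "loading" functions: image row in, line buffer out. */
typedef void (*line_loader_t)(const uint8_t *src_row, size_t npixels,
			      struct line_buffer *lb);

/* Hypothetical format codes standing in for the DRM fourcc values. */
enum example_format { EXAMPLE_ARGB8888, EXAMPLE_XRGB8888 };

static void load_argb8888(const uint8_t *src_row, size_t npixels,
			  struct line_buffer *lb)
{
	assert(npixels <= lb->npixels);	/* overflow would be a code bug */

	for (size_t i = 0; i < npixels; i++, src_row += 4) {
		/* little-endian byte order B, G, R, A; x257 maps 0xff to 0xffff */
		lb->row[i].b = (uint16_t)(src_row[0] * 257);
		lb->row[i].g = (uint16_t)(src_row[1] * 257);
		lb->row[i].r = (uint16_t)(src_row[2] * 257);
		lb->row[i].a = (uint16_t)(src_row[3] * 257);
	}
}

static void load_xrgb8888(const uint8_t *src_row, size_t npixels,
			  struct line_buffer *lb)
{
	assert(npixels <= lb->npixels);

	for (size_t i = 0; i < npixels; i++, src_row += 4) {
		lb->row[i].b = (uint16_t)(src_row[0] * 257);
		lb->row[i].g = (uint16_t)(src_row[1] * 257);
		lb->row[i].r = (uint16_t)(src_row[2] * 257);
		lb->row[i].a = 0xffff;	/* X byte is ignored: opaque */
	}
}

/* The only symbol a header would need to declare. */
line_loader_t get_line_loader(enum example_format format)
{
	switch (format) {
	case EXAMPLE_ARGB8888:
		return load_argb8888;
	case EXAMPLE_XRGB8888:
		return load_xrgb8888;
	}
	return NULL;	/* unsupported format */
}
```

The typedef also avoids spelling out a function that returns a function
pointer, which is hard to read in the getter declarations of the patch.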


Thanks,
pq


* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-11-11  9:33       ` Pekka Paalanen
@ 2021-11-11 14:07         ` Igor Torrente
  2021-11-11 14:37           ` Pekka Paalanen
  0 siblings, 1 reply; 28+ messages in thread
From: Igor Torrente @ 2021-11-11 14:07 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel, kernel test robot

Hi Pekka,

On Thu, Nov 11, 2021 at 6:33 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Wed, 10 Nov 2021 13:56:54 -0300
> Igor Torrente <igormtorrente@gmail.com> wrote:
>
> > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > Hi Igor,
> > >
> > > again, that is a really nice speed-up. Unfortunately, I find the code
> > > rather messy and hard to follow. I hope my comments below help with
> > > re-designing it to be easier to understand.
> > >
> > >
> > > On Tue, 26 Oct 2021 08:34:06 -0300
> > > Igor Torrente <igormtorrente@gmail.com> wrote:
> > >
> > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > > > as a color input.
> > > >
> > > > This patch refactors all the functions related to the plane composition
> > > > to overcome this limitation.
> > > >
> > > > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > > > containing two handlers.
> > > >
> > > > One will generate a buffer of each line of the frame with the pixels
> > > > converted to ARGB16161616. And the other will take this line buffer,
> > > > do some computation on it, and store the pixels in the destination.
> > > >
> > > > Both the handlers have the same signature. They receive a pointer to
> > > > the pixels that will be processed(`pixels_addr`), the number of pixels
> > > > that will be treated(`length`), and the intermediate buffer of the size
> > > > of a frame line (`line_buffer`).
> > > >
> > > > The first function has been totally described previously.
> > >
> > > What does this sentence mean?
> >
> > In the sentence "One will generate...", I give an overview of the two types of
> > handlers. And the overview of the first handler describes the full behavior of
> > it.
> >
> > But it doesn't look clear enough, I will improve it in the future.
> >
> > >
> > > >
> > > > The second is more interesting, as it has to perform two roles depending
> > > > on where it is called in the code.
> > > >
> > > > The first is to convert(if necessary) the data received in the
> > > > `line_buffer` and write in the memory pointed by `pixels_addr`.
> > > >
> > > > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > > > the result back to the `pixels_addr`.
> > > >
> > > > The per-line implementation was chosen for performance reasons.
> > > > The per-pixel functions were having performance issues due to indirect
> > > > function call overhead.
> > > >
> > > > The per-line code trades off memory for execution time. The `line_buffer`
> > > > allows us to diminish the number of function calls.
> > > >
> > > > Results in the IGT test `kms_cursor_crc`:
> > > >
> > > > |                     Frametime                       |
> > > > |:---------------:|:---------:|:----------:|:--------:|
> > > > > | implementation  |  Current  |  Per-pixel | Per-line |
> > > > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > > > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> > > > ---
> > > > > V2: Improves the performance drastically, by performing the operations
> > > >     per-line and not per-pixel(Pekka Paalanen).
> > > >     Minor improvements(Pekka Paalanen).
> > > > ---
> > > >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> > > >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> > > >  2 files changed, 342 insertions(+), 134 deletions(-)
> > > >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> > > >
> > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > > > index 383ca657ddf7..69fe3a89bdc9 100644
> > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c
> > > > @@ -9,18 +9,26 @@
> > > >  #include <drm/drm_vblank.h>
> > > >
> > > >  #include "vkms_drv.h"
> > > > -
> > > > -static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> > > > -                              const struct vkms_composer *composer)
> > > > -{
> > > > -     u32 pixel;
> > > > -     int src_offset = composer->offset + (y * composer->pitch)
> > > > -                                   + (x * composer->cpp);
> > > > -
> > > > -     pixel = *(u32 *)&buffer[src_offset];
> > > > -
> > > > -     return pixel;
> > > > -}
> > > > +#include "vkms_formats.h"
> > > > +
> > > > +#define get_output_vkms_composer(buffer_pointer, composer)           \
> > > > +     ((struct vkms_composer) {                                       \
> > > > +             .fb = &(struct drm_framebuffer) {                       \
> > > > +                     .format = &(struct drm_format_info) {           \
> > > > +                             .format = DRM_FORMAT_ARGB16161616,      \
> > > > +                     },                                              \
> > >
> > > Is that really how one can initialize a drm_format_info? Does that
> > > struct not have a lot more fields? Shouldn't you call a function to
> > > look up the proper struct with all fields populated?
> >
> > I did this macro to just fill the necessary fields, and add more of them
> > as necessary.
> >
> > I was implementing something very similar to the algorithm that
> > you described below. So this macro will not exist in the next version.
> >
> > >
> > > > +             },                                                      \
> > > > +             .map[0].vaddr = (buffer_pointer),                       \
> > > > +             .src = (composer)->src,                                 \
> > > > +             .dst = (composer)->dst,                                 \
> > > > +             .cpp = sizeof(u64),                                     \
> > > > +             .pitch = drm_rect_width(&(composer)->dst) * sizeof(u64) \
> > > > +     })
> > >
> > > Why is this a macro rather than a function?
> >
> > I don't have a good answer for that. I'm just more used to these kinds of
> > initializations using macro instead of function.
> >
> > >
> > > > +
> > > > +struct vkms_pixel_composition_functions {
> > > > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);
> > >
> > > I would be a little more comfortable if instead of u64 *line_buffer you
> > > would have something like
> > >
> > > struct line_buffer {
> > >         u16 *row;
> > >         size_t nelem;
> > > }
> > >
> > > so that the functions to be plugged into these function pointers could
> > > assert that you do not accidentally overflow the array (which would
> > > imply a code bug in kernel).
> > >
> > > One could perhaps go even for:
> > >
> > > struct line_pixel {
> > >         u16 r, g, b, a;
> > > };
> > >
> > > struct line_buffer {
> > >         struct line_pixel *row;
> > >         size_t npixels;
> > > };
> >
> > If we decide to follow this representation, would it be possible
> > to calculate the crc in the similar way that is being done currently?
> >
> > Something like that:
> >
> > > crc = crc32_le(crc, line_buffer.row, w * sizeof(struct line_pixel));
>
> Hi Igor,
>
> yes. I think the CRC calculated does not need to be reproducible in
> userspace, so you can very well compute it from the internal
> intermediate representation. It also does not need to be portable
> between architectures, AFAIU.

Great! This will make things easier.
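
A minimal userspace sketch of what computing the CRC from the internal
representation could look like, one row at a time. The function
crc32_le_sketch() is a hypothetical stand-in written here for illustration;
it is assumed to behave like the kernel's crc32_le() (reflected polynomial
0xedb88320, no implicit inversion of the seed or result). The helper
crc_add_row() is likewise an illustrative name, not something from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Internal pixel: four native-endian 16-bit channels, as suggested above. */
struct line_pixel {
	uint16_t r, g, b, a;
};

/*
 * Bitwise CRC-32 over a byte range. Hypothetical stand-in for the kernel's
 * crc32_le(); the caller supplies the running CRC value.
 */
static uint32_t crc32_le_sketch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0u);
	}
	return crc;
}

/* Fold one row of the line buffer into the running CRC. */
static uint32_t crc_add_row(uint32_t crc, const struct line_pixel *row,
			    size_t npixels)
{
	return crc32_le_sketch(crc, row, npixels * sizeof(*row));
}
```

Because the CRC is chained through the running value, feeding the rows one
by one gives the same result as a single pass over a full-frame buffer.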

>
> > I mean, If the compiler can decide to put a padding somewhere, it
> > would mess with the crc value. Right?
>
> Padding could mess it up, yes. However, I think in kernel it is a
> convention to define structs (especially UAPI structs but this is not
> one) such that there is no implicit padding. So there must be some
> recommended practices on how to achieve and ensure that.
>
> The size of struct line_pixel as defined above is 8 bytes which is a
> "very round" number, and every field has the same type, so there won't
> be gaps between fields either. So I think the struct should already be
> fine and have no padding, but how to make sure it is, I'm not sure what
> you would do in kernel land.
>
> In userspace I would put a static assert to ensure that
> sizeof(struct line_pixel) == 8. That would be enough, because sizeof
> counts not just internal implicit padding but also the needed size
> extension for alignment in an array of those. The accumulated size of
> the fields individually is 8 bytes, so if the struct size is 8, there
> cannot be padding.
>

The kernel apparently wraps a compiler extension in a macro to do this
kind of struct packing. It is defined in
include/linux/compiler_attributes.h:

#define __packed                        __attribute__((__packed__))
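
For reference, the size check Pekka describes can be sketched in plain C11
as below. This is a userspace illustration only: it uses _Static_assert and
<stdint.h> types, whereas kernel code would more likely use static_assert()
or BUILD_BUG_ON() from <linux/build_bug.h>. The helper line_bytes() is a
hypothetical name added for the example. With four fields of the same
16-bit type, the check is expected to hold even without __packed.

```c
#include <stddef.h>
#include <stdint.h>

/* Internal line-buffer pixel: four native-endian 16-bit channels. */
struct line_pixel {
	uint16_t r, g, b, a;
};

/*
 * sizeof() includes both padding between fields and the tail padding
 * needed for arrays, so matching the sum of the field sizes proves that
 * the struct contains no padding at all.
 */
_Static_assert(sizeof(struct line_pixel) == 4 * sizeof(uint16_t),
	       "struct line_pixel must not contain padding");

/* Number of bytes to feed to the CRC for a row of npixels pixels. */
static inline size_t line_bytes(size_t npixels)
{
	return npixels * sizeof(struct line_pixel);
}
```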

> > >
> > > Because as I mention further down, there is no need for the line buffer
> > > to use an existing DRM pixel format at all.
> > >
> >
> > All this is fine for me. I will change that to the next patch version.
> >
> > > > +};
> > > >
> > > >  /**
> > > >   * compute_crc - Compute CRC value on output frame
> > > > @@ -31,179 +39,222 @@ static u32 get_pixel_from_buffer(int x, int y, const u8 *buffer,
> > > >   * returns CRC value computed using crc32 on the visible portion of
> > > >   * the final framebuffer at vaddr_out
> > > >   */
> > > > -static uint32_t compute_crc(const u8 *vaddr,
> > > > +static uint32_t compute_crc(const __le64 *vaddr,
> > > >                           const struct vkms_composer *composer)
> > > >  {
> > > > -     int x, y;
> > > > -     u32 crc = 0, pixel = 0;
> > > > -     int x_src = composer->src.x1 >> 16;
> > > > -     int y_src = composer->src.y1 >> 16;
> > > > -     int h_src = drm_rect_height(&composer->src) >> 16;
> > > > -     int w_src = drm_rect_width(&composer->src) >> 16;
> > > > -
> > > > -     for (y = y_src; y < y_src + h_src; ++y) {
> > > > -             for (x = x_src; x < x_src + w_src; ++x) {
> > > > -                     pixel = get_pixel_from_buffer(x, y, vaddr, composer);
> > > > -                     crc = crc32_le(crc, (void *)&pixel, sizeof(u32));
> > > > -             }
> > > > -     }
> > > > +     int h = drm_rect_height(&composer->dst);
> > > > +     int w = drm_rect_width(&composer->dst);
> > > >
> > > > -     return crc;
> > > > +     return crc32_le(0, (void *)vaddr, w * h * sizeof(u64));
> > > >  }
> > > >
> > > > -static u8 blend_channel(u8 src, u8 dst, u8 alpha)
> > > > +static __le16 blend_channel(u16 src, u16 dst, u16 alpha)
> > >
> > > This function is doing the OVER operation (Porter-Duff classification)
> > > assuming pre-multiplied alpha. I think the function name should reflect
> > > that. At the very least it should somehow note pre-multiplied alpha,
> > > because KMS property "pixel blend mode" can change that.
> >
> > The closest that it has is a comment in the alpha_blend function.
> >
> > > But, aside from that, does `pre_mul_channel_blend` look good to you?
>
> That would be fine, or just 'blend_premult'.
>
> Later it could get two siblings, blend_none and blend_coverage, to
> match "pixel blend mode" property.

OK.

>
> > >
> > > 'alpha' should be named 'src_alpha'.
> > >
> > > >  {
> > > > -     u32 pre_blend;
> > > > -     u8 new_color;
> > > > +     u64 pre_blend;
> > >
> > > I'm not quite sure if u32 would suffice... max value for src is
> > > 0xffff * src_alpha / 0xffff = src_alpha. Max value for dst is 0xffff.
> >
> > I didn't understand this division. What does the second 0xffff represent?
>
> src_alpha is u16, so the divisor is the normalising factor.
>
> Channel value and src_alpha are u16 which means they are essentially
> 0.16 fixed point format. If you multiply the two together as u16, the
> result would be a 0.32 fixed point format in u32. To get back to 0.16
> format, you divide by 0xffff.
>
> Actually, this should be obvious, I was just overcomplicating it.
>
> Since src is pre-multiplied, it follows that src <= src_alpha. If you
> think in real numbers [0.0, 1.0], it should be easy to see. If
> src > src_alpha, then it would mean that the original straight color value
> was out of range (greater than 1.0).
>

Ohh. Got it.
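
A small sketch of the arithmetic discussed above, assuming valid
pre-multiplied input (src <= src_alpha). The name blend_premult follows the
suggestion earlier in this thread; the explicit casts are an addition of
this example. Without them, the u16 operands would be promoted to int and
the product 0xffff * 0xffff would not fit in a signed 32-bit int.

```c
#include <stdint.h>

/*
 * Porter-Duff OVER for one channel with pre-multiplied alpha:
 *   out = src + dst * (1 - src_alpha)
 * in 0.16 fixed point. For src <= src_alpha the sum is at most
 * 0xffff * 0xffff, so a 32-bit intermediate is enough.
 */
static uint16_t blend_premult(uint16_t src, uint16_t dst, uint16_t src_alpha)
{
	/* Cast first so the multiplications happen in unsigned 32-bit. */
	uint32_t sum = (uint32_t)src * 0xffffu +
		       (uint32_t)dst * (0xffffu - src_alpha);

	/* Divide by 0xffff, rounding up like DIV_ROUND_UP() in the patch. */
	return (uint16_t)((sum + 0xfffeu) / 0xffffu);
}
```

If src > src_alpha (invalid pre-multiplied data), the sum can wrap around in
32 bits, so a caller that cannot trust its input would need to clamp first.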

> >
> > >
> > > So we have at max
> > >
> > > src_alpha * 0xffff + 0xffff * (0xffff - src_alpha)
> > >
> > > Each multiplication independently will fit in u32.
> > >
> > > Rearranging we get
> > >
> > > src_alpha * 0xffff + 0xffff * 0xffff - 0xffff * src_alpha
> > >
> > > which equals
> > >
> > > 0xffff * 0xffff
> > >
> > > which fits in u32 and does not depend on src_alpha.
> > >
> > > So unless I made a mistake, looks like u32 should be enough. On 32-bit
> > > CPUs it should have speed benefits compared to u64.
> > >
> > > > +     u16 new_color;
> > > >
> > > > -     pre_blend = (src * 255 + dst * (255 - alpha));
> > > > +     pre_blend = (src * 0xffff + dst * (0xffff - alpha));
> > >
> > > 'pre_blend' means "before blending" so maybe a better name here as the
> > > blending is already done.
> > >
> >
> > I don't have a good name right now, but I will think of something.
> >
> > > >
> > > > -     /* Faster div by 255 */
> > > > -     new_color = ((pre_blend + ((pre_blend + 257) >> 8)) >> 8);
> > > > +     new_color = DIV_ROUND_UP(pre_blend, 0xffff);
> > > >
> > > > -     return new_color;
> > > > +     return cpu_to_le16(new_color);
> > >
> > > What's the thing with cpu_to_le16 here?
> > >
> > > I think the temporary line buffers could just be using the cpu-native
> > > u16 type. There is no DRM format code for that, but we don't need one
> > > either. This format is not for interoperation with anything else, it's
> > > just internal here, and the main goals with it are precision and speed.
> > >
> > > As such, the temporary line buffers could be simply u16 arrays, so you
> > > don't need to consider the channel packing into a u64.
> > >
> >
> > This wouldn't cause a problem to calculate the crc in BE machines?
>
> I don't think so, because userspace cannot expect CRC values to be
> portable between machines, drivers or display chips.
>
> https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#display-crc-support
>
> Thanks to Simon Ser for finding that piece of doc.

I will drop the `cpu_to_le16` then. Thanks.
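
A sketch of what a loader into a native-endian line buffer could look like
without any cpu_to_le16()/le16_to_cpu() calls. Several things here are
assumptions made for the example: the function name load_argb8888_row(),
the 8-bit to 16-bit expansion by multiplying with 257 (which maps 0xff to
0xffff), and the error return used for the bounds check that the
struct line_buffer suggestion makes possible. The B, G, R, A byte order
follows the little-endian definition of DRM_FORMAT_ARGB8888.

```c
#include <stddef.h>
#include <stdint.h>

struct line_pixel {
	uint16_t r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

/*
 * Convert 'count' ARGB8888 pixels (bytes in memory: B, G, R, A) into the
 * line buffer. Returns 0 on success, or -1 if the request would overflow
 * the line buffer, which would indicate a bug in the caller.
 */
static int load_argb8888_row(const uint8_t *src, size_t count,
			     struct line_buffer *lb)
{
	if (count > lb->npixels)
		return -1;

	for (size_t i = 0; i < count; i++, src += 4) {
		/* x * 257 replicates the byte: 0xab becomes 0xabab. */
		lb->row[i].b = (uint16_t)(src[0] * 257);
		lb->row[i].g = (uint16_t)(src[1] * 257);
		lb->row[i].r = (uint16_t)(src[2] * 257);
		lb->row[i].a = (uint16_t)(src[3] * 257);
	}
	return 0;
}
```

The channels are plain u16 values, so the blending code can read them
directly with no shifting or masking.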

>
> > > >  }
> > > >
> > >
> > >
> > >
> > > From here on, I will be removing the diff minus lines from the quoted
> > > code, because these functions are completely new.
> > >
> > > >  /**
> > > >   * alpha_blend - alpha blending equation
> > >
> > > This is specifically the pre-multiplied alpha blending, so reflect that
> > > in the function name.
> > >
> >
> > OK, I will use `pre_mul_alpha_blend`. Or something similar.
> >
> > > > + * @src_composer: source framebuffer's metadata
> > > > + * @dst_composer: destination framebuffer's metadata
> > > > > + * @y: The y coordinate (height) of the line that will be processed
> > > > > + * @line_buffer: The line with the pixels from src_composer
> > > >   *
> > > >   * blend pixels using premultiplied blend formula. The current DRM assumption
> > > >   * is that pixel color values have been already pre-multiplied with the alpha
> > > >   * channel values. See more drm_plane_create_blend_mode_property(). Also, this
> > > >   * formula assumes a completely opaque background.
> > > > + *
> > > > + * For performance reasons this function also fetches the pixels from the
> > > > + * destination of the frame line y.
> > > > > + * Since one set of source pixels is already in the output buffer, we fetch
> > > > > + * it here instead of in a separate function. And because the output format
> > > > > + * is ARGB16161616, we know that these pixels don't need to be converted.
> > > > > + * This saves us an indirect function call for each line.
> > >
> > > I think this paragraph should be obvious from the type of 'line_buffer'
> > > parameter and that you are blending src into dst.
> > >
> > > >   */
> > > > +static void alpha_blend(void *pixels_addr, int length, u64 *line_buffer)
> > > >  {
> > > > +     __le16 *output_pixel = pixels_addr;
> > >
> > > Aren't you supposed to be writing into line_buffer, not into src?
> > >
> > > There is something very strange with the logic here.
> > >
> > > In fact, the function signature of the blending function is unexpected.
> > > A blending function should operate on two line_buffers, not what looks
> > > like arbitrary buffer pixels.
> > >
> > > I think you should forget the old code and design these from scratch.
> > > You would have three different kinds of functions:
> > >
> > > - loading: fetch a row from an image and convert into a line buffer
> > > - blending: take two line buffers and blend them into one of the line
> > >   buffers
> > > - storing: convert a line buffer and write it into an image row
> > >
> > > I would not coerce these three different operations into less than
> > > three function pointer types.
> > >
> > > To actually run a blending operation between source and destination
> > > images, you would need four function pointers:
> > > - loader for source (by pixel format)
> > > - loader for destination (by pixel format)
> > > - blender (by chosen blending operation)
> > > - storing for destination (by pixel format)
> > >
> > > Function parameter types should make it obvious whether something is an
> > > image or row in arbitrary format, or a line buffer in the special
> > > internal format.
> > >
> > > Then the algorithm would work roughly like this:
> > >
> > > for each plane:
> > >         for each row:
> > >                 load source into lb1
> > >                 load destination into lb2
> > >                 blend lb1 into lb2
> > >                 store lb2 into destination
> > >
> > > This is not optimal, you see how destination is repeatedly loaded and
> > > stored for each plane. So you could swap the loops:
> > >
> > > allocate lb1, lb2 with destination width
> > > for each destination row:
> > >         load destination into lb2
> > >
> > >         for each plane:
> > >                 load source into lb1
> > >                 blend lb1 into lb2
> > >
> > >         store lb2 into destination
> >
> > I'm doing something very similar right now, based on comments from the
> > previous emails. It looks very similar to your pseudocode.
> >
> > And this solves several weirdnesses of my code that you commented
> > throughout this review.
> >
> > But I made a decision that I would like to hear your thoughts about it.
> >
> > Using your variables, instead of storing the lb2 in the destination,
> > I'm using it to calculate the CRC in the middle of the compositing loop.
> > And if necessary, storing/converting the lb2 into the wb buffer.
> >
> > So the pseudocode looks like that:
> >
> > allocate lb1, lb2 with destination width
> > for each destination row:
> >         load destination into lb2
> >
> >         for each plane:
> >                 load source into lb1
> >                 blend lb1 into lb2
> >
> >         compute crc of lb2
> >
> >         if wb pending
> > >                 convert and store lb2 to wb buffer
> >
> > return crc
> >
> > With that we avoid the allocation of the full image buffer.
>
> Yes, exactly. Sounds good.
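
The agreed loop can be sketched in userspace C roughly as follows. This is
only an illustration under several assumptions: all planes are ARGB8888,
lie fully inside the output, and are not scaled; the background is cleared
to opaque black instead of being loaded; struct plane, compose() and the
helpers are hypothetical names; crc32_le_sketch() is a stand-in assumed to
match the kernel's crc32_le().

```c
#include <stddef.h>
#include <stdint.h>

struct line_pixel { uint16_t r, g, b, a; };

/* Hypothetical plane: ARGB8888 pixels placed at (x, y) on the output. */
struct plane {
	const uint8_t *pixels;	/* B, G, R, A bytes per pixel */
	size_t pitch;		/* bytes per row */
	int x, y, w, h;		/* destination rectangle */
};

static uint32_t crc32_le_sketch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320u : 0u);
	}
	return crc;
}

static uint16_t blend_premult(uint16_t src, uint16_t dst, uint16_t src_alpha)
{
	uint32_t sum = (uint32_t)src * 0xffffu +
		       (uint32_t)dst * (0xffffu - src_alpha);

	return (uint16_t)((sum + 0xfffeu) / 0xffffu);
}

static void load_row(const uint8_t *src, struct line_pixel *lb, int n)
{
	for (int i = 0; i < n; i++, src += 4) {
		lb[i].b = (uint16_t)(src[0] * 257);
		lb[i].g = (uint16_t)(src[1] * 257);
		lb[i].r = (uint16_t)(src[2] * 257);
		lb[i].a = (uint16_t)(src[3] * 257);
	}
}

/* lb1 and lb2 must each hold out_w pixels. Returns the frame CRC. */
static uint32_t compose(const struct plane *planes, size_t nplanes,
			int out_w, int out_h,
			struct line_pixel *lb1, struct line_pixel *lb2)
{
	uint32_t crc = 0;

	for (int y = 0; y < out_h; y++) {
		/* "clear lb2": start from an opaque black row. */
		for (int x = 0; x < out_w; x++)
			lb2[x] = (struct line_pixel){ 0, 0, 0, 0xffff };

		for (size_t p = 0; p < nplanes; p++) {
			const struct plane *pl = &planes[p];

			/* Skip planes that do not overlap this row. */
			if (y < pl->y || y >= pl->y + pl->h)
				continue;

			load_row(pl->pixels + (size_t)(y - pl->y) * pl->pitch,
				 lb1, pl->w);

			/* Blend lb1 into lb2 at the plane's x offset. */
			for (int i = 0; i < pl->w; i++) {
				struct line_pixel *d = &lb2[pl->x + i];

				d->r = blend_premult(lb1[i].r, d->r, lb1[i].a);
				d->g = blend_premult(lb1[i].g, d->g, lb1[i].a);
				d->b = blend_premult(lb1[i].b, d->b, lb1[i].a);
			}
		}

		crc = crc32_le_sketch(crc, lb2, (size_t)out_w * sizeof(*lb2));
		/* A pending writeback would convert and store lb2 here. */
	}
	return crc;
}
```

Only two rows of intermediate storage are needed, regardless of the frame
height, which is the point of not allocating a full image buffer.
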
>
>
> > >
> > > Inside the loop over plane, you need to check if the plane overlaps the
> > > current destination row at all. If not, continue on the next plane. If
> > > yes, load source into lb1 and compute the offset into lb2 where it
> > > needs to be blended.
> >
> > Thanks for this tip, this is an optimization that, currently, my code doesn't
> > have.
> >
> > >
> > > Since we don't support scaling yet, lb1 length will never exceed
> > > destination width, because there is no need to load plane buffer pixels
> > > we would not be writing out.
> > >
> > > Also "load destination into lb2" could be replaced with just "clear
> > > > lb2" if the old destination contents are to be discarded. Then you also
> > > don't need the function pointer for "loader for destination".
> > >
> > > I think you already had all these ideas, just the execution in code got
> > > really messy somehow.
> > >
> > > > +     int i;
> > > >
> > > > +     for (i = 0; i < length; i++) {
> > > > +             u16 src1_a = line_buffer[i] >> 48;
> > > > +             u16 src1_r = (line_buffer[i] >> 32) & 0xffff;
> > > > +             u16 src1_g = (line_buffer[i] >> 16) & 0xffff;
> > > > +             u16 src1_b = line_buffer[i] & 0xffff;
> > >
> > > If you used native u16 array for line buffers, all this arithmetic
> > > would be unnecessary.
> > >
> > > >
> > > > +             u16 src2_r = le16_to_cpu(output_pixel[2]);
> > > > +             u16 src2_g = le16_to_cpu(output_pixel[1]);
> > > > +             u16 src2_b = le16_to_cpu(output_pixel[0]);
> > > > +
> > > > +             output_pixel[0] = blend_channel(src1_b, src2_b, src1_a);
> > > > +             output_pixel[1] = blend_channel(src1_g, src2_g, src1_a);
> > > > +             output_pixel[2] = blend_channel(src1_r, src2_r, src1_a);
> > > > +             output_pixel[3] = 0xffff;
> > > > +
> > > > +             output_pixel += 4;
> > > > +     }
> > > >  }
> > > >
> > > >  /**
> > > >   * @src_composer: source framebuffer's metadata
> > > > > + * @dst_composer: destination framebuffer's metadata
> > > > > + * @funcs: A struct containing all the composition functions (get_src_line
> > > > > + *         and set_output_line)
> > > > > + * @line_buffer: The line with the pixels from src_composer
> > > > >   *
> > > > > + * Using the composition functions passed in @funcs, this function blends
> > > > > + * all pixels from the src plane into an output buffer.
> > > > > + * Information about the output buffer is in the dst_composer parameter
> > > > > + * and about the source plane in the src_composer.
> > > > > + * The get_src_line will use the src_composer to get the respective line,
> > > > > + * convert it, and return it as ARGB16161616.
> > > > > + * And finally, the blend function will receive the src_composer, dst_composer,
> > > > > + * the line y coordinate, and the line buffer, blend all pixels, and store the
> > > > > + * result in the output.
> > > >   *
> > > >   * TODO: completely clear the primary plane (a = 0xff) before starting to blend
> > > >   * pixel color values
> > > >   */
> > > > +static void blend(struct vkms_composer *src_composer,
> > > >                 struct vkms_composer *dst_composer,
> > > > +               struct vkms_pixel_composition_functions *funcs,
> > > > +               u64 *line_buffer)
> > > >  {
> > > > +     int i, i_dst;
> > > >
> > > >       int x_src = src_composer->src.x1 >> 16;
> > > >       int y_src = src_composer->src.y1 >> 16;
> > > >
> > > >       int x_dst = src_composer->dst.x1;
> > > >       int y_dst = src_composer->dst.y1;
> > > > +
> > > >       int h_dst = drm_rect_height(&src_composer->dst);
> > > > +     int length = drm_rect_width(&src_composer->dst);
> > > >
> > > >       int y_limit = y_src + h_dst;
> > > > +
> > > > +     u8 *src_pixels = packed_pixels_addr(src_composer, x_src, y_src);
> > > > +     u8 *dst_pixels = packed_pixels_addr(dst_composer, x_dst, y_dst);
> > > > +
> > > > +     int src_next_line_offset = src_composer->pitch;
> > > > +     int dst_next_line_offset = dst_composer->pitch;
> > > > +
> > > > +     for (i = y_src, i_dst = y_dst; i < y_limit; ++i, i_dst++) {
> > > > +             funcs->get_src_line(src_pixels, length, line_buffer);
> > > > +             funcs->set_output_line(dst_pixels, length, line_buffer);
> > > > +             src_pixels += src_next_line_offset;
> > > > +             dst_pixels += dst_next_line_offset;
> > > >       }
> > > >  }
> > > >
> > > > +static void ((*get_line_fmt_transform_function(u32 format))
> > > > +         (void *pixels_addr, int length, u64 *line_buffer))
> > > >  {
> > > > +     if (format == DRM_FORMAT_ARGB8888)
> > > > +             return &ARGB8888_to_ARGB16161616;
> > > > +     else if (format == DRM_FORMAT_ARGB16161616)
> > > > +             return &get_ARGB16161616;
> > > > +     else
> > > > +             return &XRGB8888_to_ARGB16161616;
> > > > +}
> > > >
> > > > +static void ((*get_output_line_function(u32 format))
> > > > +          (void *pixels_addr, int length, u64 *line_buffer))
> > > > +{
> > > > +     if (format == DRM_FORMAT_ARGB8888)
> > > > +             return &convert_to_ARGB8888;
> > > > +     else if (format == DRM_FORMAT_ARGB16161616)
> > > > +             return &convert_to_ARGB16161616;
> > > > +     else
> > > > +             return &convert_to_XRGB8888;
> > > > +}
> > > >
> > > > +static void compose_plane(struct vkms_composer *src_composer,
> > > > +                       struct vkms_composer *dst_composer,
> > >
> > > I'm confused by the vkms_composer concept. If there is a separate thing
> > > for source and destination and they are used together, then I don't
> > > think that thing is a "composer" but some kind of... image structure?
> >
> > I didn't create this struct, but I think this is exactly what it represents.
> >
> > > "Composer" is what compose_active_planes() does.
> >
> > Do you think this struct needs a rename?
>
> In the long run, yes.
>
> > >
> > > > +                       struct vkms_pixel_composition_functions *funcs,
> > > > +                       u64 *line_buffer)
> > > > +{
> > > > +     u32 src_format = src_composer->fb->format->format;
> > > >
> > > > +     funcs->get_src_line = get_line_fmt_transform_function(src_format);
> > > >
> > > > +     blend(src_composer, dst_composer, funcs, line_buffer);
> > >
> > > This function is confusing. You get 'funcs' as argument, but you
> > > overwrite one field and then trust that the other field was already set
> > > by the caller. The policy of how 'funcs' argument here works is too
> > > complicated to me.
> > >
> > > If you need just one function pointer as argument, then do exactly
> > > that, and construct the vfunc struct inside this function.
> >
> > I think this will be totally solved with the code redesign.
>
> I think so too.
>
> ...
>
> > > > diff --git a/drivers/gpu/drm/vkms/vkms_formats.h b/drivers/gpu/drm/vkms/vkms_formats.h
> > > > new file mode 100644
> > > > index 000000000000..5b850fce69f3
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/vkms/vkms_formats.h
> > > > @@ -0,0 +1,155 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0+ */
> > > > +
> > > > +#ifndef _VKMS_FORMATS_H_
> > > > +#define _VKMS_FORMATS_H_
> > > > +
> > > > +#include <drm/drm_rect.h>
> > > > +
> > > > +#define pixel_offset(composer, x, y) \
> > > > +     ((composer)->offset + ((y) * (composer)->pitch) + ((x) * (composer)->cpp))
> > >
> > > Why macro instead of a static inline function?
> >
> > Again, I don't have a good answer for that :(
>
> I would recommend using a static inline function whenever possible,
> and macros only when an inline function cannot work. The reason is that
> an inline function has types in its signature so you get some type
> safety, and it cannot accidentally mess up other variables in the call
> sites. A function also cannot "secretly" use variables from the call
> site like a macro can, so the reader can be sure that the function call
> will not access anything not listed in the parameters.

That makes sense to me. I will follow this guideline in my future code.
Thanks!
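
As an illustration of that guideline, the pixel_offset() macro from the
patch could be written as a static inline function along these lines. The
struct image_layout below is a hypothetical stand-in that carries only the
three fields the macro reads; in the driver the argument would be the real
buffer metadata struct.

```c
#include <stddef.h>

/* Hypothetical subset of the buffer metadata used by pixel_offset(). */
struct image_layout {
	size_t offset;	/* byte offset of the first pixel */
	size_t pitch;	/* bytes per row */
	size_t cpp;	/* bytes per pixel */
};

/*
 * Byte offset of pixel (x, y). Unlike the macro, the parameter types are
 * checked by the compiler and each argument is evaluated exactly once.
 */
static inline size_t pixel_offset(const struct image_layout *img, int x, int y)
{
	return img->offset + (size_t)y * img->pitch + (size_t)x * img->cpp;
}
```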

>
>
> > > > +
> > > > +/*
> > > > + * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
> > > > + *
> > > > + * @composer: Buffer metadata
> > > > > + * @x: The x (width) coordinate of the 2D buffer
> > > > > + * @y: The y (height) coordinate of the 2D buffer
> > > > + *
> > > > + * Takes the information stored in the composer, a pair of coordinates, and
> > > > + * returns the address of the first color channel.
> > > > + * This function assumes the channels are packed together, i.e. a color channel
> > > > + * comes immediately after another. And therefore, this function doesn't work
> > > > + * for YUV with chroma subsampling (e.g. YUV420 and NV21).
> > > > + */
> > > > +static void *packed_pixels_addr(struct vkms_composer *composer, int x, int y)
> > >
> > > Is it normal in the kernel to have non-inline functions in headers?
> > >
> > > Actually this file does not look like a header at all, it should
> > > probably be a .c file and not #included.
> >
> > Oops. This should not be that way. I will fix it.
>
> While you do that, I wonder if it makes sense to put the functions like
> get_line_fmt_transform_function() in this file as well, so you only
> need to expose the getters, and the implementations can remain static
> functions.

This makes sense to me, at least, considering that vkms_formats
handles everything related to formats.

And it will be one less file to modify when adding a new format.
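
A rough sketch of that layout, with the conversion routines kept static and
only a getter visible to other files. All names here are hypothetical: the
enum sketch_format stands in for the DRM fourcc codes, and line_load_fn is a
typedef introduced to avoid the nested function-pointer declarator used in
the patch. Returning NULL for an unknown format is a choice made for this
example; the patch falls back to the XRGB8888 handler instead.

```c
#include <stddef.h>
#include <stdint.h>

struct line_pixel { uint16_t r, g, b, a; };

/* One signature for every "load a row into the line buffer" routine. */
typedef void (*line_load_fn)(const uint8_t *src, struct line_pixel *out,
			     size_t n);

/* Stand-ins for DRM_FORMAT_* codes, used only in this sketch. */
enum sketch_format { FMT_ARGB8888, FMT_XRGB8888, FMT_UNSUPPORTED };

/* Implementations stay private to this file. */
static void load_argb8888(const uint8_t *src, struct line_pixel *out, size_t n)
{
	for (size_t i = 0; i < n; i++, src += 4) {
		out[i].b = (uint16_t)(src[0] * 257);
		out[i].g = (uint16_t)(src[1] * 257);
		out[i].r = (uint16_t)(src[2] * 257);
		out[i].a = (uint16_t)(src[3] * 257);
	}
}

static void load_xrgb8888(const uint8_t *src, struct line_pixel *out, size_t n)
{
	for (size_t i = 0; i < n; i++, src += 4) {
		out[i].b = (uint16_t)(src[0] * 257);
		out[i].g = (uint16_t)(src[1] * 257);
		out[i].r = (uint16_t)(src[2] * 257);
		out[i].a = 0xffff;	/* the X byte is ignored: opaque */
	}
}

/* The only function that would need a prototype in the header. */
line_load_fn get_line_load_function(enum sketch_format format)
{
	switch (format) {
	case FMT_ARGB8888:
		return load_argb8888;
	case FMT_XRGB8888:
		return load_xrgb8888;
	default:
		return NULL;
	}
}
```

Adding a format then means adding one static function and one case label
in the same file.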

>
>
> Thanks,
> pq

* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-11-11 14:07         ` Igor Torrente
@ 2021-11-11 14:37           ` Pekka Paalanen
  2021-11-12 12:50             ` Igor Torrente
  0 siblings, 1 reply; 28+ messages in thread
From: Pekka Paalanen @ 2021-11-11 14:37 UTC (permalink / raw)
  To: Igor Torrente
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel, kernel test robot

On Thu, 11 Nov 2021 11:07:21 -0300
Igor Torrente <igormtorrente@gmail.com> wrote:

> Hi Pekka,
> 
> On Thu, Nov 11, 2021 at 6:33 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Wed, 10 Nov 2021 13:56:54 -0300
> > Igor Torrente <igormtorrente@gmail.com> wrote:
> >  
> > > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:  
> > > >
> > > > Hi Igor,
> > > >
> > > > again, that is a really nice speed-up. Unfortunately, I find the code
> > > > rather messy and hard to follow. I hope my comments below help with
> > > > re-designing it to be easier to understand.
> > > >
> > > >
> > > > On Tue, 26 Oct 2021 08:34:06 -0300
> > > > Igor Torrente <igormtorrente@gmail.com> wrote:
> > > >  
> > > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > > > > as a color input.
> > > > >
> > > > > This patch refactors all the functions related to the plane composition
> > > > > to overcome this limitation.
> > > > >
> > > > > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > > > > containing two handlers.
> > > > >
> > > > > One will generate a buffer of each line of the frame with the pixels
> > > > > converted to ARGB16161616. And the other will take this line buffer,
> > > > > do some computation on it, and store the pixels in the destination.
> > > > >
> > > > > Both the handlers have the same signature. They receive a pointer to
> > > > > the pixels that will be processed(`pixels_addr`), the number of pixels
> > > > > that will be treated(`length`), and the intermediate buffer of the size
> > > > > of a frame line (`line_buffer`).
> > > > >
> > > > > The first function has been totally described previously.  
> > > >
> > > > What does this sentence mean?  
> > >
> > > In the sentence "One will generate...", I give an overview of the two types of
> > > handlers. And the overview of the first handler describes the full behavior of
> > > it.
> > >
> > > But it doesn't look clear enough, I will improve it in the future.
> > >  
> > > >  
> > > > >
> > > > > The second is more interesting, as it has to perform two roles depending
> > > > > on where it is called in the code.
> > > > >
> > > > > The first is to convert(if necessary) the data received in the
> > > > > `line_buffer` and write in the memory pointed by `pixels_addr`.
> > > > >
> > > > > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > > > > the result back to the `pixels_addr`.
> > > > >
> > > > > The per-line implementation was chosen for performance reasons.
> > > > > The per-pixel functions were having performance issues due to indirect
> > > > > function call overhead.
> > > > >
> > > > > The per-line code trades off memory for execution time. The `line_buffer`
> > > > > allows us to diminish the number of function calls.
> > > > >
> > > > > Results in the IGT test `kms_cursor_crc`:
> > > > >
> > > > > |                     Frametime                       |
> > > > > |:---------------:|:---------:|:----------:|:--------:|
> > > > > > | implementation  |  Current  |  Per-pixel | Per-line |
> > > > > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > > > > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> > > > >
> > > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > > Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> > > > > ---
> > > > > > V2: Improves the performance drastically, by performing the operations
> > > > >     per-line and not per-pixel(Pekka Paalanen).
> > > > >     Minor improvements(Pekka Paalanen).
> > > > > ---
> > > > >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> > > > >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> > > > >  2 files changed, 342 insertions(+), 134 deletions(-)
> > > > >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> > > > >
> > > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > index 383ca657ddf7..69fe3a89bdc9 100644
> > > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c

...

> > > > > +struct vkms_pixel_composition_functions {
> > > > > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);  
> > > >
> > > > I would be a little more comfortable if instead of u64 *line_buffer you
> > > > would have something like
> > > >
> > > > struct line_buffer {
> > > >         u16 *row;
> > > >         size_t nelem;
> > > > };
> > > >
> > > > so that the functions to be plugged into these function pointers could
> > > > assert that you do not accidentally overflow the array (which would
> > > > imply a code bug in kernel).
> > > >
> > > > One could perhaps go even for:
> > > >
> > > > struct line_pixel {
> > > >         u16 r, g, b, a;
> > > > };
> > > >
> > > > struct line_buffer {
> > > >         struct line_pixel *row;
> > > >         size_t npixels;
> > > > };  
> > >
> > > If we decide to follow this representation, would it be possible
> > > to calculate the crc in the similar way that is being done currently?
> > >
> > > Something like that:
> > >
> > > crc = crc32_le(crc, line_buffer.row, w * sizeof(struct line_pixel));
> >
> > Hi Igor,
> >
> > yes. I think the CRC calculated does not need to be reproducible in
> > userspace, so you can very well compute it from the internal
> > intermediate representation. It also does not need to be portable
> > between architectures, AFAIU.  
> 
> Great! This will make things easier.
> 
> >  
> > > I mean, If the compiler can decide to put a padding somewhere, it
> > > would mess with the crc value. Right?  
> >
> > Padding could mess it up, yes. However, I think in kernel it is a
> > convention to define structs (especially UAPI structs but this is not
> > one) such that there is no implicit padding. So there must be some
> > recommended practises on how to achieve and ensure that.
> >
> > The size of struct line_pixel as defined above is 8 bytes which is a
> > "very round" number, and every field has the same type, so there won't
> > be gaps between fields either. So I think the struct should already be
> > fine and have no padding, but how to make sure it is, I'm not sure what
> > you would do in kernel land.
> >
> > In userspace I would put a static assert to ensure that
> > sizeof(struct line_pixel) == 8. That would be enough, because sizeof
> > counts not just internal implicit padding but also the needed size
> > extension for alignment in an array of those. The accumulated size of
> > the fields individually is 8 bytes, so if the struct size is 8, there
> > cannot be padding.
> >  
> 
> Apparently the kernel uses a compiler extension in a macro to do this
> kind of struct packing.
> 
> include/linux/compiler_attributes.h
> 265:#define __packed                        __attribute__((__packed__))

Hi Igor,

we do not actually want to force packing, though.

If there would be padding without packing, then packing may incur a
noticeable speed penalty in accessing the fields. We don't want to risk
that.

So I think it's better to just assert that no padding exists instead.
There would be something quite strange going on if there was padding in
this case, but better safe than sorry, because debugging that would be
awful.


Thanks,
pq
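
The "assert that no padding exists" suggestion in the message above can be sketched in userspace C roughly as follows. This is only an illustration under stated assumptions: `uint16_t` stands in for the kernel's `u16`, C11 `_Static_assert` stands in for whatever the driver would actually use (in kernel code that would presumably be `static_assert()` or `BUILD_BUG_ON()` from <linux/build_bug.h>), and `line_row_bytes()` is a hypothetical helper that is not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for the struct proposed in the review;
 * uint16_t replaces the kernel's u16 (assumption for this sketch).
 */
struct line_pixel {
	uint16_t r, g, b, a;
};

/*
 * The four fields add up to 8 bytes. If sizeof() also reports 8, there
 * can be neither padding between fields nor trailing padding added for
 * array alignment. The build fails otherwise, without forcing __packed.
 */
_Static_assert(sizeof(struct line_pixel) == 4 * sizeof(uint16_t),
	       "struct line_pixel must not contain padding");

/* Hypothetical helper: number of bytes occupied by a row of npixels. */
static size_t line_row_bytes(size_t npixels)
{
	return npixels * sizeof(struct line_pixel);
}
```

Compared with `__packed`, the struct keeps its natural alignment, so field access costs nothing extra; the assertion only documents and enforces the layout the CRC code would rely on.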


* Re: [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats
  2021-11-11 14:37           ` Pekka Paalanen
@ 2021-11-12 12:50             ` Igor Torrente
  0 siblings, 0 replies; 28+ messages in thread
From: Igor Torrente @ 2021-11-12 12:50 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: hamohammed.sa, Thomas Zimmermann, rodrigosiqueiramelo, airlied,
	Leandro Ribeiro, melissa.srw, dri-devel, kernel test robot

Hi Pekka,

On Thu, Nov 11, 2021 at 11:37 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Thu, 11 Nov 2021 11:07:21 -0300
> Igor Torrente <igormtorrente@gmail.com> wrote:
>
> > Hi Pekka,
> >
> > On Thu, Nov 11, 2021 at 6:33 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Wed, 10 Nov 2021 13:56:54 -0300
> > > Igor Torrente <igormtorrente@gmail.com> wrote:
> > >
> > > > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > > > >
> > > > > Hi Igor,
> > > > >
> > > > > again, that is a really nice speed-up. Unfortunately, I find the code
> > > > > rather messy and hard to follow. I hope my comments below help with
> > > > > re-designing it to be easier to understand.
> > > > >
> > > > >
> > > > > On Tue, 26 Oct 2021 08:34:06 -0300
> > > > > Igor Torrente <igormtorrente@gmail.com> wrote:
> > > > >
> > > > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > > > > > as a color input.
> > > > > >
> > > > > > This patch refactors all the functions related to the plane composition
> > > > > > to overcome this limitation.
> > > > > >
> > > > > > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > > > > > containing two handlers.
> > > > > >
> > > > > > One will generate a buffer of each line of the frame with the pixels
> > > > > > converted to ARGB16161616. And the other will take this line buffer,
> > > > > > do some computation on it, and store the pixels in the destination.
> > > > > >
> > > > > > Both the handlers have the same signature. They receive a pointer to
> > > > > > the pixels that will be processed(`pixels_addr`), the number of pixels
> > > > > > that will be treated(`length`), and the intermediate buffer of the size
> > > > > > of a frame line (`line_buffer`).
> > > > > >
> > > > > > The first function has been totally described previously.
> > > > >
> > > > > What does this sentence mean?
> > > >
> > > > In the sentence "One will generate...", I give an overview of the two types of
> > > > handlers. And the overview of the first handler describes the full behavior of
> > > > it.
> > > >
> > > > But it doesn't look clear enough, I will improve it in the future.
> > > >
> > > > >
> > > > > >
> > > > > > The second is more interesting, as it has to perform two roles depending
> > > > > > on where it is called in the code.
> > > > > >
> > > > > > The first is to convert(if necessary) the data received in the
> > > > > > `line_buffer` and write in the memory pointed by `pixels_addr`.
> > > > > >
> > > > > > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > > > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > > > > > the result back to the `pixels_addr`.
> > > > > >
> > > > > > The per-line implementation was chosen for performance reasons.
> > > > > > The per-pixel functions were having performance issues due to indirect
> > > > > > function call overhead.
> > > > > >
> > > > > > The per-line code trades off memory for execution time. The `line_buffer`
> > > > > > allows us to diminish the number of function calls.
> > > > > >
> > > > > > Results in the IGT test `kms_cursor_crc`:
> > > > > >
> > > > > > |                     Frametime                       |
> > > > > > |:---------------:|:---------:|:----------:|:--------:|
> > > > > > > | implementation  |  Current  |  Per-pixel | Per-line |
> > > > > > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > > > > > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> > > > > >
> > > > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > > > Signed-off-by: Igor Torrente <igormtorrente@gmail.com>
> > > > > > ---
> > > > > > > V2: Improves the performance drastically, by performing the operations
> > > > > >     per-line and not per-pixel(Pekka Paalanen).
> > > > > >     Minor improvements(Pekka Paalanen).
> > > > > > ---
> > > > > >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> > > > > >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> > > > > >  2 files changed, 342 insertions(+), 134 deletions(-)
> > > > > >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > > index 383ca657ddf7..69fe3a89bdc9 100644
> > > > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c
>
> ...
>
> > > > > > +struct vkms_pixel_composition_functions {
> > > > > > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > > > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > >
> > > > > I would be a little more comfortable if instead of u64 *line_buffer you
> > > > > would have something like
> > > > >
> > > > > struct line_buffer {
> > > > >         u16 *row;
> > > > >         size_t nelem;
> > > > > };
> > > > >
> > > > > so that the functions to be plugged into these function pointers could
> > > > > assert that you do not accidentally overflow the array (which would
> > > > > imply a code bug in kernel).
> > > > >
> > > > > One could perhaps go even for:
> > > > >
> > > > > struct line_pixel {
> > > > >         u16 r, g, b, a;
> > > > > };
> > > > >
> > > > > struct line_buffer {
> > > > >         struct line_pixel *row;
> > > > >         size_t npixels;
> > > > > };
> > > >
> > > > If we decide to follow this representation, would it be possible
> > > > to calculate the crc in the similar way that is being done currently?
> > > >
> > > > Something like that:
> > > >
> > > > crc = crc32_le(crc, line_buffer.row, w * sizeof(struct line_pixel));
> > >
> > > Hi Igor,
> > >
> > > yes. I think the CRC calculated does not need to be reproducible in
> > > userspace, so you can very well compute it from the internal
> > > intermediate representation. It also does not need to be portable
> > > between architectures, AFAIU.
> >
> > Great! This will make things easier.
> >
> > >
> > > > I mean, If the compiler can decide to put a padding somewhere, it
> > > > would mess with the crc value. Right?
> > >
> > > Padding could mess it up, yes. However, I think in kernel it is a
> > > convention to define structs (especially UAPI structs but this is not
> > > one) such that there is no implicit padding. So there must be some
> > > recommended practises on how to achieve and ensure that.
> > >
> > > The size of struct line_pixel as defined above is 8 bytes which is a
> > > "very round" number, and every field has the same type, so there won't
> > > be gaps between fields either. So I think the struct should already be
> > > fine and have no padding, but how to make sure it is, I'm not sure what
> > > you would do in kernel land.
> > >
> > > In userspace I would put a static assert to ensure that
> > > sizeof(struct line_pixel) == 8. That would be enough, because sizeof
> > > counts not just internal implicit padding but also the needed size
> > > extension for alignment in an array of those. The accumulated size of
> > > the fields individually is 8 bytes, so if the struct size is 8, there
> > > cannot be padding.
> > >
> >
> > Apparently the kernel uses a compiler extension in a macro to do this
> > kind of struct packing.
> >
> > include/linux/compiler_attributes.h
> > 265:#define __packed                        __attribute__((__packed__))
>
> Hi Igor,
>
> we do not actually want to force packing, though.
>
> If there would be padding without packing, then packing may incur a
> noticeable speed penalty in accessing the fields. We don't want to risk
> that.

I understand...

>
> So I think it's better to just assert that no padding exists instead.
> There would be something quite strange going on if there was padding in
> this case, but better safe than sorry, because debugging that would be
> awful.
>

OK. I will do that and also test some alternatives.

>
> Thanks,
> pq

Thanks,
---
Igor M. A. Torrente
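
For the CRC question discussed in this thread, a minimal userspace sketch of computing the checksum directly from the intermediate line buffer might look like the code below. Everything here is an assumption for illustration: `crc32_le_sketch()` is a plain bitwise CRC-32 written for this example and is not the kernel's `crc32_le()` implementation, `line_buffer_crc()` is a hypothetical helper, and `uint16_t` stands in for `u16`. The struct layout follows the one proposed in the review.

```c
#include <stddef.h>
#include <stdint.h>

struct line_pixel {
	uint16_t r, g, b, a;
};

struct line_buffer {
	struct line_pixel *row;
	size_t npixels;
};

/* Hashing raw struct bytes is only meaningful if there is no padding. */
_Static_assert(sizeof(struct line_pixel) == 8,
	       "struct line_pixel must not contain padding");

/*
 * Bitwise little-endian (reflected) CRC-32, polynomial 0xEDB88320.
 * Like a running CRC, it takes the previous value as seed and applies
 * no implicit inversion, so calls can be chained line by line.
 */
static uint32_t crc32_le_sketch(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
	}
	return crc;
}

/* Hypothetical helper: fold one converted line into the running CRC. */
static uint32_t line_buffer_crc(uint32_t crc, const struct line_buffer *lb)
{
	return crc32_le_sketch(crc, lb->row,
			       lb->npixels * sizeof(struct line_pixel));
}
```

Because the running value is passed through, feeding the frame one line at a time gives the same result as hashing all rows as one contiguous buffer. The value depends on the host byte order of the `uint16_t` fields, which matches the point made in the thread that the CRC does not need to be portable between architectures.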


Thread overview: 28+ messages
2021-10-26 11:34 [PATCH v2 0/8] Add new formats support to vkms Igor Torrente
2021-10-26 11:34 ` [PATCH v2 1/8] drm: vkms: Replace the deprecated drm_mode_config_init Igor Torrente
2021-10-26 11:34 ` [PATCH v2 2/8] drm: vkms: Alloc the compose frame using vzalloc Igor Torrente
2021-10-26 11:34 ` [PATCH v2 3/8] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES Igor Torrente
2021-11-03 15:40   ` Thomas Zimmermann
2021-10-26 11:34 ` [PATCH v2 4/8] drm: vkms: Add fb information to `vkms_writeback_job` Igor Torrente
2021-11-03 15:45   ` Thomas Zimmermann
2021-11-03 19:18     ` Igor Torrente
2021-11-04  7:21       ` Thomas Zimmermann
2021-10-26 11:34 ` [PATCH v2 5/8] drm: drm_atomic_helper: Add a new helper to deal with the writeback connector validation Igor Torrente
2021-10-28 21:38   ` Leandro Ribeiro
2021-11-03 15:03     ` Igor Torrente
2021-11-03 15:11       ` Leandro Ribeiro
2021-11-03 15:37         ` Thomas Zimmermann
2021-11-03 18:41           ` Igor Torrente
2021-10-26 11:34 ` [PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats Igor Torrente
2021-11-09 11:40   ` Pekka Paalanen
2021-11-10 16:56     ` Igor Torrente
2021-11-11  9:33       ` Pekka Paalanen
2021-11-11 14:07         ` Igor Torrente
2021-11-11 14:37           ` Pekka Paalanen
2021-11-12 12:50             ` Igor Torrente
2021-10-26 11:34 ` [PATCH v2 7/8] drm: vkms: Exposes ARGB_1616161616 and adds XRGB_16161616 formats Igor Torrente
2021-10-26 11:34 ` [PATCH v2 8/8] drm: vkms: Add support the RGB565 format Igor Torrente
2021-10-26 11:34 ` [PATCH v2 8/8] drm: vkms: Add support to " Igor Torrente
2021-11-09  9:32 ` [PATCH v2 0/8] Add new formats support to vkms Pekka Paalanen
2021-11-10 17:32   ` Igor Torrente
2021-11-11  8:32     ` Pekka Paalanen
