All of lore.kernel.org
 help / color / mirror / Atom feed
From: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
To: Pekka Paalanen <ppaalanen@gmail.com>
Cc: rodrigosiqueiramelo@gmail.com, melissa.srw@gmail.com,
	hamohammed.sa@gmail.com, daniel@ffwll.ch, airlied@linux.ie,
	contact@emersion.fr, leandro.ribeiro@collabora.com,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	lkcamp@lists.libreplanetbr.org
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats
Date: Mon, 18 Oct 2021 16:26:06 -0300	[thread overview]
Message-ID: <d5f92494-9c55-2b7d-6107-6048abb41759@gmail.com> (raw)
In-Reply-To: <20211018113009.5519457c@eldfell>

Hi Pekka,

On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> On Tue,  5 Oct 2021 17:16:37 -0300
> Igor Matheus Andrade Torrente <igormtorrente@gmail.com> wrote:
> 
>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>> as a color input.
>>
>> This patch refactors all the functions related to the plane composition
>> to overcome this limitation.
>>
>> Now the blend function receives a format handler to each plane and a
>> blend function pointer. It will take two ARGB_1616161616 pixels, one
>> for each handler, and will use the blend function to calculate and
>> store the final color in the output buffer.
>>
>> These format handlers will receive the `vkms_composer` and a pair of
>> coordinates. And they should return the respective pixel in the
>> ARGB_16161616 format.
>>
>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>> the vkms_composer of the output buffer. The method should perform the
>> blend operation and store output to the format aforementioned
>> ARGB_16161616.
> 
> Hi,
> 
> here are some drive-by comments which you are free to take or leave.
> 
> To find more efficient ways to do this, you might want research Pixman
> library. It's MIT licensed.
> 
> IIRC, the elementary operations there operate on pixel lines (pointer +
> length), not individual pixels (image, x, y). Input conversion function
> reads and converts a whole line a time. Blending function blends two
> lines and writes the output into back one of the lines. Output
> conversion function takes a line and converts it into destination
> buffer. That way the elementary operations do not need to take stride
> into account, and blending does not need to deal with varying memory
> alignment FWIW. The inner loops involve much less function calls and
> state, probably.

I was doing some rudimentary profiling, and I noticed that the code 
spends most of the time with the indirect system call overhead and not 
the actual computation. This can definitively help with it.

> 
> Pixman may not do exactly this, but it does something very similar.
> Pixman also has multiple different levels of optimisations, which may
> not be necessary for VKMS.
> 
> It's a trade-off between speed and temporary memory consumed. You need
> temporary buffers for two lines, but not more (not a whole image in
> full intermediate precision). Further optimisation could determine
> whether to process whole image rows as lines, or split a row into
> multiple lines to stay within CPU cache size.
> 

Sorry, I didn't understand the idea of the last sentence.

> Since it seems you are blending multiple planes into a single
> destination which is assumed to be opaque, you might also be able to do
> the multiple blends without convert-writing and read-converting to/from
> the destination between every plane. I think that might be possible to
> architect on top of the per-line operations still.

I tried it. But I don't know how to do this without looking like a mess. 
Does the pixman perform some similar?

> 
>> Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
>> ---
>>   drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>>   drivers/gpu/drm/vkms/vkms_formats.h  | 125 ++++++++++++
>>   2 files changed, 271 insertions(+), 129 deletions(-)
>>   create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> 
> ...
> 
>> +
>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Organizes the channels in their respective positions and converts
>> +	 * the 8 bits channel to 16.
>> +	 * The 257 is the "conversion ratio". This number is obtained by the
>> +	 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
>> +	 * the best color value in a color space with more possibilities.
> 
> Pixel format, not color space. >
>> +	 * And a similar idea applies to others RGB color conversions.
>> +	 */
>> +	return ((u64)pixel_addr[3] * 257) << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
> 
> I wonder if the compiler is smart enough to choose between "mul 257"
> and "(v << 8) | v" operations... but that's probably totally drowning
> under the overhead of per (x,y) looping.

I disassembled the code to check it. And looks like the compiler is 
replacing the multiplication with shifts and additions.

ARGB8888_to_ARGB16161616:
    0xffffffff816ad660 <+0>:     imul   0x12c(%rdi),%edx
    0xffffffff816ad667 <+7>:     imul   0x130(%rdi),%esi
    0xffffffff816ad66e <+14>:    add    %edx,%esi
    0xffffffff816ad670 <+16>:    add    0x128(%rdi),%esi
    0xffffffff816ad676 <+22>:    movslq %esi,%rax
    0xffffffff816ad679 <+25>:    add    0xe8(%rdi),%rax
    0xffffffff816ad680 <+32>:    movzbl 0x3(%rax),%ecx
    0xffffffff816ad684 <+36>:    movzbl 0x2(%rax),%esi
    0xffffffff816ad688 <+40>:    mov    %rcx,%rdx
    0xffffffff816ad68b <+43>:    shl    $0x8,%rdx
    0xffffffff816ad68f <+47>:    add    %rcx,%rdx
    0xffffffff816ad692 <+50>:    mov    %rsi,%rcx
    0xffffffff816ad695 <+53>:    shl    $0x8,%rcx
    0xffffffff816ad699 <+57>:    shl    $0x30,%rdx
    0xffffffff816ad69d <+61>:    add    %rsi,%rcx
    0xffffffff816ad6a0 <+64>:    movzbl (%rax),%esi
    0xffffffff816ad6a3 <+67>:    shl    $0x20,%rcx
    0xffffffff816ad6a7 <+71>:    or     %rcx,%rdx
    0xffffffff816ad6aa <+74>:    mov    %rsi,%rcx
    0xffffffff816ad6ad <+77>:    shl    $0x8,%rcx
    0xffffffff816ad6b1 <+81>:    add    %rsi,%rcx
    0xffffffff816ad6b4 <+84>:    or     %rcx,%rdx
    0xffffffff816ad6b7 <+87>:    movzbl 0x1(%rax),%ecx
    0xffffffff816ad6bb <+91>:    mov    %rcx,%rax
    0xffffffff816ad6be <+94>:    shl    $0x8,%rax
    0xffffffff816ad6c2 <+98>:    add    %rcx,%rax
    0xffffffff816ad6c5 <+101>:   shl    $0x10,%rax
    0xffffffff816ad6c9 <+105>:   or     %rdx,%rax
    0xffffffff816ad6cc <+108>:   ret

> 
>> +}
>> +
>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * The same as the ARGB8888 but with the alpha channel as the
>> +	 * maximum value as possible.
>> +	 */
>> +	return 0xffffllu << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
>> +}
>> +
>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Because the format byte order is in little-endian and this code
>> +	 * needs to run on big-endian machines too, we need modify
>> +	 * the byte order from little-endian to the CPU native byte order.
>> +	 */
>> +	return le64_to_cpu(*pixel_addr);
>> +}
>> +
>> +/*
>> + * The following functions are used as blend operations. But unlike the
>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>> + * source, convert it to a specific format, and store it in the destination.
>> + *
>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>> + * copy and convert one pixel from/to the output buffer to/from
>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>> + */
>> +
>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	/*
>> +	 * This sequence below is important because the format's byte order is
>> +	 * in little-endian. In the case of the ARGB8888 the memory is
>> +	 * organized this way:
>> +	 *
>> +	 * | Addr     | = blue channel
>> +	 * | Addr + 1 | = green channel
>> +	 * | Addr + 2 | = Red channel
>> +	 * | Addr + 3 | = Alpha channel
>> +	 */
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
> 
> This could be potentially expensive if the compiler ends up using an
> actual div instruction.
> 
Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the 
"Faster div by 255" works to adapt to 16 bits.

> Btw. this would be shorter to write as
> 
> 	(argb_src1 >> 16) & 0xffff
> 
> etc.
I will use it in the V2. Thanks.

> 
> Thanks,
> pq
> 
>> +}
>> +
>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = 0xff;
>> +}
>> +
>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			     struct vkms_composer *dst_composer)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	*pixel_addr = cpu_to_le64(argb_src1);
>> +}
>> +
>> +#endif /* _VKMS_FORMATS_H_ */
> 

  reply	other threads:[~2021-10-18 19:26 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-05 20:16 [PATCH 0/6] Refactor the vkms to accept new formats Igor Matheus Andrade Torrente
2021-10-05 20:16 ` [PATCH 1/6] drm: vkms: Replace the deprecated drm_mode_config_init Igor Matheus Andrade Torrente
2021-10-18 10:02   ` Thomas Zimmermann
2021-10-18 18:21     ` Igor Matheus Andrade Torrente
2021-10-05 20:16 ` [PATCH 2/6] drm: vkms: Alloc the compose frame using vzalloc Igor Matheus Andrade Torrente
2021-10-05 20:16 ` [PATCH 3/6] drm: vkms: Replace hardcoded value of `vkms_composer.map` to DRM_FORMAT_MAX_PLANES Igor Matheus Andrade Torrente
2021-10-18 10:04   ` Thomas Zimmermann
2021-10-05 20:16 ` [PATCH 4/6] drm: vkms: Add fb information to `vkms_writeback_job` Igor Matheus Andrade Torrente
2021-10-18 10:10   ` Thomas Zimmermann
2021-10-05 20:16 ` [PATCH 5/6] drm: vkms: Prepare `vkms_wb_encoder_atomic_check` to accept multiple formats Igor Matheus Andrade Torrente
2021-10-18 10:14   ` Thomas Zimmermann
2021-10-18 17:41     ` Igor Matheus Andrade Torrente
2021-10-18 18:06       ` Thomas Zimmermann
2021-10-18 19:32         ` Igor Matheus Andrade Torrente
2021-10-19  7:17           ` Thomas Zimmermann
2021-10-19 19:12             ` Igor Matheus Andrade Torrente
2021-10-05 20:16 ` [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats Igor Matheus Andrade Torrente
2021-10-05 22:20   ` kernel test robot
2021-10-05 22:20     ` kernel test robot
2021-10-05 23:18   ` kernel test robot
2021-10-05 23:18     ` kernel test robot
2021-10-05 23:36   ` kernel test robot
2021-10-05 23:36     ` kernel test robot
2021-10-18  8:30   ` Pekka Paalanen
2021-10-18 19:26     ` Igor Matheus Andrade Torrente [this message]
2021-10-19  8:05       ` Pekka Paalanen
2021-10-19 21:10         ` Igor Matheus Andrade Torrente
2021-10-20  8:25           ` Pekka Paalanen
2021-10-18  7:53 ` [PATCH 0/6] Refactor the vkms " Pekka Paalanen
2021-10-18 18:05   ` Igor Matheus Andrade Torrente

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d5f92494-9c55-2b7d-6107-6048abb41759@gmail.com \
    --to=igormtorrente@gmail.com \
    --cc=airlied@linux.ie \
    --cc=contact@emersion.fr \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=hamohammed.sa@gmail.com \
    --cc=leandro.ribeiro@collabora.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkcamp@lists.libreplanetbr.org \
    --cc=melissa.srw@gmail.com \
    --cc=ppaalanen@gmail.com \
    --cc=rodrigosiqueiramelo@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.