Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

From: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
To: Pekka Paalanen <ppaalanen@gmail.com>
Cc: rodrigosiqueiramelo@gmail.com, melissa.srw@gmail.com,
	hamohammed.sa@gmail.com, daniel@ffwll.ch, airlied@linux.ie,
	contact@emersion.fr, leandro.ribeiro@collabora.com,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	lkcamp@lists.libreplanetbr.org
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats
Date: Mon, 18 Oct 2021 16:26:06 -0300	[thread overview]
Message-ID: <d5f92494-9c55-2b7d-6107-6048abb41759@gmail.com> (raw)
In-Reply-To: <20211018113009.5519457c@eldfell>

Hi Pekka,

On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> On Tue,  5 Oct 2021 17:16:37 -0300
> Igor Matheus Andrade Torrente <igormtorrente@gmail.com> wrote:
> 
>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>> as a color input.
>>
>> This patch refactors all the functions related to the plane composition
>> to overcome this limitation.
>>
>> Now the blend function receives a format handler to each plane and a
>> blend function pointer. It will take two ARGB_1616161616 pixels, one
>> for each handler, and will use the blend function to calculate and
>> store the final color in the output buffer.
>>
>> These format handlers will receive the `vkms_composer` and a pair of
>> coordinates. And they should return the respective pixel in the
>> ARGB_16161616 format.
>>
>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>> the vkms_composer of the output buffer. The method should perform the
>> blend operation and store output to the format aforementioned
>> ARGB_16161616.
> 
> Hi,
> 
> here are some drive-by comments which you are free to take or leave.
> 
> To find more efficient ways to do this, you might want research Pixman
> library. It's MIT licensed.
> 
> IIRC, the elementary operations there operate on pixel lines (pointer +
> length), not individual pixels (image, x, y). Input conversion function
> reads and converts a whole line a time. Blending function blends two
> lines and writes the output into back one of the lines. Output
> conversion function takes a line and converts it into destination
> buffer. That way the elementary operations do not need to take stride
> into account, and blending does not need to deal with varying memory
> alignment FWIW. The inner loops involve much less function calls and
> state, probably.

I was doing some rudimentary profiling, and I noticed that the code 
spends most of the time with the indirect system call overhead and not 
the actual computation. This can definitively help with it.

> 
> Pixman may not do exactly this, but it does something very similar.
> Pixman also has multiple different levels of optimisations, which may
> not be necessary for VKMS.
> 
> It's a trade-off between speed and temporary memory consumed. You need
> temporary buffers for two lines, but not more (not a whole image in
> full intermediate precision). Further optimisation could determine
> whether to process whole image rows as lines, or split a row into
> multiple lines to stay within CPU cache size.
> 

Sorry, I didn't understand the idea of the last sentence.

> Since it seems you are blending multiple planes into a single
> destination which is assumed to be opaque, you might also be able to do
> the multiple blends without convert-writing and read-converting to/from
> the destination between every plane. I think that might be possible to
> architect on top of the per-line operations still.

I tried it. But I don't know how to do this without looking like a mess. 
Does the pixman perform some similar?

> 
>> Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
>> ---
>>   drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>>   drivers/gpu/drm/vkms/vkms_formats.h  | 125 ++++++++++++
>>   2 files changed, 271 insertions(+), 129 deletions(-)
>>   create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> 
> ...
> 
>> +
>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Organizes the channels in their respective positions and converts
>> +	 * the 8 bits channel to 16.
>> +	 * The 257 is the "conversion ratio". This number is obtained by the
>> +	 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
>> +	 * the best color value in a color space with more possibilities.
> 
> Pixel format, not color space. >
>> +	 * And a similar idea applies to others RGB color conversions.
>> +	 */
>> +	return ((u64)pixel_addr[3] * 257) << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
> 
> I wonder if the compiler is smart enough to choose between "mul 257"
> and "(v << 8) | v" operations... but that's probably totally drowning
> under the overhead of per (x,y) looping.

I disassembled the code to check it. And looks like the compiler is 
replacing the multiplication with shifts and additions.

ARGB8888_to_ARGB16161616:
    0xffffffff816ad660 <+0>:     imul   0x12c(%rdi),%edx
    0xffffffff816ad667 <+7>:     imul   0x130(%rdi),%esi
    0xffffffff816ad66e <+14>:    add    %edx,%esi
    0xffffffff816ad670 <+16>:    add    0x128(%rdi),%esi
    0xffffffff816ad676 <+22>:    movslq %esi,%rax
    0xffffffff816ad679 <+25>:    add    0xe8(%rdi),%rax
    0xffffffff816ad680 <+32>:    movzbl 0x3(%rax),%ecx
    0xffffffff816ad684 <+36>:    movzbl 0x2(%rax),%esi
    0xffffffff816ad688 <+40>:    mov    %rcx,%rdx
    0xffffffff816ad68b <+43>:    shl    $0x8,%rdx
    0xffffffff816ad68f <+47>:    add    %rcx,%rdx
    0xffffffff816ad692 <+50>:    mov    %rsi,%rcx
    0xffffffff816ad695 <+53>:    shl    $0x8,%rcx
    0xffffffff816ad699 <+57>:    shl    $0x30,%rdx
    0xffffffff816ad69d <+61>:    add    %rsi,%rcx
    0xffffffff816ad6a0 <+64>:    movzbl (%rax),%esi
    0xffffffff816ad6a3 <+67>:    shl    $0x20,%rcx
    0xffffffff816ad6a7 <+71>:    or     %rcx,%rdx
    0xffffffff816ad6aa <+74>:    mov    %rsi,%rcx
    0xffffffff816ad6ad <+77>:    shl    $0x8,%rcx
    0xffffffff816ad6b1 <+81>:    add    %rsi,%rcx
    0xffffffff816ad6b4 <+84>:    or     %rcx,%rdx
    0xffffffff816ad6b7 <+87>:    movzbl 0x1(%rax),%ecx
    0xffffffff816ad6bb <+91>:    mov    %rcx,%rax
    0xffffffff816ad6be <+94>:    shl    $0x8,%rax
    0xffffffff816ad6c2 <+98>:    add    %rcx,%rax
    0xffffffff816ad6c5 <+101>:   shl    $0x10,%rax
    0xffffffff816ad6c9 <+105>:   or     %rdx,%rax
    0xffffffff816ad6cc <+108>:   ret

> 
>> +}
>> +
>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * The same as the ARGB8888 but with the alpha channel as the
>> +	 * maximum value as possible.
>> +	 */
>> +	return 0xffffllu << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
>> +}
>> +
>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Because the format byte order is in little-endian and this code
>> +	 * needs to run on big-endian machines too, we need modify
>> +	 * the byte order from little-endian to the CPU native byte order.
>> +	 */
>> +	return le64_to_cpu(*pixel_addr);
>> +}
>> +
>> +/*
>> + * The following functions are used as blend operations. But unlike the
>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>> + * source, convert it to a specific format, and store it in the destination.
>> + *
>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>> + * copy and convert one pixel from/to the output buffer to/from
>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>> + */
>> +
>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	/*
>> +	 * This sequence below is important because the format's byte order is
>> +	 * in little-endian. In the case of the ARGB8888 the memory is
>> +	 * organized this way:
>> +	 *
>> +	 * | Addr     | = blue channel
>> +	 * | Addr + 1 | = green channel
>> +	 * | Addr + 2 | = Red channel
>> +	 * | Addr + 3 | = Alpha channel
>> +	 */
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
> 
> This could be potentially expensive if the compiler ends up using an
> actual div instruction.
> 
Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the 
"Faster div by 255" works to adapt to 16 bits.

> Btw. this would be shorter to write as
> 
> 	(argb_src1 >> 16) & 0xffff
> 
> etc.
I will use it in the V2. Thanks.

> 
> Thanks,
> pq
> 
>> +}
>> +
>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = 0xff;
>> +}
>> +
>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			     struct vkms_composer *dst_composer)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	*pixel_addr = cpu_to_le64(argb_src1);
>> +}
>> +
>> +#endif /* _VKMS_FORMATS_H_ */
>