From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A79C2C433FE
	for <linux-kernel@archiver.kernel.org>; Mon, 18 Oct 2021 19:26:13 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 8D33261283
	for <linux-kernel@archiver.kernel.org>; Mon, 18 Oct 2021 19:26:13 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S231956AbhJRT2X (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Mon, 18 Oct 2021 15:28:23 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32880 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S229836AbhJRT2W (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 18 Oct 2021 15:28:22 -0400
Received: from mail-ua1-x931.google.com (mail-ua1-x931.google.com [IPv6:2607:f8b0:4864:20::931])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5CF3C06161C
        for <linux-kernel@vger.kernel.org>; Mon, 18 Oct 2021 12:26:10 -0700 (PDT)
Received: by mail-ua1-x931.google.com with SMTP id f4so2732443uad.4
        for <linux-kernel@vger.kernel.org>; Mon, 18 Oct 2021 12:26:10 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=subject:to:cc:references:from:message-id:date:user-agent
         :mime-version:in-reply-to:content-language:content-transfer-encoding;
        bh=6q9mrYREhy1PS0zxwbz6IygpGUtwaBnRqvZjeZk381w=;
        b=RY1BYPX+CGClIeAb+4/EKVV7QzBSx+Goqe566R5NqfrEOUXLrCu3Hvnnwy8XBJzh7e
         B+MnaxDaPfX7pXad4XBu57TlLMefrNJhhxLhkMjGlBLK+niAN0Ga+dtgfjwgl0LGI4D3
         1ge1AAMXAVjSWilE0IJVM3FkNYgJNtocaVt1qxbXll6EqU1GnydyY43QkZBdoqpNT6C3
         qMg03sMNpNS4QGsD15wxRf28mNBFcRyiPsndOC3H8VZobuAntm94xG5+TAqfpp502e1+
         f+80VEqwGwEpANSIuxnnFW6yd0jvI+Dy4rYwmwDBgHdKdtdbswNGa17beHlt5FfRXl0o
         rKDA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:subject:to:cc:references:from:message-id:date
         :user-agent:mime-version:in-reply-to:content-language
         :content-transfer-encoding;
        bh=6q9mrYREhy1PS0zxwbz6IygpGUtwaBnRqvZjeZk381w=;
        b=VH357+TgkjoB9cxV59aOKpwznmgYAPKv9bwQD6iG3ODOFnErjEbcy0umCDrwVk/HDO
         uim/SHMebUhlb6E6wPXePmlu8tmgHu6tJPQILLya6nAIT0UTZG5Sls6dZN5qje96d780
         V95qpA0Gs95A64cvcN/flbnKYncSYT6kW2M0LlrKSt63rwNDkypVOyEm/yR9O0rCt8g8
         Xy+JjEBM01AcFM16rpC+RyCN98trajyVBhNqZwJw1/bMxf1s7gJfJZRwO6F52zMYrQZX
         qKP8W4VcAoo+Ux71qXfcZOn+3bF18PwbzGEXqw8dDBrz7/VSSQ8WZOePueWqCsicYs+4
         pVxQ==
X-Gm-Message-State: AOAM533p9zs8CcHIqm0KkevbcmvG5B6eE0yPo/CYLX/FR/ba58EFqmxy
        o2amLOYB9xwbZC+0lgz3EyI=
X-Google-Smtp-Source: ABdhPJwQr9j8xnumy0Lb0UXlWqho1sFEyszT/pmlUflLpUXCEybORWiM8jx9B3HdIg4ppQ18QhAw5Q==
X-Received: by 2002:a67:d504:: with SMTP id l4mr29712049vsj.42.1634585169691;
        Mon, 18 Oct 2021 12:26:09 -0700 (PDT)
Received: from ?IPv6:2804:431:c7f4:b20b:2ce0:3c04:a56a:40cc? ([2804:431:c7f4:b20b:2ce0:3c04:a56a:40cc])
        by smtp.gmail.com with ESMTPSA id x21sm9571007uao.2.2021.10.18.12.26.07
        (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
        Mon, 18 Oct 2021 12:26:09 -0700 (PDT)
Subject: Re: [PATCH 6/6] drm: vkms: Refactor the plane composer to accept new
 formats
To:     Pekka Paalanen <ppaalanen@gmail.com>
Cc:     rodrigosiqueiramelo@gmail.com, melissa.srw@gmail.com,
        hamohammed.sa@gmail.com, daniel@ffwll.ch, airlied@linux.ie,
        contact@emersion.fr, leandro.ribeiro@collabora.com,
        dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
        lkcamp@lists.libreplanetbr.org
References: <20211005201637.58563-1-igormtorrente@gmail.com>
 <20211005201637.58563-7-igormtorrente@gmail.com>
 <20211018113009.5519457c@eldfell>
From:   Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
Message-ID: <d5f92494-9c55-2b7d-6107-6048abb41759@gmail.com>
Date:   Mon, 18 Oct 2021 16:26:06 -0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.13.0
MIME-Version: 1.0
In-Reply-To: <20211018113009.5519457c@eldfell>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Pekka,

On 10/18/21 5:30 AM, Pekka Paalanen wrote:
> On Tue,  5 Oct 2021 17:16:37 -0300
> Igor Matheus Andrade Torrente <igormtorrente@gmail.com> wrote:
> 
>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>> as a color input.
>>
>> This patch refactors all the functions related to the plane composition
>> to overcome this limitation.
>>
>> Now the blend function receives a format handler to each plane and a
>> blend function pointer. It will take two ARGB_1616161616 pixels, one
>> for each handler, and will use the blend function to calculate and
>> store the final color in the output buffer.
>>
>> These format handlers will receive the `vkms_composer` and a pair of
>> coordinates. And they should return the respective pixel in the
>> ARGB_16161616 format.
>>
>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>> the vkms_composer of the output buffer. The method should perform the
>> blend operation and store output to the format aforementioned
>> ARGB_16161616.
> 
> Hi,
> 
> here are some drive-by comments which you are free to take or leave.
> 
> To find more efficient ways to do this, you might want research Pixman
> library. It's MIT licensed.
> 
> IIRC, the elementary operations there operate on pixel lines (pointer +
> length), not individual pixels (image, x, y). Input conversion function
> reads and converts a whole line a time. Blending function blends two
> lines and writes the output into back one of the lines. Output
> conversion function takes a line and converts it into destination
> buffer. That way the elementary operations do not need to take stride
> into account, and blending does not need to deal with varying memory
> alignment FWIW. The inner loops involve much less function calls and
> state, probably.

I was doing some rudimentary profiling, and I noticed that the code 
spends most of the time with the indirect system call overhead and not 
the actual computation. This can definitively help with it.

> 
> Pixman may not do exactly this, but it does something very similar.
> Pixman also has multiple different levels of optimisations, which may
> not be necessary for VKMS.
> 
> It's a trade-off between speed and temporary memory consumed. You need
> temporary buffers for two lines, but not more (not a whole image in
> full intermediate precision). Further optimisation could determine
> whether to process whole image rows as lines, or split a row into
> multiple lines to stay within CPU cache size.
> 

Sorry, I didn't understand the idea of the last sentence.

> Since it seems you are blending multiple planes into a single
> destination which is assumed to be opaque, you might also be able to do
> the multiple blends without convert-writing and read-converting to/from
> the destination between every plane. I think that might be possible to
> architect on top of the per-line operations still.

I tried it. But I don't know how to do this without looking like a mess. 
Does the pixman perform some similar?

> 
>> Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente@gmail.com>
>> ---
>>   drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>>   drivers/gpu/drm/vkms/vkms_formats.h  | 125 ++++++++++++
>>   2 files changed, 271 insertions(+), 129 deletions(-)
>>   create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> 
> ...
> 
>> +
>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Organizes the channels in their respective positions and converts
>> +	 * the 8 bits channel to 16.
>> +	 * The 257 is the "conversion ratio". This number is obtained by the
>> +	 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
>> +	 * the best color value in a color space with more possibilities.
> 
> Pixel format, not color space. >
>> +	 * And a similar idea applies to others RGB color conversions.
>> +	 */
>> +	return ((u64)pixel_addr[3] * 257) << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
> 
> I wonder if the compiler is smart enough to choose between "mul 257"
> and "(v << 8) | v" operations... but that's probably totally drowning
> under the overhead of per (x,y) looping.

I disassembled the code to check it. And looks like the compiler is 
replacing the multiplication with shifts and additions.

ARGB8888_to_ARGB16161616:
    0xffffffff816ad660 <+0>:     imul   0x12c(%rdi),%edx
    0xffffffff816ad667 <+7>:     imul   0x130(%rdi),%esi
    0xffffffff816ad66e <+14>:    add    %edx,%esi
    0xffffffff816ad670 <+16>:    add    0x128(%rdi),%esi
    0xffffffff816ad676 <+22>:    movslq %esi,%rax
    0xffffffff816ad679 <+25>:    add    0xe8(%rdi),%rax
    0xffffffff816ad680 <+32>:    movzbl 0x3(%rax),%ecx
    0xffffffff816ad684 <+36>:    movzbl 0x2(%rax),%esi
    0xffffffff816ad688 <+40>:    mov    %rcx,%rdx
    0xffffffff816ad68b <+43>:    shl    $0x8,%rdx
    0xffffffff816ad68f <+47>:    add    %rcx,%rdx
    0xffffffff816ad692 <+50>:    mov    %rsi,%rcx
    0xffffffff816ad695 <+53>:    shl    $0x8,%rcx
    0xffffffff816ad699 <+57>:    shl    $0x30,%rdx
    0xffffffff816ad69d <+61>:    add    %rsi,%rcx
    0xffffffff816ad6a0 <+64>:    movzbl (%rax),%esi
    0xffffffff816ad6a3 <+67>:    shl    $0x20,%rcx
    0xffffffff816ad6a7 <+71>:    or     %rcx,%rdx
    0xffffffff816ad6aa <+74>:    mov    %rsi,%rcx
    0xffffffff816ad6ad <+77>:    shl    $0x8,%rcx
    0xffffffff816ad6b1 <+81>:    add    %rsi,%rcx
    0xffffffff816ad6b4 <+84>:    or     %rcx,%rdx
    0xffffffff816ad6b7 <+87>:    movzbl 0x1(%rax),%ecx
    0xffffffff816ad6bb <+91>:    mov    %rcx,%rax
    0xffffffff816ad6be <+94>:    shl    $0x8,%rax
    0xffffffff816ad6c2 <+98>:    add    %rcx,%rax
    0xffffffff816ad6c5 <+101>:   shl    $0x10,%rax
    0xffffffff816ad6c9 <+105>:   or     %rdx,%rax
    0xffffffff816ad6cc <+108>:   ret

> 
>> +}
>> +
>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * The same as the ARGB8888 but with the alpha channel as the
>> +	 * maximum value as possible.
>> +	 */
>> +	return 0xffffllu << 48 |
>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>> +	       ((u64)pixel_addr[0] * 257);
>> +}
>> +
>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>> +
>> +	/*
>> +	 * Because the format byte order is in little-endian and this code
>> +	 * needs to run on big-endian machines too, we need modify
>> +	 * the byte order from little-endian to the CPU native byte order.
>> +	 */
>> +	return le64_to_cpu(*pixel_addr);
>> +}
>> +
>> +/*
>> + * The following functions are used as blend operations. But unlike the
>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>> + * source, convert it to a specific format, and store it in the destination.
>> + *
>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>> + * copy and convert one pixel from/to the output buffer to/from
>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>> + */
>> +
>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	/*
>> +	 * This sequence below is important because the format's byte order is
>> +	 * in little-endian. In the case of the ARGB8888 the memory is
>> +	 * organized this way:
>> +	 *
>> +	 * | Addr     | = blue channel
>> +	 * | Addr + 1 | = green channel
>> +	 * | Addr + 2 | = Red channel
>> +	 * | Addr + 3 | = Alpha channel
>> +	 */
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
> 
> This could be potentially expensive if the compiler ends up using an
> actual div instruction.
> 
Yes, I'm using the DIV_ROUND_UP because I couldn't figure out how the 
"Faster div by 255" works to adapt to 16 bits.

> Btw. this would be shorter to write as
> 
> 	(argb_src1 >> 16) & 0xffff
> 
> etc.
I will use it in the V2. Thanks.

> 
> Thanks,
> pq
> 
>> +}
>> +
>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			 struct vkms_composer *dst_composer)
>> +{
>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>> +	pixel_addr[3] = 0xff;
>> +}
>> +
>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>> +			     struct vkms_composer *dst_composer)
>> +{
>> +	__le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>> +
>> +	*pixel_addr = cpu_to_le64(argb_src1);
>> +}
>> +
>> +#endif /* _VKMS_FORMATS_H_ */
>