Re: [PATCH v2 5/7] sm501: Replace hand written implementation with pixman where possible

From: BALATON Zoltan <balaton@eik.bme.hu>
To: Gerd Hoffmann <kraxel@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	Sebastian Bauer <mail@sebastianbauer.info>,
	Magnus Damm <magnus.damm@gmail.com>,
	qemu-devel@nongnu.org, Aurelien Jarno <aurelien@aurel32.net>
Subject: Re: [PATCH v2 5/7] sm501: Replace hand written implementation with pixman where possible
Date: Tue, 26 May 2020 15:35:21 +0200 (CEST)	[thread overview]
Message-ID: <alpine.BSF.2.22.395.2005261434540.87757@zero.eik.bme.hu> (raw)
In-Reply-To: <20200526104318.wmsqqtia3h52l454@sirius.home.kraxel.org>

On Tue, 26 May 2020, Gerd Hoffmann wrote:
> On Thu, May 21, 2020 at 09:39:44PM +0200, BALATON Zoltan wrote:
>> Besides being faster this should also prevent malicious guests to
>> abuse 2D engine to overwrite data or cause a crash.
>
>>          uint32_t src_base = s->twoD_source_base & 0x03FFFFFF;
>> -        uint8_t *src = s->local_mem + src_base;
>
>> -                    val = *(_pixel_type *)&src[index_s];                      \
>
> Well, the advantage of *not* using pixman is that you can easily switch
> the code to use offsets instead of pointers, then apply the mask to the
> *final* offset to avoid oob data access:

The mask applied to src_base is not to prevent overflow but to implement 
register limits. Only these bits are valid if I remember correctly, so 
even if I use offsets I need to check for overflow. This patch basically 
does that by changing parameters to unsigned to prevent them being 
negative, checking values we multiply by to prevent them to be zero and 
then calculating first and last offset and check if they are within vram. 
(Unless of course I've made a mistake somewhere.) This should prevent 
overflow with one check and does not need to apply a mask at every step. 
The vram size can also be different so it's not a fixed mask anyway.

If not using pixman then I'd need to reimplement optimised 2D ops that 
will likely never be as good as pixman and no point in doing it several 
times for every device model so I'd rather try to use pixman where 
possible unless a better library is available.

>    val = *(_pixel_type*)(&s->local_mem[(s->twoD_source_base + index_s) & 0x03FFFFFF]);
>
>> +        if ((rop_mode && rop == 0x5) || (!rop_mode && rop == 0x55)) {
>> +            /* Invert dest, is there a way to do this with pixman? */
>
> PIXMAN_OP_XOR maybe?

Maybe, but looking at the pixman source I couldn't decide if

UN8x4_MUL_UN8_ADD_UN8x4_MUL_UN8 (s, dest_ia, d, src_ia);

seen here:
https://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine32.c#n396
is really the same as s ^ d.

>> +            if (rtl && ((db >= sb && db <= se) || (de >= sb && de <= se))) {
>> +                /* regions may overlap: copy via temporary */
>
> The usual way for a hardware blitter is to have a direction bit, i.e.
> the guest os can ask to blit in top->bottom or bottom->top scanline
> ordering.  The guest can use that to make sure the blit does not

Yes, this is the rtl above (right to left) and AmigaOS sets this most of 
the time so only relying on that to detect overlaps is not efficient.

> overwrite things.  But note the guest can also intentionally use
> overlapping regions, i.e. memset(0) the first scanline, then use a blit
> with overlap to clear the whole screen.  The later will surely break if
> you blit via temporary image ...

Fortunately no guest code seems to do that so unless we find one needing 
it I don't worry much about such rare cases. It would be best if pixman 
supported this but while I've found patches were submitted they did not 
get merged so far so using a temporary seems to be the simplest way that 
works well enough for now.

>> +                pixman_blt((uint32_t *)&s->local_mem[src_base],
>> +                           (uint32_t *)&s->local_mem[dst_base],
>> +                           src_pitch * (1 << format) / sizeof(uint32_t),
>> +                           dst_pitch * (1 << format) / sizeof(uint32_t),
>> +                           8 * (1 << format), 8 * (1 << format),
>> +                           src_x, src_y, dst_x, dst_y, width, height);
>
> See above, i'm not convinced pixman is the best way here.
> When using pixman I'd suggest:
>
>  (1) src = pixman_image_create_bits_no_clear(...);
>  (2) dst = pixman_image_create_bits_no_clear(...);
>  (3) pixman_image_composite(PIXMAN_OP_SRC, src, NULL, dst, ...);
>  (4) pixman_image_unref(src);
>  (5) pixman_image_unref(dst);
>
> pixman_blt() is probably doing basically the same.

Actually not the same, pixman_blt is faster operating directly on pointers 
while we need all the pixman_image overhead to use pixman_image_composite. 
Blitter is used for a lot of small ops (I've seen AmigaOS even call it 
with 1 pixel regions) so going through pixman_image every time does not 
seem to be efficient. To implement more complex ops this may be needed so 
I may try to figure that out later but I'd need some test cases to test if 
the results are correct. The current patches do the same as before (except 
for some rare overlapping cases as you noted above but we haven't observed 
any yet) and fix the overflows so this was the best I could do in the time 
I had. Maybe I try to improve this later but don't plan to rewrite it now.

>  The advantage of not
> using pixman_blt() is that
>
>  (a) you can also use pixman ops other than PIXMAN_OP_SRC, and
>  (b) you can have a helper function for (1)+(2) which very carefully
>      applies sanity checks to make sure the pixman image created stays
>      completely inside s->local_mem.
>  (c) you have the option to completely rearrange the code flow, for
>      example update the src pixman image whenever the guest touches
>      src_base or src_pitch or format instead of having a
>      create/op/unref cycle on every blitter op.

From traces I think most guest would write bltter related regs on every op 
so probably not worth the hassle to try to update regions on register 
access and we could do it on every op, possibly optimising 1 pixel blits 
and small regions via some special cases but even then simple copy image 
is probably the most common op that might worth doing via pixman_blt as 
it's expected to be frequently used so the less overhead is the better. 
Therefore I'd only use image_composite for more complex ops but that's too 
much effort for a relatively unused device model. Maybe for ati-vga I'll 
try to make it better but first should fix microengine for that so drivers 
can talk to it. I'd rather spend my limited free time on that than further 
improving sm501 unless some bugs show up.

>> +        pixman_fill((uint32_t *)&s->local_mem[dst_base],
>> +                    dst_pitch * (1 << format) / sizeof(uint32_t),
>> +                    8 * (1 << format), dst_x, dst_y, width, height, color);
>
>  (1) src = pixman_image_create_solid(...), otherwise same as above ;)

Same argument as composite_image and for fill we don't even have any 
advantage so while for composite implementing other ops is a reason to not 
use pixman_blt I see no reason to not go the fastest way for fill.

Regards,
BALATON Zoltan