* RFC: hardware accelerated bitblt using dma engine
@ 2016-08-02 13:21 Enrico Weigelt, metux IT consult
  2016-08-02 14:04 ` Daniel Vetter
  2016-08-03  9:24 ` Marek Szyprowski
  0 siblings, 2 replies; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-02 13:21 UTC (permalink / raw)
  To: dri devel

Hi folks,


I'm currently thinking about adding a hw-accelerated bitblt operation.
The idea goes like this:

* we add some bitblt ioctl which copies rects between BOs.
  (it also handles memory layouts, pixfmt conversion, etc)
* the driver can decide to let the GPU or IPU do that, if available
* if we have a suitable DMA engine (maybe only the more complex ones
  which can handle lines on their own ...) we'll use that
* as fallback, resort to memcpy().
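
The dispatch order in the list above could be sketched roughly as follows.
This is only a userspace illustration under assumptions: the types
blit_buf / blit_rect, the backend table and blit_dma_stub() are hypothetical
names, not existing DRM interfaces, and pixfmt conversion is left out
(buffers must share the same bytes-per-pixel).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of a linear pixel buffer (not a real DRM type). */
struct blit_buf {
    uint8_t *data;    /* start of pixel data */
    size_t   pitch;   /* bytes per scanline */
    size_t   cpp;     /* bytes per pixel */
    uint32_t width;   /* in pixels */
    uint32_t height;  /* in scanlines */
};

struct blit_rect { uint32_t x, y, w, h; };

/* A backend returns 0 on success, non-zero if it cannot take the job. */
typedef int (*blit_fn)(struct blit_buf *dst, const struct blit_buf *src,
                       const struct blit_rect *r);

/* Stand-in for an accelerated backend (GPU, IPU, DMA engine) that is
 * not available on this platform, so it always declines. */
static int blit_dma_stub(struct blit_buf *dst, const struct blit_buf *src,
                         const struct blit_rect *r)
{
    (void)dst; (void)src; (void)r;
    return -1;
}

/* CPU fallback: one memcpy() per scanline of the rectangle. */
static int blit_memcpy(struct blit_buf *dst, const struct blit_buf *src,
                       const struct blit_rect *r)
{
    for (uint32_t row = 0; row < r->h; row++) {
        size_t y = (size_t)r->y + row;
        memcpy(dst->data + y * dst->pitch + r->x * dst->cpp,
               src->data + y * src->pitch + r->x * src->cpp,
               r->w * src->cpp);
    }
    return 0;
}

/* Try the backends in priority order, then fall back to memcpy(). */
static int blit_copy(struct blit_buf *dst, const struct blit_buf *src,
                     const struct blit_rect *r,
                     const blit_fn *backends, size_t n_backends)
{
    /* No format conversion in this sketch. */
    if (dst->cpp != src->cpp)
        return -1;
    /* Reject rectangles that leave either buffer (64-bit math avoids wrap). */
    if ((uint64_t)r->x + r->w > src->width ||
        (uint64_t)r->x + r->w > dst->width ||
        (uint64_t)r->y + r->h > src->height ||
        (uint64_t)r->y + r->h > dst->height)
        return -1;

    for (size_t i = 0; i < n_backends; i++)
        if (backends[i] && backends[i](dst, src, r) == 0)
            return 0;

    return blit_memcpy(dst, src, r);
}
```

A caller would pass the backends in order of preference; with only the
declining stub in the table, the copy ends up in blit_memcpy().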


Whether a DMA engine can/should be used might be highly hw specific,
so that probably would be configured in DT.

To use that feature, userland could actually allocate two BOs:
one that's mapped as a framebuffer to some crtc, the other one just
a memory buffer. It could then render to the fast memory buffer and
tell the DRM to only copy over the changed regions to the graphics
memory via DMA (or whatever is best on that particular hw platform).
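
The "only copy over the changed regions" part might look like this on the
userland side. It is a minimal sketch, assuming both buffers are linear with
the same pitch; struct damage, damage_add() and damage_flush() are
hypothetical names, and the memcpy() loop merely stands in for whatever copy
mechanism the platform offers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Half-open rectangle: covers [x1, x2) x [y1, y2). */
struct rect { uint32_t x1, y1, x2, y2; };

/* Accumulated damage, kept as a single bounding box for simplicity. */
struct damage {
    bool dirty;
    struct rect box;
};

/* Record that a region of the shadow buffer was modified. */
static void damage_add(struct damage *d, struct rect r)
{
    if (r.x1 >= r.x2 || r.y1 >= r.y2)
        return;                       /* empty rect, nothing to track */
    if (!d->dirty) {
        d->box = r;
        d->dirty = true;
        return;
    }
    if (r.x1 < d->box.x1) d->box.x1 = r.x1;
    if (r.y1 < d->box.y1) d->box.y1 = r.y1;
    if (r.x2 > d->box.x2) d->box.x2 = r.x2;
    if (r.y2 > d->box.y2) d->box.y2 = r.y2;
}

/* Copy only the damaged box from the shadow buffer to the scanout buffer
 * and reset the damage state. Returns the number of bytes copied. */
static size_t damage_flush(struct damage *d, uint8_t *scanout,
                           const uint8_t *shadow, size_t pitch, size_t cpp)
{
    size_t copied = 0;

    if (!d->dirty)
        return 0;

    size_t row_bytes = (size_t)(d->box.x2 - d->box.x1) * cpp;
    for (uint32_t y = d->box.y1; y < d->box.y2; y++) {
        size_t off = (size_t)y * pitch + (size_t)d->box.x1 * cpp;
        memcpy(scanout + off, shadow + off, row_bytes);
        copied += row_bytes;
    }
    d->dirty = false;
    return copied;
}
```

A single bounding box can over-copy when the changes are far apart; a list
of rectangles would avoid that at the cost of more bookkeeping.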


What do you think about that idea?


--mtx
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-02 13:21 RFC: hardware accelerated bitblt using dma engine Enrico Weigelt, metux IT consult
@ 2016-08-02 14:04 ` Daniel Vetter
  2016-08-02 21:43   ` Enrico Weigelt, metux IT consult
  2016-08-03  9:24 ` Marek Szyprowski
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2016-08-02 14:04 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri devel

On Tue, Aug 02, 2016 at 03:21:08PM +0200, Enrico Weigelt, metux IT consult wrote:
> Hi folks,
> 
> 
> I'm currently thinking about adding a hw-accelerated bitblt operation.
> The idea goes like this:
> 
> * we add some bitblt ioctl which copies rects between bo's.
>   (it also handles memory layouts, pixfmt conversion, etc)
> * the driver can decide to let the GPU or IPU do that, if available
> * if we have a suitable DMA engine (maybe only the more complex ones
>   which can handle lines on their own ...) we'll use that
> * as fallback, resort to memcpy().
> 
> 
> Whether a DMA engine can/should be used might be highly hw specific,
> so that probably would be configured in DT.
> 
> To use that feature, userland could actually allocate two BO's,
> one that's mapped as a framebuffer to some crtc, another one just
> a memory buffer. It could then render to the fast memory buffer and
> tell the DRM to only copy over the changed regions to the graphics
> memory via DMA (or whatever is best on that particular hw platform).
> 
> 
> What do you think about that idea ?

If you mean "add a generic hw-accelerated bitblt operation": This is not
how drm works. The generic kms stuff is about display only, with just very
basic (hence "dumb") buffer allocation support in a generic way.

If you mean "expose the dma engine I have here to userspace in
driver-private ioctls with the trade-off logic between that, kms
compositing using the display block and memcpy in userspace", then go
ahead ;-) But if you do that, pls don't forget that for any uapi the
drm subsystem requires corresponding open source userspace (in a real
app/compositor, not just some toy test or something similar).

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-02 14:04 ` Daniel Vetter
@ 2016-08-02 21:43   ` Enrico Weigelt, metux IT consult
  2016-08-02 23:12     ` Rob Clark
  0 siblings, 1 reply; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-02 21:43 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri devel

On 02.08.2016 16:04, Daniel Vetter wrote:

> If you mean "add a generic hw-accelerated bitblt operation": This is not
> how drm works. The generic kms stuff is about display only, with just very
> basic (hence "dumb") buffer allocation support in a generic way.

Well, if it already does buffer allocation and mapping (which might
also involve copying around physical buffers), why not also add
copy-between-buffers?

> If you mean "expose the dma engine I have here to userspace in
> driver-private ioctls with the trade-off logic between that, kms
> compositing using the display block and memcpy in userspace", then go
> ahead ;-) But if you do that, pls don't forget that for any uapi the
> drm subsystem requires corresponding open source userspace (in a real
> app/compositor, not just some toy test or something similar).

I don't intend to add yet another specific driver and driver-specific
ioctl()s, but instead a generic interface. Such stuff needs kernel
support and kernel configuration anyways, so I'd like to keep it out
of userland's business.


--mtx


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-02 21:43   ` Enrico Weigelt, metux IT consult
@ 2016-08-02 23:12     ` Rob Clark
  2016-08-03  3:33       ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 17+ messages in thread
From: Rob Clark @ 2016-08-02 23:12 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri devel

On Tue, Aug 2, 2016 at 5:43 PM, Enrico Weigelt, metux IT consult
<enrico.weigelt@gr13.net> wrote:
> On 02.08.2016 16:04, Daniel Vetter wrote:
>
>> If you mean "add a generic hw-accelerated bitblt operation": This is not
>> how drm works. The generic kms stuff is about display only, with just very
>> basic (hence "dumb") buffer allocation support in a generic way.
>
> Well, if it already does buffer allocation and mapping (which might
> also involve copying around physical buffers), why not also add
> copy-between-buffers ?

Except "dumb" buffers exist *only* for CPU rendered content; you
cannot assume that a gpu can accelerate anything with them.

They basically exist just for simple splash screens and fbcon.

>> If you mean "expose the dma engine I have here to userspace in
>> driver-private ioctls with the trade-off logic between that, kms
>> compositing using the display block and memcpy in userspace", then go
>> ahead ;-) But if you do that, pls don't forget that for any uapi the
>> drm subsystem requires corresponding open source userspace (in a real
>> app/compositor, not just some toy test or something similar).
>
> I don't intend to add yet another specific driver and driver-specific
> ioctl()s, but instead a generic interface. Such stuff needs kernel
> support and kernel configuration anyways, so I'd like to keep it out
> of userland's business.

There is a reason that there is no generic gpu cmd submission ioctl.
It is too hw specific, and anyway it is only used by device
specific userspace (i.e. gl driver and/or xorg ddx).

BR,
-R

>
> --mtx
>

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-02 23:12     ` Rob Clark
@ 2016-08-03  3:33       ` Enrico Weigelt, metux IT consult
  2016-08-03  3:47         ` Dave Airlie
  0 siblings, 1 reply; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-03  3:33 UTC (permalink / raw)
  To: Rob Clark; +Cc: dri devel

On 03.08.2016 01:12, Rob Clark wrote:

Hi,

>> Well, if it already does buffer allocation and mapping (which might
>> also involve copying around physical buffers), why not also add
>> copy-between-buffers ?
> 
> except "dumb" buffers exist *only* for CPU rendered content, you
> cannot assume that a gpu can accelerate anything with them.

Exactly my usecase: having no (usable) GPU at all, but an sdma
controller - or even better: an IPU - which can do the bitblt.
(maybe even w/ colorspace conversion, rotation, etc)

There might be GPUs which can also do that - and in that case it
should be done by the GPU.

> They basically exist just for simple splash screens and fbcon

Or when you don't have a (usable) GPU at all?

> there is a reason that there is no generic gpu cmd submission ioctl.
> It is too much hw specific, 

Sure, but I'm not going to use a GPU at all, but different hw.

> and anyway it is only used by device
> specific userspace (ie. gl driver and/or xorg ddx)

Actually, on my targets I have neither gl nor xorg, and I'd like to
keep userland generic. I'd hate to have lots of hw-specific
cairo-backends when I'll have to touch the kernel anyways, in order
to use sdma or ipu.


By the way: while hacking a bit on mesa (backporting to Trusty),
I came across separate hw-specific calls for retrieving the video
memory size. Seems to be a really common thing ... is there any
hw that does not have such a thing? Couldn't that be a generic
ioctl()?

I somewhat got the strange feeling that anything that goes beyond
a very trivial dumb framebuffer has hw-specific ioctls ;-o


--mtx


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03  3:33       ` Enrico Weigelt, metux IT consult
@ 2016-08-03  3:47         ` Dave Airlie
  2016-08-03  4:39           ` Enrico Weigelt, metux IT consult
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Airlie @ 2016-08-03  3:47 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri devel

On 3 August 2016 at 13:33, Enrico Weigelt, metux IT consult
<enrico.weigelt@gr13.net> wrote:
> On 03.08.2016 01:12, Rob Clark wrote:
>
> Hi,
>
>>> Well, if it already does buffer allocation and mapping (which might
>>> also involve copying around physical buffers), why not also add
>>> copy-between-buffers ?
>>
>> except "dumb" buffers exist *only* for CPU rendered content, you
>> cannot assume that a gpu can accelerate anything with them.
>
> Exactly my usecase: having no (usable) GPU at all, but an sdma
> controller - or even better: an IPU - which can do the bitblt.
> (maybe even w/ colorspace conversion, rotation, etc)
>
> There might be GPUs which can also do that - and in that case it
> should be done by the GPU.
>
>> They basically exist just for simple splash screens and fbcon
>
> Or when you don't have a (usable) GPU at all?
>
>> there is a reason that there is no generic gpu cmd submission ioctl.
>> It is too much hw specific,
>
> Sure, but I'm not going to use a GPU at all, but different hw.
>
>> and anyway it is only used by device
>> specific userspace (ie. gl driver and/or xorg ddx)
>
> Actually, on my targets I have neither gl nor xorg, and I'd like to
> keep userland generic. I'd hate to have lots of hw-specific
> cairo-backends when I'll have to touch the kernel anyways, in order
> to use sdma or ipu.
>
>
> By the way: while hacking a bit on mesa (backporting to Trusty),
> I came across separate hw-specific calls for retrieving the video
> memory size. Seems to be a really common thing ... is there any
> hw that does not have such a thing? Couldn't that be a generic
> ioctl()?
>
> I somewhat got the strange feeling that anything that goes beyond
> very trivial dumb framebuffer has hw-specific ioctl's ;-o

The thing is, stuff looks generic until you go to use it; just abstract
it in userspace.

Because no hw is the same once you go beyond that.

Video memory size means what? VRAM, GPU accessible system RAM,
amount of CPU visible VRAM?

Dave.

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03  3:47         ` Dave Airlie
@ 2016-08-03  4:39           ` Enrico Weigelt, metux IT consult
  0 siblings, 0 replies; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-03  4:39 UTC (permalink / raw)
  To: Dave Airlie; +Cc: dri devel

On 03.08.2016 05:47, Dave Airlie wrote:

> Because no hw is the same once you go beyond that.

Hmm, it doesn't seem to be so extremely different that we can't
at least abstract some common aspects.

> Video memory size means what? VRAM, GPU accessible system RAM,
> amount of CPU visible VRAM?

Actually, these are separate things, which of course should be
reported in separate fields:

  * phys_aperture_size:
    --> physical maximum for the shared ram between cpu and gpu
        (cpu-accessible gpu-memory)
  * avail_aperture_size:
    --> the logical maximum that the process can map
    --> might be lower than phys_..., eg. due to process limits or
        when running a 32bit userland on 64bit kernel
  * phys_gpu_memory_size:
    --> the total size of gpu's memory (that could be accessed by cpu)
    --> might be larger than phys_aperture_size / avail_aperture_size
        when gpu just has more memory than can be shared w/ cpu
    --> eg. an interesting indicator on how much can be filled w/
        readonly textures (which don't need to be cpu-accessible anymore)
  * avail_gpu_memory_size:
    --> the logical maximum that process can consume
  * phys_shm_size:
    --> max size of shared system memory (directly accessible by
        both gpu and cpu)
    --> commonly available on SoCs - on other hw might be zero
    --> not counting on-board RAM that is hw-mapped to the GPU, thus not
        falling into system memory in the first place.
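
Written down as a C structure, the fields above might look like the sketch
below. This is purely illustrative: struct gfx_mem_info and both helpers are
hypothetical, no such generic query exists in DRM, and the helpers only
encode the relations stated in the list (the mappable aperture may be lower
than the physical one; GPU memory may exceed the aperture).

```c
#include <stdint.h>

/* Hypothetical report of memory sizes, all values in bytes. */
struct gfx_mem_info {
    uint64_t phys_aperture_size;    /* cpu-accessible gpu memory, physical max */
    uint64_t avail_aperture_size;   /* what this process can actually map */
    uint64_t phys_gpu_memory_size;  /* total size of the gpu's memory */
    uint64_t avail_gpu_memory_size; /* logical max this process can consume */
    uint64_t phys_shm_size;         /* shared system memory; zero if none */
};

/* How much cpu-accessible gpu memory the process can really map:
 * the logical limit may be lower than the physical aperture. */
static uint64_t gfx_mem_cpu_mappable(const struct gfx_mem_info *mi)
{
    return mi->avail_aperture_size < mi->phys_aperture_size
         ? mi->avail_aperture_size
         : mi->phys_aperture_size;
}

/* Gpu memory beyond the aperture, i.e. not shareable with the cpu;
 * a rough hint for data that never needs cpu access again. */
static uint64_t gfx_mem_gpu_only(const struct gfx_mem_info *mi)
{
    return mi->phys_gpu_memory_size > mi->phys_aperture_size
         ? mi->phys_gpu_memory_size - mi->phys_aperture_size
         : 0;
}
```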

IMHO, that should catch all usual scenarios, from the fat gamer-GPU
boards to tiny SoCs ... did I miss something here ?

In the end, these values only seem to be used as some statistics for
the userland's decision on how much stuff it uploads to the GPU.

By the way: what about resource limits? Can we control how much GPU
memory an unprivileged process can consume, in order to prevent DoS'ing
other processes (even other users)?


--mtx


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-02 13:21 RFC: hardware accelerated bitblt using dma engine Enrico Weigelt, metux IT consult
  2016-08-02 14:04 ` Daniel Vetter
@ 2016-08-03  9:24 ` Marek Szyprowski
  2016-08-03 11:47   ` Daniel Vetter
  2016-08-03 23:19   ` Enrico Weigelt, metux IT consult
  1 sibling, 2 replies; 17+ messages in thread
From: Marek Szyprowski @ 2016-08-03  9:24 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult, dri devel

Hi Enrico,


On 2016-08-02 15:21, Enrico Weigelt, metux IT consult wrote:
> I'm currently thinking about adding a hw-accelerated bitblt operation.
> The idea goes like this:
>
> * we add some bitblt ioctl which copies rects between bo's.
>    (it also handles memory layouts, pixfmt conversion, etc)
> * the driver can decide to let the GPU or IPU do that, if available
> * if we have a suitable DMA engine (maybe only the more complex ones
>    which can handle lines on their own ...) we'll use that
> * as fallback, resort to memcpy().
>
>
> Whether a DMA engine can/should be used might be highly hw specific,
> so that probably would be configured in DT.
>
> To use that feature, userland could actually allocate two BO's,
> one that's mapped as a framebuffer to some crtc, another one just
> a memory buffer. It could then render to the fast memory buffer and
> tell the DRM to only copy over the changed regions to the graphics
> memory via DMA (or whatever is best on that particular hw platform).
>
>
> What do you think about that idea ?

I'm working now on something similar, but more generic. There is already
a framework for picture processing (converting, scaling, blitting, rotating)
in Exynos DRM. It is called IPP (Image Post Processing), but its user
interface is really ugly and limited, so I plan to rewrite it and make
it really generic. Some discussion of it already took place in the
following thread:
http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743

I plan to propose an API based on DRM object/properties, which will be
similar to KMS atomic API. I will let you know when I have it ready for
presenting in public.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03  9:24 ` Marek Szyprowski
@ 2016-08-03 11:47   ` Daniel Vetter
  2016-08-03 23:32     ` Enrico Weigelt, metux IT consult
  2016-08-03 23:19   ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel Vetter @ 2016-08-03 11:47 UTC (permalink / raw)
  To: Marek Szyprowski; +Cc: dri devel, Enrico Weigelt, metux IT consult

On Wed, Aug 03, 2016 at 11:24:37AM +0200, Marek Szyprowski wrote:
> Hi Enrico,
> 
> 
> On 2016-08-02 15:21, Enrico Weigelt, metux IT consult wrote:
> > I'm currently thinking about adding a hw-accelerated bitblt operation.
> > The idea goes like this:
> > 
> > * we add some bitblt ioctl which copies rects between bo's.
> >    (it also handles memory layouts, pixfmt conversion, etc)
> > * the driver can decide to let the GPU or IPU do that, if available
> > * if we have a suitable DMA engine (maybe only the more complex ones
> >    which can handle lines on their own ...) we'll use that
> > * as fallback, resort to memcpy().
> > 
> > 
> > Whether a DMA engine can/should be used might be highly hw specific,
> > so that probably would be configured in DT.
> > 
> > To use that feature, userland could actually allocate two BO's,
> > one that's mapped as a framebuffer to some crtc, another one just
> > a memory buffer. It could then render to the fast memory buffer and
> > tell the DRM to only copy over the changed regions to the graphics
> > memory via DMA (or whatever is best on that particular hw platform).
> > 
> > 
> > What do you think about that idea ?
> 
> I'm working now on something similar, but more generic. There is already
> a framework for picture processing (converting, scaling, blitting, rotating)
> in Exynos DRM. It is called IPP (Image Post Processing), but its user
> interface is really ugly and limited, so I plan to rewrite it and make
> it really generic. Some discussion of it already took place in the
> following thread:
> http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743
> 
> I plan to propose an API based on DRM object/properties, which will be
> similar to KMS atomic API. I will let you know when I have it ready for
> presenting in public.

In case it's not clear from Dave's, Rob's and my reply: Generic rendering
of any kind is _very_ unpopular in the drm subsystem. We've tried
semi-generic 15 years ago (with some of the shared drm core stuff between
linux and bsd) and it's a disaster of fake generic, single-use code.

The reason for that is that hw accel is actually not simple. You
essentially need as little additional abstraction as possible between
your real client api (hw composer, Xrender or whatever it is) and the hw.
Because for optimal performance you _must_ supply the commands to the
kernel in a format/layout as close as possible to the one used by the
hardware. That means no shared command submission of any kind. And the
other reason is that cache transfers and memory transfers are highly
hardware specific, too. Which means no shared buffer management and
mapping interfaces either.

In short, if you want to get this in you need to disprove the last 15-20
years of linux gfx driver development and show that we've been wrong on
these. Expect _very_ high resistance to anything remotely looking like a
shared/common blitter uapi. Of course having some common helper code to
make drivers easier to type (like cma helpers, or ttm, or similar) is
something entirely different, this is about the uapi.

And please don't be discouraged here, I just want to set clear expectations
to avoid disappointment. Supporting blitter hardware is obviously a good
idea, and I think the drm subsystem is the right place for that
(especially if you have a display block or sometimes a real gpu connected
to that blitter).

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03  9:24 ` Marek Szyprowski
  2016-08-03 11:47   ` Daniel Vetter
@ 2016-08-03 23:19   ` Enrico Weigelt, metux IT consult
  1 sibling, 0 replies; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-03 23:19 UTC (permalink / raw)
  To: Marek Szyprowski, dri devel

On 03.08.2016 11:24, Marek Szyprowski wrote:

Hi,

> I'm working now on something similar, but more generic. There is
> already a framework for picture processing (converting, scaling,
> blitting, rotating) in Exynos DRM.

In DRM, not v4l ? Hmm, interesting.

On mx5/mx6 we've got an IPU, which is accessible via v4l, e.g. for
colorspace conversion, jpeg encode/decode, rotation, etc.
(anyone of the involved folks @ptx here on the list ?)

Yet another overlap between DRM and V4L (IMHO, it seems to be a matter of
perspective and usecases where to put such stuff).

By the way: what's the status of sharing buffers between DRM and V4L?
I could also live with having such a hw-based image-copy operation
living within v4l, when they're operating on the same buffers.

> http://thread.gmane.org/gmane.linux.kernel.samsung-soc/49743

seems to be offline

> I plan to propose an API based on DRM object/properties, which will be
> similar to KMS atomic API. I will let you know when I have it ready for
> presenting in public.

hmm, I'm getting curious ...


--mtx


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03 11:47   ` Daniel Vetter
@ 2016-08-03 23:32     ` Enrico Weigelt, metux IT consult
  2016-08-04  7:50       ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-03 23:32 UTC (permalink / raw)
  To: Daniel Vetter, Marek Szyprowski; +Cc: dri devel

On 03.08.2016 13:47, Daniel Vetter wrote:

> Because for optimal performance you _must_ supply the commands to the
> kernel in an as close to the format/layout used by the hardware as
> possible. That means no shared command submission of any kind. And the
> other reason is that cache transfers and memory transfers are highly
> hardware specific, too. Which means no shared buffer management and
> mapping interfaces either.

Right, but I wonder whether that applies to my case.
Again, I'm talking about using aux IPs (not the actual GPU) for things
like copying image regions, maybe even pixfmt/colorspace conversions -
those things, in embedded world, usually aren't done by the gpu, but
separate IPs.

> Of course having some common helper code to make drivers easier to type
> (like cma helpers, or ttm, or similar) is something entirely
> different, this is about the uapi.

Well, I'm actually talking about an uapi, as userland somehow needs to
call it :p

Doing it in specific drivers doesn't seem to be a good way, as sooner
or later we'd have to implement that into lots of different drivers
(plus corresponding userland support), as it's pretty orthogonal to
GPU, as well as fbs/crtcs. Just in some cases, it **might** also be done
via GPU, if applicable (maybe only when it's idle anyways), but that's
not the usual case. Instead the usual case would be employing some DMA
controller or IPU.

> And please don't be discouraged here, I just want to set clear expectations
> to avoid disappointment. Supporting blitter hardware is obviously a good
> idea, and I think the drm subsystem is the right place for that
> (especially if you have a display block or sometimes a real gpu connected
> to that blitter).

Okay, where else should we put it? Invent an entirely new device for
that?


--mtx


* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-03 23:32     ` Enrico Weigelt, metux IT consult
@ 2016-08-04  7:50       ` Daniel Vetter
  2016-08-04 10:09         ` Daniel Stone
  2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
  0 siblings, 2 replies; 17+ messages in thread
From: Daniel Vetter @ 2016-08-04  7:50 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri devel, Marek Szyprowski

On Thu, Aug 04, 2016 at 01:32:57AM +0200, Enrico Weigelt, metux IT consult wrote:
> On 03.08.2016 13:47, Daniel Vetter wrote:
> 
> > Because for optimal performance you _must_ supply the commands to the
> > kernel in an as close to the format/layout used by the hardware as
> > possible. That means no shared command submission of any kind. And the
> > other reason is that cache transfers and memory transfers are highly
> > hardware specific, too. Which means no shared buffer management and
> > mapping interfaces either.
> 
> Right, but I wonder whether that applies to my case.
> Again, I'm talking about using aux IPs (not the actual GPU) for things
> like copying image regions, maybe even pixfmt/colorspace conversions -
> those things, in embedded world, usually aren't done by the gpu, but
> separate IPs.

15+ years ago gpus weren't much more than fancy blitters either ;-)

> > Of course having some common helper code to make drivers easier to type
> > (like cma helpers, or ttm, or similar) is something entirely
> > different, this is about the uapi.
> 
> Well, I'm actually talking about an uapi, as userland somehow needs to
> call it :p
> 
> Doing it in specific drivers doesn't seem to be a good way, as sooner
> or later we'd have to implement that into lots of different drivers
> (plus corresponding userland support), as it's pretty orthogonal to
> GPU, as well as fbs/crtcs. Just in some cases, it **might** also be done
> via GPU, if applicable (maybe only when it's idle anyways), but that's
> not the usual case. Instead the usual case would be employing some DMA
> controller or IPU.

One problem with 2d blitters is that there's no common userspace
interface, but many: Xrender, hwc, old X drawing api, various attempts by
khronos to standardize something, cairo, ... It's probably worse than
video decoding even, and definitely not like on the 3d side where there's
GL (and now vulkan) and that's it.

So you'll end up with tons of glue code everywhere anyway. Adding yet
another kernel uapi doesn't help, but forcing it to be generic will make
sure it's inefficient. Which means someone else then will create another
one.

> > And please don't be discouraged here, I just want to set clear expectations
> > to avoid disappointment. Supporting blitter hardware is obviously a good
> > idea, and I think the drm subsystem is the right place for that
> > (especially if you have a display block or sometimes a real gpu connected
> > to that blitter).
> 
> Okay, where else should we put it ? Invent an entirely new device for
> that ?

If the blitter is always attached to the display block just add a few gem
based ioctls there (like with desktop gpus) for submitting blit workloads.
Otherwise new driver I guess.
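
As a rough idea of what such a gem based, driver-private blit request could
carry, here is a sketch. Everything in it is hypothetical: the "mydrv" name,
the field layout and the validation helper are made up for illustration and
do not describe any existing driver. It only shows the usual uapi habits of
fixed-size fields, explicit padding and rejecting unknown flags.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical argument struct for a driver-private blit ioctl. */
struct mydrv_blit_req {
    uint32_t src_handle;   /* GEM handle of the source BO */
    uint32_t dst_handle;   /* GEM handle of the destination BO */
    uint32_t src_x, src_y; /* top-left corner in the source, in pixels */
    uint32_t dst_x, dst_y; /* top-left corner in the destination */
    uint32_t width;        /* rectangle size in pixels */
    uint32_t height;
    uint32_t flags;        /* none defined yet, must be zero */
    uint32_t pad;          /* explicit padding, must be zero */
};

/* Checks a driver would typically do before queueing the job.
 * The buffer dimensions are passed in, since looking up the BOs
 * from their handles is outside the scope of this sketch. */
static int mydrv_blit_validate(const struct mydrv_blit_req *req,
                               uint32_t src_w, uint32_t src_h,
                               uint32_t dst_w, uint32_t dst_h)
{
    /* Unknown flags or non-zero padding are rejected so the fields
     * can be given a meaning later without breaking old userspace. */
    if (req->flags || req->pad)
        return -EINVAL;
    if (!req->width || !req->height)
        return -EINVAL;
    /* 64-bit arithmetic so x + width cannot wrap around. */
    if ((uint64_t)req->src_x + req->width > src_w ||
        (uint64_t)req->src_y + req->height > src_h ||
        (uint64_t)req->dst_x + req->width > dst_w ||
        (uint64_t)req->dst_y + req->height > dst_h)
        return -EINVAL;
    return 0;
}
```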

Either case it'll probably be a bit more painful than a kms driver, since
on the gem side the helpers aren't that full-featured (yet).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-04  7:50       ` Daniel Vetter
@ 2016-08-04 10:09         ` Daniel Stone
  2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
  1 sibling, 0 replies; 17+ messages in thread
From: Daniel Stone @ 2016-08-04 10:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri devel, Enrico Weigelt, metux IT consult, Marek Szyprowski

Hi,

On 4 August 2016 at 08:50, Daniel Vetter <daniel@ffwll.ch> wrote:
> One problem with 2d blitters is that there's no common userspace
> interface, but many: Xrender, hwc, old X drawing api, various attempts by
> khronos to standardize something, cairo, ... It's probably worse than
> video decoding even, and definitely not like on the 3d side where there's
> GL (and now vulkan) and that's it.

Running with the same theme, a unified API would only be meaningfully
useful if you have unified userspace support. As soon as you hit the
usual issues of needing to blit to/from special buffer types, weird
format restrictions, chained operations which can affect performance
enough to make you avoid or heavily favour certain types of
operations, etc etc, you'll need separate userspace code to handle
them. And at that point, sticking it behind a unified API doesn't
really bring any value.

Other prior art you could look at is the Renesas VSP1/VSP2 hardware,
which works through V4L2 and its media controller.

Cheers,
Daniel

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-04  7:50       ` Daniel Vetter
  2016-08-04 10:09         ` Daniel Stone
@ 2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
  2016-08-05  4:37           ` Enrico Weigelt, metux IT consult
  2016-08-05  7:47           ` Daniel Vetter
  1 sibling, 2 replies; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-04 23:16 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri devel, Marek Szyprowski

On 04.08.2016 09:50, Daniel Vetter wrote:

Hi,

> One problem with 2d blitters is that there's no common userspace
> interface, but many: Xrender, hwc, old X drawing api, various attempts by
> khronos to standardize something, cairo, ... 

We're talking about userland APIs, not kernel->userland interfaces.
For userland APIs, I'm right now primarily interested in cairo
(using it for my tiny widget toolkit) ... but I'm also thinking about
putting X on top of something cairo-like some day - or making gallium
that layer.

> It's probably worse than video decoding even, and definitely not like
> on the 3d side where there's GL (and now vulkan) and that's it.

On video side we have v4l for the kernel interface and gst as userland
framework ... looks like a good compromise to me.

> So you'll end up with tons of glue code everywhere anyway.

Actually, I'd like to get the glue code smaller. Putting both cairo
and X onto the common driver base (something that's somewhere between
xorg video drivers and cairo surface backends) seems a good way to go,
even though there'll be a lot of work to do for that.

> Adding yet another kernel uapi doesn't help, but forcing it to be generic
> will make sure it's inefficient. Which means someone else then will
> create another one.

hmm, I'm not yet convinced that it necessarily will be inefficient.

To clarify the scope: I'm talking only about _dedicated_ units, which
are completely orthogonal to complex gpus (basically, just specialized
dma controllers).

I personally don't care so much whether it's in DRM, V4L or whatever.
DRM just seemed to be a good place to me.

By the way: as the number of such controllers increases, for dozens
of different things, eg. IO, crypto, etc., and in many cases they're
able to directly access the same memory, I got the feeling that we
should generalize gems even further, so that they could be any kind
of buffer that may be passed to any kind of device. (hmm, reminds me
of some ancient mainframe concepts).

> If the blitter is always attached to the display block just add a few gem
> based ioctls there (like with desktop gpus) for submitting blit workloads.
> Otherwise new driver I guess.

hmm, can I use gems outside DRM?
e.g. would it be possible to write a storage controller driver that
directly accesses some gem (e.g. let the controller write out a
gem object)?


--mtx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
@ 2016-08-05  4:37           ` Enrico Weigelt, metux IT consult
  2016-08-05  7:49             ` Daniel Vetter
  2016-08-05  7:47           ` Daniel Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2016-08-05  4:37 UTC (permalink / raw)
  To: dri-devel

On 05.08.2016 01:16, Enrico Weigelt, metux IT consult wrote:

<snip>
Seems I've been on a completely wrong path - what I'm looking
for is dma-buf. So my idea now goes like this:

* add a new 'virtual GPU' as render node.
* the basic operations are:
  -> create a virtual dumb framebuffer (just inside system memory),
  -> import dma-buf's as bo's
  -> blitting between bo's using dma-engine

That way, everything should be cleanly separated.

As the application needs to be aware of that buffer-and-blit approach
anyways (IOW: allocate two BO's and trigger the blitting when it is done
rendering), the extra glue needed for opening and talking to the
render node should be quite minimal.
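
A minimal sketch of the buffer-and-blit idea above, written as a pure
software model: the two "BOs" are plain memory buffers and the copy is the
memcpy() fallback, not a dma-engine transfer. The names sw_bo, sw_rect and
sw_blit are hypothetical and do not correspond to any existing DRM
interface; the sketch also assumes identical pixel formats on both sides,
so no conversion is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a buffer object: linear pixel memory. */
struct sw_bo {
	uint8_t *data;    /* first byte of the pixel data */
	uint32_t width;   /* visible pixels per line */
	uint32_t height;  /* number of lines */
	uint32_t stride;  /* bytes per line, at least width * cpp */
	uint32_t cpp;     /* bytes per pixel */
};

/* Hypothetical damage rectangle, in pixels. */
struct sw_rect {
	uint32_t x, y, w, h;
};

/* Returns 1 if the rectangle lies completely inside the buffer. */
static int sw_rect_fits(const struct sw_bo *bo, struct sw_rect r)
{
	return r.x <= bo->width && r.w <= bo->width - r.x &&
	       r.y <= bo->height && r.h <= bo->height - r.y;
}

/*
 * Copy one rectangle from src to dst at the same coordinates, line by
 * line. Returns 0 on success, -1 if the request is rejected.
 */
int sw_blit(struct sw_bo *dst, const struct sw_bo *src, struct sw_rect r)
{
	uint32_t line;

	assert(src->cpp > 0 && src->stride >= src->width * src->cpp);
	assert(dst->cpp > 0 && dst->stride >= dst->width * dst->cpp);

	if (dst->cpp != src->cpp)
		return -1; /* assumption: no pixel format conversion */
	if (!sw_rect_fits(src, r) || !sw_rect_fits(dst, r))
		return -1;

	for (line = 0; line < r.h; line++) {
		size_t soff = (size_t)(r.y + line) * src->stride +
			      (size_t)r.x * src->cpp;
		size_t doff = (size_t)(r.y + line) * dst->stride +
			      (size_t)r.x * dst->cpp;

		memcpy(dst->data + doff, src->data + soff,
		       (size_t)r.w * src->cpp);
	}
	return 0;
}
```

The application would render into the source buffer and call sw_blit()
once per changed region; a real driver would replace the memcpy() loop
with whatever the hardware offers.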


--mtx

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
  2016-08-05  4:37           ` Enrico Weigelt, metux IT consult
@ 2016-08-05  7:47           ` Daniel Vetter
  1 sibling, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2016-08-05  7:47 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri devel, Marek Szyprowski

On Fri, Aug 05, 2016 at 01:16:55AM +0200, Enrico Weigelt, metux IT consult wrote:
> On 04.08.2016 09:50, Daniel Vetter wrote:
> 
> Hi,
> 
> > One problem with 2d blitters is that there's no common userspace
> > interface, but many: Xrender, hwc, old X drawing api, various attempts by
> > khronos to standardize something, cairo, ... 
> 
> We're talking about userland APIs, not kernel->userland interfaces.
> For userland APIs, I'm right now primarily interested in cairo
> (using it for my tiny widget toolkit) ... but I'm also thinking about
> putting X on top of something cairo-like some day - or making gallium
> that layer.
> 
> > It's probably worse than video decoding even, and definitely not like
> > on the 3d side where there's GL (and now vulkan) and that's it.
> 
> On video side we have v4l for the kernel interface and gst as userland
> framework ... looks like a good compromise to me.
> 
> > So you'll end up with tons of glue code everywhere anyway.
> 
> Actually, I'd like to get the glue code smaller. Putting both cairo
> and X onto the common driver base (something that's somewhere between
> xorg video drivers and cairo surface backends) seems a good way to go,
> even though there'll be a lot of work to do for that.
> 
> > Adding yet another kernel uapi doesn't help, but forcing it to be generic
> > will make sure it's inefficient. Which means someone else then will
> > create another one.
> 
> hmm, I'm not yet convinced that it necessarily will be inefficient.
> 
> To clarify the scope: I'm talking only about _dedicated_ units, which
> are completely orthogonal to complex gpus (basically, just specialized
> dma controllers).
> 
> I personally don't care so much whether it's in DRM, V4L or whatever.
> DRM just seemed to be a good place to me.
> 
> By the way: as the number of such controllers increases, for dozens
> of different things, eg. IO, crypto, etc., and in many cases they're
> able to directly access the same memory, I got the feeling that we
> should generalize gems even further, so that they could be any kind
> of buffer that may be passed to any kind of device. (hmm, reminds me
> of some ancient mainframe concepts).
> 
> > If the blitter is always attached to the display block just add a few gem
> > based ioctls there (like with desktop gpus) for submitting blit workloads.
> > Otherwise new driver I guess.
> 
> hmm, can I use gems outside DRM?
> e.g. would it be possible to write a storage controller driver that
> directly accesses some gem (e.g. let the controller write out a
> gem object)?

Of course. In drm you can export/import gem buffers to/from dma-buf. See

https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#prime-buffer-sharing

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: hardware accelerated bitblt using dma engine
  2016-08-05  4:37           ` Enrico Weigelt, metux IT consult
@ 2016-08-05  7:49             ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2016-08-05  7:49 UTC (permalink / raw)
  To: Enrico Weigelt, metux IT consult; +Cc: dri-devel

On Fri, Aug 05, 2016 at 06:37:26AM +0200, Enrico Weigelt, metux IT consult wrote:
> On 05.08.2016 01:16, Enrico Weigelt, metux IT consult wrote:
> 
> <snip>
> Seems I've been on a completely wrong path - what I'm looking
> for is dma-buf. So my idea now goes like this:
> 
> * add a new 'virtual GPU' as render node.
> * the basic operations are:
>   -> create a virtual dumb framebuffer (just inside system memory),
>   -> import dma-buf's as bo's
>   -> blitting between bo's using dma-engine
> 
> That way, everything should be cleanly separated.
> 
> As the application needs to be aware of that buffer-and-blit approach
> anyways (IOW: allocate two BO's and trigger the blitting when it is done
> rendering), the extra glue needed for opening and talking to the
> render node should be quite minimal.

Yup, this is pretty much what I've been suggesting ;-) The other bit is
that please don't try to make the IOCTL/uapi interfaces generic, it will
hurt. Of course if there's a pile of IP (from the same vendor or whatever)
that all works similarly then sure, shared driver makes sense. But pretty
soon it doesn't (usually right when you want to have something closer to
direct submission to hardware with relocations).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-08-05  7:49 UTC | newest]

Thread overview: 17+ messages
2016-08-02 13:21 RFC: hardware accelerated bitblt using dma engine Enrico Weigelt, metux IT consult
2016-08-02 14:04 ` Daniel Vetter
2016-08-02 21:43   ` Enrico Weigelt, metux IT consult
2016-08-02 23:12     ` Rob Clark
2016-08-03  3:33       ` Enrico Weigelt, metux IT consult
2016-08-03  3:47         ` Dave Airlie
2016-08-03  4:39           ` Enrico Weigelt, metux IT consult
2016-08-03  9:24 ` Marek Szyprowski
2016-08-03 11:47   ` Daniel Vetter
2016-08-03 23:32     ` Enrico Weigelt, metux IT consult
2016-08-04  7:50       ` Daniel Vetter
2016-08-04 10:09         ` Daniel Stone
2016-08-04 23:16         ` Enrico Weigelt, metux IT consult
2016-08-05  4:37           ` Enrico Weigelt, metux IT consult
2016-08-05  7:49             ` Daniel Vetter
2016-08-05  7:47           ` Daniel Vetter
2016-08-03 23:19   ` Enrico Weigelt, metux IT consult
