* Fence, timeline and android sync points
@ 2014-08-12 22:13 Jerome Glisse
  2014-08-13  1:23 ` Jerome Glisse
                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-12 22:13 UTC (permalink / raw)
  To: dri-devel, maarten.lankhorst; +Cc: daniel.vetter, bskeggs

Hi,

So I went over the whole fence and sync point stuff as it's becoming a pressing
issue. I think we first need to agree on what the problem is that we want to solve
and what the requirements are to solve it.

Problem:
  Explicit synchronization between different hardware blocks over a buffer object.

Requirements:
  Share common infrastructure.
  Allow optimal hardware command stream scheduling across hardware blocks.
  Allow Android sync points to be implemented on top of it.
  Handle/acknowledge exceptions (like the good old GPU lockup).
  Minimize driver changes.

Glossary:
  hardware timeline: timeline bound to a specific hardware block.
  pipeline timeline: timeline bound to a userspace rendering pipeline; each
                     point on that timeline can be a composite of several
                     different hardware timeline points.
  pipeline: abstract object representing a userspace application's graphics
            pipeline, i.e. each of the application's graphics operations.
  fence: specific point in a timeline where synchronization needs to happen.


So now, the current include/linux/fence.h implementation is, I believe, missing the
objective by confusing hardware and pipeline timelines and by bolting fences to
buffer objects, while what is really needed is a true and proper timeline for both
hardware and pipeline. But before going further down that road let me look at
things and explain how I see them.

The current ttm fence has one and only one purpose: allow synchronization for buffer
object moves, even though some drivers like radeon slightly abuse it and use it
for things like lockup detection.

The new fence wants to expose an API that would allow some implementation of a
timeline. For that it introduces callbacks and some hard requirements on what the
driver has to expose:
  enable_signaling
  [signaled]
  wait
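
For reference, here is a rough sketch of what that driver-facing contract looks
like; this is simplified and from my reading of the proposal, not the exact
header:

struct fence;

struct fence_ops {
  /* arm whatever irq/mechanism is needed so the fence eventually signals */
  bool (*enable_signaling)(struct fence *fence);
  /* optional: peek whether the fence has already signaled */
  bool (*signaled)(struct fence *fence);
  /* block until the fence signals or the timeout expires */
  long (*wait)(struct fence *fence, bool intr, signed long timeout);
};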

Each of those has to do work inside the driver to which the fence belongs, and
each of those can be called from more or less unexpected contexts (with restrictions
like outside irq). So we end up with things like:

 Process 1              Process 2                   Process 3
 I_A_schedule(fence0)
                        CI_A_F_B_signaled(fence0)
                                                    I_A_signal(fence0)
                                                    CI_B_F_A_callback(fence0)
                        CI_A_F_B_wait(fence0)
Legend:
I_x      in driver x (I_A == in driver A)
CI_x_F_y call in driver x from driver y (CI_A_F_B == call in driver A from driver B)

So this is a happy mess where everyone calls everyone, and it is bound to get messy.
Yes, I know there are all kinds of requirements on what happens once a fence is
signaled. But those requirements only look like they are trying to mitigate whatever
mess can result from the whole callback dance.

While I too was seduced by the whole callback idea a long time ago, I think it is
a highly dangerous path to take, where the combinatorics of what could happen are
bound to explode with the increase in the number of players.


So now back to how to solve the problem we are trying to address. First I want
to make an observation: almost all GPUs that exist today have a command ring
onto which userspace command buffers are executed, and inside the command ring
you can do something like:

  if (condition) execute_command_buffer else skip_command_buffer

where condition is a simple expression (memory_address cop value), with cop one
of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption
that any GPU that remotely matters can do that. Those that cannot should fix
their command ring processor.
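
To make the semantics concrete, here is a CPU-side sketch of the check such a
conditional packet performs when the ring processor reaches it (names and types
are mine, purely illustrative):

/* A COND_EXEC-style packet boils down to: read a value from memory (e.g. the
 * last signaled fence seqno another hw block wrote) and compare it against an
 * immediate; run the following command buffer only if the comparison holds. */
enum cond_op { COND_EQ, COND_LT, COND_GT, COND_LE, COND_GE };

static bool cond_exec_check(const volatile uint64_t *memory_address,
                            enum cond_op cop, uint64_t value)
{
  uint64_t v = *memory_address;

  switch (cop) {
  case COND_EQ: return v == value;
  case COND_LT: return v <  value;
  case COND_GT: return v >  value;
  case COND_LE: return v <= value;
  case COND_GE: return v >= value;
  }
  return false;
}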


With that in mind, I think the proper solution is implementing timelines and having
fences be timeline objects with a way simpler API. For each hardware timeline the
driver provides a system memory address at which the latest signaled fence
sequence number can be read. Each fence object is uniquely associated with
both a hardware and a pipeline timeline. Each pipeline timeline has a wait
queue.
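
A minimal sketch of what those two timeline objects could look like (field names
are mine and purely illustrative):

struct hw_timeline {
  const char        *name;          /* e.g. "gfx", "dma", "vce" */
  volatile uint64_t *signaled_seq;  /* system memory word the hw block writes
                                     * the last signaled sequence number into */
  uint64_t           next_seq;      /* next sequence number to emit */
  struct list_head   fences;        /* pending fences, in seq order */
  spinlock_t         lock;
};

struct pipeline_timeline {
  struct list_head   fences;        /* pending fences on this pipeline */
  wait_queue_head_t  wait_queue;    /* woken on signal or on lockup */
  struct mutex       lock;
};

/* lockless check: has the hw block reached this sequence number yet ? */
static inline bool hw_timeline_signaled(struct hw_timeline *hwt, uint64_t seq)
{
  return *hwt->signaled_seq >= seq;
}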

When scheduling something that requires synchronization on a hardware timeline,
a fence is created and associated with both the pipeline timeline and the hardware
timeline. Other hardware blocks that need to wait on a fence can use their
command ring conditional execution to directly check the fence sequence number
from the other hw block, so you get optimistic scheduling. If optimistic scheduling
fails (which would be reported by a hw-block-specific solution and hidden), then
things can fall back to a software CPU wait inside what could be considered the
kernel thread of the pipeline timeline.
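
A sketch of that scheduling decision, with made-up helpers (can_cond_exec,
emit_cond_exec() and submit_job() are all hypothetical):

static int schedule_with_dependency(struct hw_timeline *hwt,
                                    struct fence *dep, struct cs_job *job)
{
  int ret;

  if (hwt->can_cond_exec) {
    /* optimistic: let the ring check the dependency's seqno directly from
     * the other hw block's signaled_seq memory word */
    emit_cond_exec(job, dep->hw_timeline->signaled_seq, COND_GE, dep->seq_num);
    return submit_job(hwt, job);
  }

  /* fallback: cpu wait from the pipeline timeline context, then submit;
   * a lockup wakes the queue up too and gets reported separately */
  ret = wait_event_interruptible(dep->pipetimeline->wait_queue,
                                 hw_timeline_signaled(dep->hw_timeline,
                                                      dep->seq_num));
  if (ret)
    return ret;
  return submit_job(hwt, job);
}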


From an API point of view there are no inter-driver calls. All the driver needs to
do is wake up the pipeline timeline wait_queue when things are signaled or
when things go sideways (gpu lockup).


So how do we implement that with current drivers? Well, it's easy. Currently we
assume implicit synchronization, so all we need is an implicit pipeline timeline per
userspace process (note this does not prevent inter-process synchronization).
Every time a command buffer is submitted it is added to the implicit timeline
with the simple fence object:

struct fence {
  struct list_head   list_hwtimeline;    /* link on the hw timeline */
  struct list_head   list_pipetimeline;  /* link on the pipeline timeline */
  struct hw_timeline *hw_timeline;       /* hw timeline this fence belongs to */
  uint64_t           seq_num;            /* sequence number on that hw timeline */
  work_t             timedout_work;      /* timeout work for the sw fallback */
  void               *csdata;            /* command stream to run once deps signal */
};

So with a set of helper functions called by each driver's command execution
ioctl you have an implicit timeline that is properly populated, and each
driver command execution gets its dependencies from the implicit timeline.


Of course, to take full advantage of all the flexibility this could offer, we
would need to allow userspace to create pipeline timelines and to schedule
against the pipeline timeline of its choice. We could create a file for
each pipeline timeline and have file operations to wait on / query
progress.
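
Sketch of what such a per-timeline file could expose (everything here is
hypothetical, just to make the idea concrete):

static unsigned int ptimeline_poll(struct file *filp, poll_table *wait)
{
  struct pipeline_timeline *pt = filp->private_data;

  poll_wait(filp, &pt->wait_queue, wait);
  /* readable once every fence scheduled so far has signaled */
  return pipeline_timeline_idle(pt) ? (POLLIN | POLLRDNORM) : 0;
}

static long ptimeline_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
  struct pipeline_timeline *pt = filp->private_data;

  switch (cmd) {
  case PTIMELINE_IOC_WAIT:   /* block until a given point is reached */
  case PTIMELINE_IOC_QUERY:  /* report the last point reached */
    /* ... */
    return 0;
  default:
    return -ENOTTY;
  }
}

static const struct file_operations ptimeline_fops = {
  .owner          = THIS_MODULE,
  .poll           = ptimeline_poll,
  .unlocked_ioctl = ptimeline_ioctl,
};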

Note that gpu lockups are considered exceptional events; the implicit
timeline will probably want to continue with other jobs on other hardware
blocks, but an explicit one will probably want to decide whether to continue,
abort, or retry without the faulty hw block.


I realize I am late to the party and that I should have taken a serious
look at all this a long time ago. I apologize for that, and if you consider
this too late then just ignore me, modulo the big warning about the craziness
that callbacks will introduce and how bad things are bound to happen. I am not
saying that bad things cannot happen with what I propose, just that,
because everything happens inside the process context that is the one
asking/requiring synchronization, there will be no inter-process kernel
callback (a callback that was registered by one process and that is called
inside another process' time slice because fence signaling is happening
inside this other process' time slice).


Pseudo code for explicitness:

int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
{
   struct fence *dependency[16], *fence;
   int m;

   /* Gather dependencies from the implicit pipeline timeline and get a new
    * fence for this submission on the device hw timeline. */
   m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
                         dependency, 16, &fence);
   if (m < 0)
     return m;
   if (m >= 16) {
       // more than 16 dependencies: allocate an array of size m and retry
   }
   return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
                        dependency, fence);
}

// Returns the number of dependencies found (possibly larger than mdep, in
// which case the caller must retry with a bigger array) or a negative error
// code; on success *fence is the new fence for this submission.
int timeline_schedule(ptimeline, hwtimeline, dependency, mdep, **fence)
{
   // allocate fence, set hw_timeline and init work
   // build up the list of dependencies by looking at the list of pending
   // fences in the pipeline timeline
}



// If the device driver schedules a job hoping for all dependencies to be
// signaled in time, it must also call this function, with csdata being a copy
// of what needs to be executed once all dependencies are signaled.
void timeline_missed_schedule(timeline, fence, void *csdata)
{
   INIT_DELAYED_WORK(&fence->timedout_work, timeline_missed_schedule_worker)
   fence->csdata = csdata;
   schedule_delayed_work(&fence->timedout_work, default_timeout)
}

void timeline_missed_schedule_worker(work)
{
   fence = container_of(work, struct fence, timedout_work)
   driver = driver_from_fence_hwtimeline(fence)

   // Make sure that each of the hwtimeline dependencies will fire an irq by
   // calling a driver function, then wait for them from this worker context.
   timeline_wait_for_fence_dependency(fence);
   driver->execute_cs(driver, fence);
}

// This function is called by the driver code that signals fences (it can be
// called from interrupt context). It is the responsibility of the device
// driver to call this function.
void timeline_signal(hwtimeline)
{
  for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
    wakeup(fence->pipetimeline->wait_queue);
  }
}
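
For completeness, a sketch of the only per-driver glue needed on the signaling
side (again just illustrative):

/* Driver fence interrupt: the hw block has already written the new sequence
 * number into hw_timeline->signaled_seq before raising the irq, so all that
 * is left to do is wake up whoever waits on the pipeline timelines. */
irqreturn_t driver_fence_irq(int irq, void *data)
{
  struct hw_timeline *hwtimeline = data;

  timeline_signal(hwtimeline);
  return IRQ_HANDLED;
}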


Cheers,
Jérôme


* Re: Fence, timeline and android sync points
  2014-08-12 22:13 Fence, timeline and android sync points Jerome Glisse
@ 2014-08-13  1:23 ` Jerome Glisse
  2014-08-13  7:59 ` Christian König
  2014-08-13  8:28 ` Daniel Vetter
  2 siblings, 0 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-13  1:23 UTC (permalink / raw)
  To: dri-devel, maarten.lankhorst; +Cc: daniel.vetter, bskeggs

On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:

Btw, as an extra note: because of the implicit timeline, any shared object
scheduled on a hw timeline must add a fence to all the implicit timelines where
this object exists.

Also there is no need to have a fence pointer per object.
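
In pseudo code, submission of a shared object would do something like this
(for_each_importer() and the helper name are made up):

/* on submit: a shared buffer object must publish the new fence on the
 * implicit pipeline timeline of every process that imported the object */
for_each_importer(bo, proc)
  pipeline_timeline_add_fence(proc->implicit_pipeline, fence);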



* Re: Fence, timeline and android sync points
  2014-08-12 22:13 Fence, timeline and android sync points Jerome Glisse
  2014-08-13  1:23 ` Jerome Glisse
@ 2014-08-13  7:59 ` Christian König
  2014-08-13 13:41   ` Jerome Glisse
  2014-08-13  8:28 ` Daniel Vetter
  2 siblings, 1 reply; 39+ messages in thread
From: Christian König @ 2014-08-13  7:59 UTC (permalink / raw)
  To: Jerome Glisse, dri-devel, maarten.lankhorst; +Cc: daniel.vetter, bskeggs

Hi Jerome,

first of all, that finally sounds like somebody is starting to draw the whole 
picture for me.

So far all I have seen was a bunch of specialized requirements and some 
not so obvious design decisions based on those requirements.

So thanks a lot for finally summarizing the requirements from a top-down 
view; I perfectly agree with your analysis of the current fence design 
and the downsides of that API.

Apart from that I also have some comments / requirements that hopefully 
can be taken into account as well:

>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                       point on that timeline can be a composite of several
>                       different hardware pipeline point.
>    pipeline: abstract object representing userspace application graphic pipeline
>              of each of the application graphic operations.
In the long term a requirement for the driver for AMD GFX hardware is 
that, instead of a fixed pipeline timeline, we need a somewhat more flexible 
model where concurrent execution on different hardware engines is 
possible as well.

So the requirement is that you can do things like submitting a 3D job A, 
a DMA job B, a VCE job C and another 3D job D that are executed like this:
     A
    / \
   B   C
    \ /
     D

(Let's just hope that looks as good on your mail client as it looked for 
me).

My current thinking is that we avoid having a pipeline object in the 
kernel and instead let userspace specify which fence we want to 
synchronize to explicitly, as long as everything stays within the same 
client. As soon as any buffer is shared between clients, the kernel 
would need to fall back to implicit synchronization to allow backward 
compatibility with DRI2/3.
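
To make that concrete with the diamond above, here is a sketch with a
completely made-up explicit-sync submit interface, where each job simply lists
the fences it depends on:

struct fence *fence_a, *fence_b, *fence_c, *fence_d;
struct fence *deps_bc[1], *deps_d[2];

fence_a = submit(RING_GFX, job_a, NULL, 0);

deps_bc[0] = fence_a;
fence_b = submit(RING_DMA, job_b, deps_bc, 1);
fence_c = submit(RING_VCE, job_c, deps_bc, 1);

deps_d[0] = fence_b;
deps_d[1] = fence_c;
fence_d = submit(RING_GFX, job_d, deps_d, 2);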

>    if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value)) with cop one
> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matter can do that. Those who can not should fix
> there command ring processor.
At least for some engines on AMD hardware that isn't possible (UVD, VCE 
and to some extent DMA as well), but I don't see any reason why we 
shouldn't be able to use software-based scheduling on those engines by 
default. So this isn't really a problem, just an additional comment 
to keep in mind.

Regards,
Christian.



* Re: Fence, timeline and android sync points
  2014-08-12 22:13 Fence, timeline and android sync points Jerome Glisse
  2014-08-13  1:23 ` Jerome Glisse
  2014-08-13  7:59 ` Christian König
@ 2014-08-13  8:28 ` Daniel Vetter
  2014-08-13 13:36   ` Jerome Glisse
  2 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-13  8:28 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: daniel.vetter, dri-devel, bskeggs

On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> Hi,
> 
> So i want over the whole fence and sync point stuff as it's becoming a pressing
> issue. I think we first need to agree on what is the problem we want to solve
> and what would be the requirements to solve it.
> 
> Problem :
>   Explicit synchronization btw different hardware block over a buffer object.
> 
> Requirements :
>   Share common infrastructure.
>   Allow optimal hardware command stream scheduling accross hardware block.
>   Allow android sync point to be implemented on top of it.
>   Handle/acknowledge exception (like good old gpu lockup).
>   Minimize driver changes.
> 
> Glossary :
>   hardware timeline: timeline bound to a specific hardware block.
>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
>                      point on that timeline can be a composite of several
>                      different hardware pipeline point.
>   pipeline: abstract object representing userspace application graphic pipeline
>             of each of the application graphic operations.
>   fence: specific point in a timeline where synchronization needs to happen.
> 
> 
> So now, current include/linux/fence.h implementation is i believe missing the
> objective by confusing hardware and pipeline timeline and by bolting fence to
> buffer object while what is really needed is true and proper timeline for both
> hardware and pipeline. But before going further down that road let me look at
> things and explain how i see them.

fences can be used free-standing and no one forces you to integrate them
with buffers. We actually plan to go this way with the intel svm stuff.
Ofc for dma-buf the plan is to synchronize using such fences, but that's
somewhat orthogonal I think. At least you only talk about fences and
timeline and not dma-buf here.
 
> Current ttm fence have one and a sole purpose, allow synchronization for buffer
> object move even thought some driver like radeon slightly abuse it and use them
> for things like lockup detection.
> 
> The new fence want to expose an api that would allow some implementation of a
> timeline. For that it introduces callback and some hard requirement on what the
> driver have to expose :
>   enable_signaling
>   [signaled]
>   wait
> 
> Each of those have to do work inside the driver to which the fence belongs and
> each of those can be call more or less from unexpected (with restriction like
> outside irq) context. So we end up with thing like :
> 
>  Process 1              Process 2                   Process 3
>  I_A_schedule(fence0)
>                         CI_A_F_B_signaled(fence0)
>                                                     I_A_signal(fence0)
>                                                     CI_B_F_A_callback(fence0)
>                         CI_A_F_B_wait(fence0)
> Lexique:
> I_x  in driver x (I_A == in driver A)
> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> 
> So this is an happy mess everyone call everyone and this bound to get messy.
> Yes i know there is all kind of requirement on what happen once a fence is
> signaled. But those requirement only looks like they are trying to atone any
> mess that can happen from the whole callback dance.
> 
> While i was too seduced by the whole callback idea long time ago, i think it is
> a highly dangerous path to take where the combinatorial of what could happen
> are bound to explode with the increase in the number of players.
> 
> 
> So now back to how to solve the problem we are trying to address. First i want
> to make an observation, almost all GPU that exist today have a command ring
> on to which userspace command buffer are executed and inside the command ring
> you can do something like :
> 
>   if (condition) execute_command_buffer else skip_command_buffer
> 
> where condition is a simple expression (memory_address cop value)) with cop one
> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> that any gpu that slightly matter can do that. Those who can not should fix
> there command ring processor.
> 
> 
> With that in mind, i think proper solution is implementing timeline and having
> fence be a timeline object with a way simpler api. For each hardware timeline
> driver provide a system memory address at which the lastest signaled fence
> sequence number can be read. Each fence object is uniquely associated with
> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> queue.
> 
> When scheduling something that require synchronization on a hardware timeline
> a fence is created and associated with the pipeline timeline and hardware
> timeline. Other hardware block that need to wait on a fence can use there
> command ring conditional execution to directly check the fence sequence from
> the other hw block so you do optimistic scheduling. If optimistic scheduling
> fails (which would be reported by hw block specific solution and hidden) then
> things can fallback to software cpu wait inside what could be considered the
> kernel thread of the pipeline timeline.
> 
> 
> From api point of view there is no inter-driver call. All the driver needs to
> do is wakeup the pipeline timeline wait_queue when things are signaled or
> when things go sideway (gpu lockup).
> 
> 
> So how to implement that with current driver ? Well easy. Currently we assume
> implicit synchronization so all we need is an implicit pipeline timeline per
> userspace process (note this do not prevent inter process synchronization).
> Everytime a command buffer is submitted it is added to the implicit timeline
> with the simple fence object :
> 
> struct fence {
>   struct list_head   list_hwtimeline;
>   struct list_head   list_pipetimeline;
>   struct hw_timeline *hw_timeline;
>   uint64_t           seq_num;
>   work_t             timedout_work;
>   void               *csdata;
> };
> 
> So with set of helper function call by each of the driver command execution
> ioctl you have the implicit timeline that is properly populated and each
> dirver command execution get the dependency from the implicit timeline.
> 
> 
> Of course to take full advantages of all flexibilities this could offer we
> would need to allow userspace to create pipeline timeline and to schedule
> against the pipeline timeline of there choice. We could create file for
> each of the pipeline timeline and have file operation to wait/query
> progress.
> 
> Note that the gpu lockup are considered exceptional event, the implicit
> timeline will probably want to continue on other job on other hardware
> block but the explicit one probably will want to decide wether to continue
> or abort or retry without the fault hw block.
> 
> 
> I realize i am late to the party and that i should have taken a serious
> look at all this long time ago. I apologize for that and if you consider
> this is to late then just ignore me modulo the big warning the crazyness
> that callback will introduce an how bad things bound to happen. I am not
> saying that bad things can not happen with what i propose just that
> because everything happen inside the process context that is the one
> asking/requiring synchronization there will be not interprocess kernel
> callback (a callback that was registered by one process and that is call
> inside another process time slice because fence signaling is happening
> inside this other process time slice).

So I read through it all and, presuming I understand it correctly, your
proposal and what we currently have is about the same. The big difference
is that you make a timeline a first-class object and move the callback
queue from the fence to the timeline, which requires callers to check the
fence/seqno/whatever themselves instead of pushing that responsibility to
the fence code.

If you actually mandate that the fence is just a seqno or similar which
can be read locklessly, then I could register my own special callback into
that waitqueue (stuff other than waking up threads is allowed), check the
seqno from hard-irq context, and re-add my own callback if that hasn't
happened yet (that needs to go through some other context for hilarity).
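
Roughly something like this, sketched against the current waitqueue API
(my_waiter and my_kick are of course made up):

struct my_waiter {
  wait_queue_t       wait;          /* initialized with .func = my_wake_cb */
  volatile uint64_t *signaled_seq;  /* the lockless-readable seqno */
  uint64_t           seq;           /* the value we are waiting for */
};

/* runs in whatever context wake_up() is called from, possibly hard-irq */
static int my_wake_cb(wait_queue_t *wait, unsigned mode, int flags, void *key)
{
  struct my_waiter *w = container_of(wait, struct my_waiter, wait);

  if (*w->signaled_seq >= w->seq) {
    list_del_init(&wait->task_list);  /* done, drop off the queue */
    my_kick(w);                       /* e.g. schedule a worker */
  }
  /* otherwise stay on the queue and recheck on the next wake_up */
  return 0;
}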

So from that pov (presuming I didn't miss anything) your proposal is
identical to what we have, minus some different color choices (like where
to place the callback queue).

I guess that leaves the topic of how you wait upon other fences, and
the current stuff allows you to do that all from process context like you
propose. And wrt the hardirq context callbacks I'm not worried at all,
since lockdep will yell at any driver writer who gets this wrong, so
debugging this stuff won't be tricky. And you can always reject any such
patches if they concern your own driver.

So the only substantial proposal I can distill here is to make timelines
first-class citizens, and we can always do that later on. We actually
discussed this, but didn't really see a point in having them around.

And yeah you're fairly late to the party, this stuff has been floating
around and been discussed since years ;-)

Cheers, Daniel


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Fence, timeline and android sync points
  2014-08-13  8:28 ` Daniel Vetter
@ 2014-08-13 13:36   ` Jerome Glisse
  2014-08-13 15:54     ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-13 13:36 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: daniel.vetter, dri-devel, bskeggs

On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> 
> So I read through it all and presuming I understand it correctly your
> proposal and what we currently have is about the same. The big difference
> is that you make a timeline a first-class object and move the callback
> queue from the fence to the timeline, which requires callers to check the
> fence/seqno/whatever themselves instead of pushing that responsibility to
> callers.

No, the big difference is that there is no callback; thus when waiting for a
fence you are either inside the process context that needs to wait for it,
or you are inside a kernel thread process context. Which means in both cases
you can do whatever you want. What I hate about the fence code as it is,
is the callback stuff, because you never know in which context fences
are signaled and therefore you never know in which context callbacks are executed.

> 
> If you actually mandate that the fence is just a seqno or similar which
> can be read lockless then I could register my own special callback into
> that waitqueue (other stuff than waking up threads is allowed) and from
> hard-irq context check the seqno and readd my own callback if that's not
> yet happened (that needs to go through some other context for hilarity).

Yes, mandating a simple number that can be read from anywhere without a lock;
I am pretty sure all hw can write to a system page and can write a value
alongside their command buffer. So either your hw supports reading and testing
the value, or you can do it in atomic context right before scheduling.
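
Concretely the test is nothing more than something like this (32-bit variant
with the usual wraparound-safe compare, purely illustrative):

/* the hw block writes its last signaled seqno into a system memory page with
 * a plain post-command write; anyone can test it without taking any lock */
static inline bool seq_signaled(const volatile uint32_t *signaled_seq,
                                uint32_t seq)
{
  return (int32_t)(*signaled_seq - seq) >= 0;
}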

> So from that pov (presuming I didn't miss anything) your proposal is
> identical to what we have, minor some different color choices (like where
> to place the callback queue).

No callback is the mantra here, and instead of bolting free-living fences
to buffer objects, they are associated with timelines, which means you do not
need to go over all buffer objects to know what you need to wait for.

> 
> I guess that leaves the topic of how do you wait upon other fences, and
> the current stuff allows you to do that all from process context like you
> propose. And wrt the hardirq context callbacks I'm not worried at all
> since lockdep will yell at any driver write who gets this wrong, so
> debugging this stuff won't be tricky. And you can always reject any such
> patches if they concern your own driver.
> 
> So the only substantial proposal I can distill here is to make timelines
> first-class citizens, and we can always do that later on. We actually
> discussed this, but didn't really see a point in having them around.

Again, the big difference is no callbacks and no fences bolted to buffer objects.

> And yeah you're fairly late to the party, this stuff has been floating
> around and been discussed since years ;-)
> 
> Cheers, Daniel


* Re: Fence, timeline and android sync points
  2014-08-13  7:59 ` Christian König
@ 2014-08-13 13:41   ` Jerome Glisse
  2014-08-13 14:08     ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-13 13:41 UTC (permalink / raw)
  To: Christian König; +Cc: daniel.vetter, bskeggs, dri-devel

On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
> Hi Jerome,
> 
> first of all that finally sounds like somebody starts to draw the whole
> picture for me.
> 
> So far all I have seen was a bunch of specialized requirements and some not
> so obvious design decisions based on those requirements.
> 
> So thanks a lot for finally summarizing the requirements from a top above
> view and I perfectly agree with your analysis of the current fence design
> and the downsides of that API.
> 
> Apart from that I also have some comments / requirements that hopefully can
> be taken into account as well:
> 
> >   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >                      point on that timeline can be a composite of several
> >                      different hardware pipeline point.
> >   pipeline: abstract object representing userspace application graphic pipeline
> >             of each of the application graphic operations.
> In the long term a requirement for the driver for AMD GFX hardware is that
> instead of a fixed pipeline timeline we need a bit more flexible model where
> concurrent execution on different hardware engines is possible as well.
> 
> So the requirement is that you can do things like submitting a 3D job A, a
> DMA job B, a VCE job C and another 3D job D that are executed like this:
>     A
>    /  \
>   B  C
>    \  /
>     D
> 
> (Let's just hope that looks as good on your mail client as it looked for
> me).

My thinking on hw timelines is that a GPU like AMD's or NVIDIA's would have several
different hw timelines. They are per block/engine, so one for the DMA ring, one for
gfx, one for VCE, ....
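As a minimal sketch of what such a per-engine hardware timeline could look like
(field names here are made up, this is not from any existing patch; it only
assumes, as proposed, that the engine writes its last signaled sequence number
to a system memory location):

struct hw_timeline {
    const char        *name;          // "gfx", "dma", "vce", ...
    volatile uint64_t *signaled_seq;  // system memory written by the engine
    uint64_t           emitted_seq;   // last sequence number scheduled
    struct list_head   fences;        // pending fences, oldest first
    spinlock_t         lock;          // protects the pending fence list
};

Each engine/ring would own exactly one of these, independent of any pipeline
timeline built on top of it.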

 
> My current thinking is that we avoid having a pipeline object in the kernel
> and instead letting userspace specify which fence we want to synchronize to
> explicitly as long as everything stays withing the same client. As soon as
> any buffer is shared between clients the kernel we would need to fall back
> to implicitly synchronization to allow backward compatibility with DRI2/3.

The whole issue is that today the cs ioctl assumes implicit synchronization. That
cannot change, so for now anything that goes through the cs ioctl would need to
use an implicit timeline and have all rings that use a common buffer synchronize
on it. As long as those rings use different buffers there is no need for sync.

Buffer objects are what link hw timelines.

Of course there might be ways to be more flexible if timelines are exposed to
userspace and userspace can create several of them for a single process.
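To make the implicit case concrete, here is a rough sketch (not code from any
existing driver, every name is made up) of how a cs submission helper could turn
shared buffer objects into dependencies, assuming some bookkeeping tracks the
last fence that touched each buffer object:

static unsigned implicit_gather_deps(struct hw_timeline *target,
                                     struct bo **bos, unsigned nbo,
                                     struct fence **deps, unsigned maxdep)
{
    unsigned i, ndep = 0;

    for (i = 0; i < nbo; i++) {
        struct fence *last = bos[i]->last_fence;

        // same ring: ring ordering is enough, no explicit dependency needed
        if (!last || last->hw_timeline == target)
            continue;
        if (ndep < maxdep)
            deps[ndep] = last;
        ndep++;  // caller reallocates and retries if ndep ends up > maxdep
    }
    return ndep;
}

Buffers only ever touched by the submitting ring add nothing, which matches the
rule above that rings using different buffers need no sync.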

> 
> >   if (condition) execute_command_buffer else skip_command_buffer
> >
> >where condition is a simple expression (memory_address cop value)) with cop one
> >of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >that any gpu that slightly matter can do that. Those who can not should fix
> >there command ring processor.
> At least for some engines on AMD hardware that isn't possible (UVD, VCE and
> in some extends DMA as well), but I don't see any reason why we shouldn't be
> able to use software based scheduling on those engines by default. So this
> isn't really a problem, but just an additional comment to keep in mind.

Yes, not everything can do that, but as it is a simple memory access with a simple
comparison it is easy to do on the CPU for limited hardware. Still, this really
sounds like something so easy to add to hw ring execution that it is a shame
hw designers have not already added such a thing.
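As an illustration, assuming (as in the proposal) each hw timeline exposes its
last signaled sequence number at a system memory address, the CPU fallback for
an engine without conditional ring execution is just the same test done before
scheduling (hypothetical helper; the signed difference keeps the comparison
correct across sequence number wrap-around):

static bool hw_seq_signaled(const volatile uint64_t *signaled_seq, uint64_t seq)
{
    return (int64_t)(*signaled_seq - seq) >= 0;
}

A scheduler for such an engine would call this right before queuing a command
buffer and fall back to something like timeline_missed_schedule() when the
dependency has not signaled yet.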

> Regards,
> Christian.
> 
> Am 13.08.2014 um 00:13 schrieb Jerome Glisse:
> >Hi,
> >
> >So i want over the whole fence and sync point stuff as it's becoming a pressing
> >issue. I think we first need to agree on what is the problem we want to solve
> >and what would be the requirements to solve it.
> >
> >Problem :
> >   Explicit synchronization btw different hardware block over a buffer object.
> >
> >Requirements :
> >   Share common infrastructure.
> >   Allow optimal hardware command stream scheduling accross hardware block.
> >   Allow android sync point to be implemented on top of it.
> >   Handle/acknowledge exception (like good old gpu lockup).
> >   Minimize driver changes.
> >
> >Glossary :
> >   hardware timeline: timeline bound to a specific hardware block.
> >   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >                      point on that timeline can be a composite of several
> >                      different hardware pipeline point.
> >   pipeline: abstract object representing userspace application graphic pipeline
> >             of each of the application graphic operations.
> >   fence: specific point in a timeline where synchronization needs to happen.
> >
> >
> >So now, current include/linux/fence.h implementation is i believe missing the
> >objective by confusing hardware and pipeline timeline and by bolting fence to
> >buffer object while what is really needed is true and proper timeline for both
> >hardware and pipeline. But before going further down that road let me look at
> >things and explain how i see them.
> >
> >Current ttm fence have one and a sole purpose, allow synchronization for buffer
> >object move even thought some driver like radeon slightly abuse it and use them
> >for things like lockup detection.
> >
> >The new fence want to expose an api that would allow some implementation of a
> >timeline. For that it introduces callback and some hard requirement on what the
> >driver have to expose :
> >   enable_signaling
> >   [signaled]
> >   wait
> >
> >Each of those have to do work inside the driver to which the fence belongs and
> >each of those can be call more or less from unexpected (with restriction like
> >outside irq) context. So we end up with thing like :
> >
> >  Process 1              Process 2                   Process 3
> >  I_A_schedule(fence0)
> >                         CI_A_F_B_signaled(fence0)
> >                                                     I_A_signal(fence0)
> >                                                     CI_B_F_A_callback(fence0)
> >                         CI_A_F_B_wait(fence0)
> >Lexique:
> >I_x  in driver x (I_A == in driver A)
> >CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> >
> >So this is an happy mess everyone call everyone and this bound to get messy.
> >Yes i know there is all kind of requirement on what happen once a fence is
> >signaled. But those requirement only looks like they are trying to atone any
> >mess that can happen from the whole callback dance.
> >
> >While i was too seduced by the whole callback idea long time ago, i think it is
> >a highly dangerous path to take where the combinatorial of what could happen
> >are bound to explode with the increase in the number of players.
> >
> >
> >So now back to how to solve the problem we are trying to address. First i want
> >to make an observation, almost all GPU that exist today have a command ring
> >on to which userspace command buffer are executed and inside the command ring
> >you can do something like :
> >
> >   if (condition) execute_command_buffer else skip_command_buffer
> >
> >where condition is a simple expression (memory_address cop value)) with cop one
> >of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >that any gpu that slightly matter can do that. Those who can not should fix
> >there command ring processor.
> >
> >
> >With that in mind, i think proper solution is implementing timeline and having
> >fence be a timeline object with a way simpler api. For each hardware timeline
> >driver provide a system memory address at which the lastest signaled fence
> >sequence number can be read. Each fence object is uniquely associated with
> >both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> >queue.
> >
> >When scheduling something that require synchronization on a hardware timeline
> >a fence is created and associated with the pipeline timeline and hardware
> >timeline. Other hardware block that need to wait on a fence can use there
> >command ring conditional execution to directly check the fence sequence from
> >the other hw block so you do optimistic scheduling. If optimistic scheduling
> >fails (which would be reported by hw block specific solution and hidden) then
> >things can fallback to software cpu wait inside what could be considered the
> >kernel thread of the pipeline timeline.
> >
> >
> > From api point of view there is no inter-driver call. All the driver needs to
> >do is wakeup the pipeline timeline wait_queue when things are signaled or
> >when things go sideway (gpu lockup).
> >
> >
> >So how to implement that with current driver ? Well easy. Currently we assume
> >implicit synchronization so all we need is an implicit pipeline timeline per
> >userspace process (note this do not prevent inter process synchronization).
> >Everytime a command buffer is submitted it is added to the implicit timeline
> >with the simple fence object :
> >
> >struct fence {
> >   struct list_head   list_hwtimeline;
> >   struct list_head   list_pipetimeline;
> >   struct hw_timeline *hw_timeline;
> >   uint64_t           seq_num;
> >   work_t             timedout_work;
> >   void               *csdata;
> >};
> >
> >So with set of helper function call by each of the driver command execution
> >ioctl you have the implicit timeline that is properly populated and each
> >dirver command execution get the dependency from the implicit timeline.
> >
> >
> >Of course to take full advantages of all flexibilities this could offer we
> >would need to allow userspace to create pipeline timeline and to schedule
> >against the pipeline timeline of there choice. We could create file for
> >each of the pipeline timeline and have file operation to wait/query
> >progress.
> >
> >Note that the gpu lockup are considered exceptional event, the implicit
> >timeline will probably want to continue on other job on other hardware
> >block but the explicit one probably will want to decide wether to continue
> >or abort or retry without the fault hw block.
> >
> >
> >I realize i am late to the party and that i should have taken a serious
> >look at all this long time ago. I apologize for that and if you consider
> >this is to late then just ignore me modulo the big warning the crazyness
> >that callback will introduce an how bad things bound to happen. I am not
> >saying that bad things can not happen with what i propose just that
> >because everything happen inside the process context that is the one
> >asking/requiring synchronization there will be not interprocess kernel
> >callback (a callback that was registered by one process and that is call
> >inside another process time slice because fence signaling is happening
> >inside this other process time slice).
> >
> >
> >Pseudo code for explicitness :
> >
> >drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> >{
> >    struct fence *dependency[16], *fence;
> >    int m;
> >
> >    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
> >                          dependency, 16, &fence);
> >    if (m < 0)
> >      return m;
> >    if (m >= 16) {
> >        // alloc m and recall;
> >    }
> >    dev->cs_ioctl(dev, data, filp, dev->implicit_pipeline, dependency, fence);
> >}
> >
> >int timeline_schedule(ptimeline, hwtimeline, timeout,
> >                        dependency, mdep, **fence)
> >{
> >    // allocate fence set hw_timeline and init work
> >    // build up list of dependency by looking at list of pending fence in
> >    // timeline
> >}
> >
> >
> >
> >// If device driver schedule job hopping for all dependency to be signaled then
> >// it must also call this function with csdata being a copy of what needs to be
> >// executed once all dependency are signaled
> >void timeline_missed_schedule(timeline, fence, void *csdata)
> >{
> >    INITWORK(fence->work, timeline_missed_schedule_worker)
> >    fence->csdata = csdata;
> >    schedule_delayed_work(fence->work, default_timeout)
> >}
> >
> >void timeline_missed_schedule_worker(work)
> >{
> >    driver = driver_from_fence_hwtimeline(fence)
> >
> >    // Make sure that each of the hwtimeline dependency will fire irq by
> >    // calling a driver function.
> >    timeline_wait_for_fence_dependency(fence);
> >    driver->execute_cs(driver, fence);
> >}
> >
> >// This function is call by driver code that signal fence (could be call from
> >// interrupt context). It is responsabilities of device driver to call that
> >// function.
> >void timeline_signal(hwtimeline)
> >{
> >   for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
> >     wakeup(fence->pipetimeline->wait_queue);
> >   }
> >}
> >
> >
> >Cheers,
> >Jérôme
> 


* Re: Fence, timeline and android sync points
  2014-08-13 13:41   ` Jerome Glisse
@ 2014-08-13 14:08     ` Christian König
  2014-08-13 15:56       ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Christian König @ 2014-08-13 14:08 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: daniel.vetter, bskeggs, dri-devel

> The whole issue is that today cs ioctl assume implied synchronization. So this
> can not change, so for now anything that goes through cs ioctl would need to
> use an implied timeline and have all ring that use common buffer synchronize
> on it. As long as those ring use different buffer there is no need for sync.
Exactly my thoughts.

> Buffer object are what links hw timeline.
A couple of people at AMD have a problem with that and I'm currently 
working full time on a solution. But solving this and keeping 100% 
backward compatibility at the same time is not an easy task.

> Of course there might be way to be more flexible if timeline are expose to
> userspace and userspace can create several of them for a single process.
Concurrent execution is mostly used for transient things, e.g. copying a 
result to a userspace buffer while VCE is decoding into the ring buffer 
at a different location. Creating an extra timeline just to 
tell the kernel that two commands are allowed to run in parallel sounds 
like too much overhead to me.

Cheers,
Christian.

Am 13.08.2014 um 15:41 schrieb Jerome Glisse:
> On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
>> Hi Jerome,
>>
>> first of all that finally sounds like somebody starts to draw the whole
>> picture for me.
>>
>> So far all I have seen was a bunch of specialized requirements and some not
>> so obvious design decisions based on those requirements.
>>
>> So thanks a lot for finally summarizing the requirements from a top above
>> view and I perfectly agree with your analysis of the current fence design
>> and the downsides of that API.
>>
>> Apart from that I also have some comments / requirements that hopefully can
>> be taken into account as well:
>>
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline point.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>> In the long term a requirement for the driver for AMD GFX hardware is that
>> instead of a fixed pipeline timeline we need a bit more flexible model where
>> concurrent execution on different hardware engines is possible as well.
>>
>> So the requirement is that you can do things like submitting a 3D job A, a
>> DMA job B, a VCE job C and another 3D job D that are executed like this:
>>      A
>>     /  \
>>    B  C
>>     \  /
>>      D
>>
>> (Let's just hope that looks as good on your mail client as it looked for
>> me).
> My thinking of hw timeline is that a gpu like amd or nvidia would have several
> different hw timeline. They are per block/engine so one for dma ring, one for
> gfx, one for vce, ....
>
>   
>> My current thinking is that we avoid having a pipeline object in the kernel
>> and instead letting userspace specify which fence we want to synchronize to
>> explicitly as long as everything stays withing the same client. As soon as
>> any buffer is shared between clients the kernel we would need to fall back
>> to implicitly synchronization to allow backward compatibility with DRI2/3.
> The whole issue is that today cs ioctl assume implied synchronization. So this
> can not change, so for now anything that goes through cs ioctl would need to
> use an implied timeline and have all ring that use common buffer synchronize
> on it. As long as those ring use different buffer there is no need for sync.
>
> Buffer object are what links hw timeline.
>
> Of course there might be way to be more flexible if timeline are expose to
> userspace and userspace can create several of them for a single process.
>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value)) with cop one
>>> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matter can do that. Those who can not should fix
>>> there command ring processor.
>> At least for some engines on AMD hardware that isn't possible (UVD, VCE and
>> in some extends DMA as well), but I don't see any reason why we shouldn't be
>> able to use software based scheduling on those engines by default. So this
>> isn't really a problem, but just an additional comment to keep in mind.
> Yes not everything can do that but as it's a simple memory access with simple
> comparison then it's easy to do on cpu for limited hardware. But this really
> sounds like something so easy to add to hw ring execution that it is a shame
> hw designer do not already added such thing.
>
>> Regards,
>> Christian.
>>
>> Am 13.08.2014 um 00:13 schrieb Jerome Glisse:
>>> Hi,
>>>
>>> So i want over the whole fence and sync point stuff as it's becoming a pressing
>>> issue. I think we first need to agree on what is the problem we want to solve
>>> and what would be the requirements to solve it.
>>>
>>> Problem :
>>>    Explicit synchronization btw different hardware block over a buffer object.
>>>
>>> Requirements :
>>>    Share common infrastructure.
>>>    Allow optimal hardware command stream scheduling accross hardware block.
>>>    Allow android sync point to be implemented on top of it.
>>>    Handle/acknowledge exception (like good old gpu lockup).
>>>    Minimize driver changes.
>>>
>>> Glossary :
>>>    hardware timeline: timeline bound to a specific hardware block.
>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>                       point on that timeline can be a composite of several
>>>                       different hardware pipeline point.
>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>              of each of the application graphic operations.
>>>    fence: specific point in a timeline where synchronization needs to happen.
>>>
>>>
>>> So now, current include/linux/fence.h implementation is i believe missing the
>>> objective by confusing hardware and pipeline timeline and by bolting fence to
>>> buffer object while what is really needed is true and proper timeline for both
>>> hardware and pipeline. But before going further down that road let me look at
>>> things and explain how i see them.
>>>
>>> Current ttm fence have one and a sole purpose, allow synchronization for buffer
>>> object move even thought some driver like radeon slightly abuse it and use them
>>> for things like lockup detection.
>>>
>>> The new fence want to expose an api that would allow some implementation of a
>>> timeline. For that it introduces callback and some hard requirement on what the
>>> driver have to expose :
>>>    enable_signaling
>>>    [signaled]
>>>    wait
>>>
>>> Each of those have to do work inside the driver to which the fence belongs and
>>> each of those can be call more or less from unexpected (with restriction like
>>> outside irq) context. So we end up with thing like :
>>>
>>>   Process 1              Process 2                   Process 3
>>>   I_A_schedule(fence0)
>>>                          CI_A_F_B_signaled(fence0)
>>>                                                      I_A_signal(fence0)
>>>                                                      CI_B_F_A_callback(fence0)
>>>                          CI_A_F_B_wait(fence0)
>>> Lexique:
>>> I_x  in driver x (I_A == in driver A)
>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>>>
>>> So this is an happy mess everyone call everyone and this bound to get messy.
>>> Yes i know there is all kind of requirement on what happen once a fence is
>>> signaled. But those requirement only looks like they are trying to atone any
>>> mess that can happen from the whole callback dance.
>>>
>>> While i was too seduced by the whole callback idea long time ago, i think it is
>>> a highly dangerous path to take where the combinatorial of what could happen
>>> are bound to explode with the increase in the number of players.
>>>
>>>
>>> So now back to how to solve the problem we are trying to address. First i want
>>> to make an observation, almost all GPU that exist today have a command ring
>>> on to which userspace command buffer are executed and inside the command ring
>>> you can do something like :
>>>
>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>
>>> where condition is a simple expression (memory_address cop value)) with cop one
>>> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
>>> that any gpu that slightly matter can do that. Those who can not should fix
>>> there command ring processor.
>>>
>>>
>>> With that in mind, i think proper solution is implementing timeline and having
>>> fence be a timeline object with a way simpler api. For each hardware timeline
>>> driver provide a system memory address at which the lastest signaled fence
>>> sequence number can be read. Each fence object is uniquely associated with
>>> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
>>> queue.
>>>
>>> When scheduling something that require synchronization on a hardware timeline
>>> a fence is created and associated with the pipeline timeline and hardware
>>> timeline. Other hardware block that need to wait on a fence can use there
>>> command ring conditional execution to directly check the fence sequence from
>>> the other hw block so you do optimistic scheduling. If optimistic scheduling
>>> fails (which would be reported by hw block specific solution and hidden) then
>>> things can fallback to software cpu wait inside what could be considered the
>>> kernel thread of the pipeline timeline.
>>>
>>>
>>>  From api point of view there is no inter-driver call. All the driver needs to
>>> do is wakeup the pipeline timeline wait_queue when things are signaled or
>>> when things go sideway (gpu lockup).
>>>
>>>
>>> So how to implement that with current driver ? Well easy. Currently we assume
>>> implicit synchronization so all we need is an implicit pipeline timeline per
>>> userspace process (note this do not prevent inter process synchronization).
>>> Everytime a command buffer is submitted it is added to the implicit timeline
>>> with the simple fence object :
>>>
>>> struct fence {
>>>    struct list_head   list_hwtimeline;
>>>    struct list_head   list_pipetimeline;
>>>    struct hw_timeline *hw_timeline;
>>>    uint64_t           seq_num;
>>>    work_t             timedout_work;
>>>    void               *csdata;
>>> };
>>>
>>> So with set of helper function call by each of the driver command execution
>>> ioctl you have the implicit timeline that is properly populated and each
>>> dirver command execution get the dependency from the implicit timeline.
>>>
>>>
>>> Of course to take full advantages of all flexibilities this could offer we
>>> would need to allow userspace to create pipeline timeline and to schedule
>>> against the pipeline timeline of there choice. We could create file for
>>> each of the pipeline timeline and have file operation to wait/query
>>> progress.
>>>
>>> Note that the gpu lockup are considered exceptional event, the implicit
>>> timeline will probably want to continue on other job on other hardware
>>> block but the explicit one probably will want to decide wether to continue
>>> or abort or retry without the fault hw block.
>>>
>>>
>>> I realize i am late to the party and that i should have taken a serious
>>> look at all this long time ago. I apologize for that and if you consider
>>> this is to late then just ignore me modulo the big warning the crazyness
>>> that callback will introduce an how bad things bound to happen. I am not
>>> saying that bad things can not happen with what i propose just that
>>> because everything happen inside the process context that is the one
>>> asking/requiring synchronization there will be not interprocess kernel
>>> callback (a callback that was registered by one process and that is call
>>> inside another process time slice because fence signaling is happening
>>> inside this other process time slice).
>>>
>>>
>>> Pseudo code for explicitness :
>>>
>>> drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
>>> {
>>>     struct fence *dependency[16], *fence;
>>>     int m;
>>>
>>>     m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
>>>                           dependency, 16, &fence);
>>>     if (m < 0)
>>>       return m;
>>>     if (m >= 16) {
>>>         // alloc m and recall;
>>>     }
>>>     dev->cs_ioctl(dev, data, filp, dev->implicit_pipeline, dependency, fence);
>>> }
>>>
>>> int timeline_schedule(ptimeline, hwtimeline, timeout,
>>>                         dependency, mdep, **fence)
>>> {
>>>     // allocate fence set hw_timeline and init work
>>>     // build up list of dependency by looking at list of pending fence in
>>>     // timeline
>>> }
>>>
>>>
>>>
>>> // If device driver schedule job hopping for all dependency to be signaled then
>>> // it must also call this function with csdata being a copy of what needs to be
>>> // executed once all dependency are signaled
>>> void timeline_missed_schedule(timeline, fence, void *csdata)
>>> {
>>>     INITWORK(fence->work, timeline_missed_schedule_worker)
>>>     fence->csdata = csdata;
>>>     schedule_delayed_work(fence->work, default_timeout)
>>> }
>>>
>>> void timeline_missed_schedule_worker(work)
>>> {
>>>     driver = driver_from_fence_hwtimeline(fence)
>>>
>>>     // Make sure that each of the hwtimeline dependency will fire irq by
>>>     // calling a driver function.
>>>     timeline_wait_for_fence_dependency(fence);
>>>     driver->execute_cs(driver, fence);
>>> }
>>>
>>> // This function is call by driver code that signal fence (could be call from
>>> // interrupt context). It is responsabilities of device driver to call that
>>> // function.
>>> void timeline_signal(hwtimeline)
>>> {
>>>    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
>>>      wakeup(fence->pipetimeline->wait_queue);
>>>    }
>>> }
>>>
>>>
>>> Cheers,
>>> Jérôme


* Re: Fence, timeline and android sync points
  2014-08-13 13:36   ` Jerome Glisse
@ 2014-08-13 15:54     ` Daniel Vetter
  2014-08-13 17:07       ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-13 15:54 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: daniel.vetter, dri-devel, bskeggs

On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> > > Hi,
> > > 
> > > So i want over the whole fence and sync point stuff as it's becoming a pressing
> > > issue. I think we first need to agree on what is the problem we want to solve
> > > and what would be the requirements to solve it.
> > > 
> > > Problem :
> > >   Explicit synchronization btw different hardware block over a buffer object.
> > > 
> > > Requirements :
> > >   Share common infrastructure.
> > >   Allow optimal hardware command stream scheduling accross hardware block.
> > >   Allow android sync point to be implemented on top of it.
> > >   Handle/acknowledge exception (like good old gpu lockup).
> > >   Minimize driver changes.
> > > 
> > > Glossary :
> > >   hardware timeline: timeline bound to a specific hardware block.
> > >   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> > >                      point on that timeline can be a composite of several
> > >                      different hardware pipeline point.
> > >   pipeline: abstract object representing userspace application graphic pipeline
> > >             of each of the application graphic operations.
> > >   fence: specific point in a timeline where synchronization needs to happen.
> > > 
> > > 
> > > So now, current include/linux/fence.h implementation is i believe missing the
> > > objective by confusing hardware and pipeline timeline and by bolting fence to
> > > buffer object while what is really needed is true and proper timeline for both
> > > hardware and pipeline. But before going further down that road let me look at
> > > things and explain how i see them.
> > 
> > fences can be used free-standing and no one forces you to integrate them
> > with buffers. We actually plan to go this way with the intel svm stuff.
> > Ofc for dma-buf the plan is to synchronize using such fences, but that's
> > somewhat orthogonal I think. At least you only talk about fences and
> > timeline and not dma-buf here.
> >  
> > > Current ttm fence have one and a sole purpose, allow synchronization for buffer
> > > object move even thought some driver like radeon slightly abuse it and use them
> > > for things like lockup detection.
> > > 
> > > The new fence want to expose an api that would allow some implementation of a
> > > timeline. For that it introduces callback and some hard requirement on what the
> > > driver have to expose :
> > >   enable_signaling
> > >   [signaled]
> > >   wait
> > > 
> > > Each of those have to do work inside the driver to which the fence belongs and
> > > each of those can be call more or less from unexpected (with restriction like
> > > outside irq) context. So we end up with thing like :
> > > 
> > >  Process 1              Process 2                   Process 3
> > >  I_A_schedule(fence0)
> > >                         CI_A_F_B_signaled(fence0)
> > >                                                     I_A_signal(fence0)
> > >                                                     CI_B_F_A_callback(fence0)
> > >                         CI_A_F_B_wait(fence0)
> > > Lexique:
> > > I_x  in driver x (I_A == in driver A)
> > > CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> > > 
> > > So this is an happy mess everyone call everyone and this bound to get messy.
> > > Yes i know there is all kind of requirement on what happen once a fence is
> > > signaled. But those requirement only looks like they are trying to atone any
> > > mess that can happen from the whole callback dance.
> > > 
> > > While i was too seduced by the whole callback idea long time ago, i think it is
> > > a highly dangerous path to take where the combinatorial of what could happen
> > > are bound to explode with the increase in the number of players.
> > > 
> > > 
> > > So now back to how to solve the problem we are trying to address. First i want
> > > to make an observation, almost all GPU that exist today have a command ring
> > > on to which userspace command buffer are executed and inside the command ring
> > > you can do something like :
> > > 
> > >   if (condition) execute_command_buffer else skip_command_buffer
> > > 
> > > where condition is a simple expression (memory_address cop value)) with cop one
> > > of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> > > that any gpu that slightly matter can do that. Those who can not should fix
> > > there command ring processor.
> > > 
> > > 
> > > With that in mind, i think proper solution is implementing timeline and having
> > > fence be a timeline object with a way simpler api. For each hardware timeline
> > > driver provide a system memory address at which the lastest signaled fence
> > > sequence number can be read. Each fence object is uniquely associated with
> > > both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> > > queue.
> > > 
> > > When scheduling something that require synchronization on a hardware timeline
> > > a fence is created and associated with the pipeline timeline and hardware
> > > timeline. Other hardware block that need to wait on a fence can use there
> > > command ring conditional execution to directly check the fence sequence from
> > > the other hw block so you do optimistic scheduling. If optimistic scheduling
> > > fails (which would be reported by hw block specific solution and hidden) then
> > > things can fallback to software cpu wait inside what could be considered the
> > > kernel thread of the pipeline timeline.
> > > 
> > > 
> > > From api point of view there is no inter-driver call. All the driver needs to
> > > do is wakeup the pipeline timeline wait_queue when things are signaled or
> > > when things go sideway (gpu lockup).
> > > 
> > > 
> > > So how to implement that with current driver ? Well easy. Currently we assume
> > > implicit synchronization so all we need is an implicit pipeline timeline per
> > > userspace process (note this do not prevent inter process synchronization).
> > > Everytime a command buffer is submitted it is added to the implicit timeline
> > > with the simple fence object :
> > > 
> > > struct fence {
> > >   struct list_head   list_hwtimeline;
> > >   struct list_head   list_pipetimeline;
> > >   struct hw_timeline *hw_timeline;
> > >   uint64_t           seq_num;
> > >   work_t             timedout_work;
> > >   void               *csdata;
> > > };
> > > 
> > > So with set of helper function call by each of the driver command execution
> > > ioctl you have the implicit timeline that is properly populated and each
> > > dirver command execution get the dependency from the implicit timeline.
> > > 
> > > 
> > > Of course to take full advantages of all flexibilities this could offer we
> > > would need to allow userspace to create pipeline timeline and to schedule
> > > against the pipeline timeline of there choice. We could create file for
> > > each of the pipeline timeline and have file operation to wait/query
> > > progress.
> > > 
> > > Note that the gpu lockup are considered exceptional event, the implicit
> > > timeline will probably want to continue on other job on other hardware
> > > block but the explicit one probably will want to decide wether to continue
> > > or abort or retry without the fault hw block.
> > > 
> > > 
> > > I realize i am late to the party and that i should have taken a serious
> > > look at all this long time ago. I apologize for that and if you consider
> > > this is to late then just ignore me modulo the big warning the crazyness
> > > that callback will introduce an how bad things bound to happen. I am not
> > > saying that bad things can not happen with what i propose just that
> > > because everything happen inside the process context that is the one
> > > asking/requiring synchronization there will be not interprocess kernel
> > > callback (a callback that was registered by one process and that is call
> > > inside another process time slice because fence signaling is happening
> > > inside this other process time slice).
> > 
> > So I read through it all and presuming I understand it correctly your
> > proposal and what we currently have is about the same. The big difference
> > is that you make a timeline a first-class object and move the callback
> > queue from the fence to the timeline, which requires callers to check the
> > fence/seqno/whatever themselves instead of pushing that responsibility to
> > callers.
> 
> No, big difference is that there is no callback thus when waiting for a
> fence you are either inside the process context that need to wait for it
> or your inside a kernel thread process context. Which means in both case
> you can do whatever you want. What i hate about the fence code as it is,
> is the callback stuff, because you never know into which context fence
> are signaled then you never know into which context callback are executed.

Look at waitqueues a bit closer. They're implemented with callbacks ;-)
The only difference is that you're allowed to have spurious wakeups and
need to handle that somehow, so you need a separate check function.
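For illustration, with the names from the proposal above (hw_seq_signaled()
being a hypothetical lockless seqno test and signaled_seq an assumed field),
that is exactly the wait_event() pattern: the waitqueue entry is the callback,
and the condition is the separate check re-run on every, possibly spurious,
wakeup:

static int pipe_timeline_wait(struct fence *fence)
{
    return wait_event_interruptible(fence->pipetimeline->wait_queue,
                hw_seq_signaled(fence->hw_timeline->signaled_seq,
                                fence->seq_num));
}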

> > If you actually mandate that the fence is just a seqno or similar which
> > can be read lockless then I could register my own special callback into
> > that waitqueue (other stuff than waking up threads is allowed) and from
> > hard-irq context check the seqno and readd my own callback if that's not
> > yet happened (that needs to go through some other context for hilarity).
> 
> Yes mandating a simple number that can be read from anywhere without lock,
> i am pretty sure all hw can write to system page and can write a value
> alongside their command buffer. So either you hw support reading and testing
> value either you can do it in atomic context right before scheduling.

Imo that's a step in the wrong direction: whether the check reads a bit of system
memory, a bit of irq-safe spinlock-protected data, or a register shouldn't
matter, yet you arbitrarily disallow the latter two. And allowing random
other kernel subsystems to read mmio or page mappings not under their
control is an idea that freaks /me/ out.
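For what it's worth, a driver's signaled() hook (one of the three hooks listed
earlier in the thread) can hide either kind of check behind the same interface;
a rough sketch with made-up drv_* names:

static bool drv_fence_signaled(struct fence *f)
{
    struct drv_fence *df = container_of(f, struct drv_fence, base);

    // could be a plain system memory read, a readl() on a status register,
    // or a lookup under an irq-safe spinlock -- the caller cannot tell
    return drv_last_signaled_seq(df->ring) >= df->seqno;
}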

> > So from that pov (presuming I didn't miss anything) your proposal is
> > identical to what we have, minor some different color choices (like where
> > to place the callback queue).
> 
> No callback is the mantra here, and instead of bolting free living fence
> to buffer object, they are associated with timeline which means you do not
> need to go over all buffer object to know what you need to wait for.

Ok, then I guess I didn't understand that part of your proposal. Can
you please elaborate a bit more on how you want to synchronize multiple
drivers accessing a dma-buf object, and what piece of state we need to
associate with the dma-buf to make this happen?

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


* Re: Fence, timeline and android sync points
  2014-08-13 14:08     ` Christian König
@ 2014-08-13 15:56       ` Jerome Glisse
  0 siblings, 0 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-13 15:56 UTC (permalink / raw)
  To: Christian König; +Cc: daniel.vetter, bskeggs, dri-devel

On Wed, Aug 13, 2014 at 04:08:14PM +0200, Christian König wrote:
> >The whole issue is that today cs ioctl assume implied synchronization. So this
> >can not change, so for now anything that goes through cs ioctl would need to
> >use an implied timeline and have all ring that use common buffer synchronize
> >on it. As long as those ring use different buffer there is no need for sync.
> Exactly my thoughts.
> 
> >Buffer object are what links hw timeline.
> A couple of people at AMD have a problem with that and I'm currently working
> full time on a solution. But solving this and keeping 100% backward
> compatibility at the same time is not an easy task.

Let me rephrase: with the current cs ioctl, forcing synchronization for every buffer
that appears on different hw rings is mandatory and there is no way to fix that.

That being said, one can imagine a single buffer where one engine works on a
region of it and another hw block on another, non-overlapping region, in which
case there is no need for synchronization between those hw blocks (like
multi-GPU, each rendering one half of the screen). But to properly do such a thing
you need to expose timelines or something like that to userspace and have
userspace emit sync points on those timelines. So something like:

cs_ioctl(timeline, cs)                          // returns the cs seqno and the hwtimeline_id it was scheduled on
timeline_sync(nsync, seqno[], hwtimeline_id[])  // inserts a sync point built from those (seqno, hw timeline) pairs

When you schedule something using a new ioctl that just takes a timeline as an
extra parameter, you add no synchronization to the timeline; you assume that
userspace will call timeline_sync, which inserts a synchronization point on the
timeline. So you can schedule a bunch of cs on different hw blocks, userspace
keeps track of the last emitted cs seqno and its associated hw timeline, and when
it wants to synchronize it calls timeline_sync, after which any new cs ioctl on
that timeline will have to wait before being able to schedule.

So really I do not see how to fix that properly without a new cs ioctl that
just takes an extra timeline as a parameter (well, in the case of radeon we can add
a timeline chunk to the cs ioctl).
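As a rough userspace sketch (pseudocode in the same spirit as the rest of the
thread; cs_ioctl() is assumed to hand back both the cs seqno and the hw timeline
id, and which timeline a timeline_sync() applies to is left implicit here),
Christian's A / {B, C} / D example would become:

a = cs_ioctl(timeline, cs_3d_a);                 // 3D job A

// B and C only need to wait for A, so emit one sync point for A ...
timeline_sync(1, (uint64_t[]){ a.seqno }, (int[]){ a.hwtimeline });
b = cs_ioctl(timeline, cs_dma_b);                // DMA job B
c = cs_ioctl(timeline, cs_vce_c);                // VCE job C, concurrent with B

// ... and D has to wait for both B and C
timeline_sync(2, (uint64_t[]){ b.seqno, c.seqno },
                 (int[]){ b.hwtimeline, c.hwtimeline });
d = cs_ioctl(timeline, cs_3d_d);                 // 3D job D

No other synchronization point is emitted on that timeline, so B and C remain
free to run concurrently.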

> 
> >Of course there might be way to be more flexible if timeline are expose to
> >userspace and userspace can create several of them for a single process.
> Concurrent execution is mostly used for temporary things e.g. copying a
> result to a userspace buffer while VCE is decoding into the ring buffer at a
> different location for example. Creating an extra timeline just to tell the
> kernel that two commands are allowed to run in parallel sounds like to much
> overhead to me.

It was never my intention to require creating different timelines; as said above,
when scheduling with an explicit timeline, each time you schedule there is no
synchronization whatsoever. The only time an engine has to wait is when
userspace emits an explicit sync point, as said above.

Allowing multiple timelines per process is more for the case where you have a
process working on two distinct problems with no interdependency.
Hence one timeline for each of those tasks, but inside each task you can schedule
things concurrently as described above.

> 
> Cheers,
> Christian.
> 
> Am 13.08.2014 um 15:41 schrieb Jerome Glisse:
> >On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
> >>Hi Jerome,
> >>
> >>first of all that finally sounds like somebody starts to draw the whole
> >>picture for me.
> >>
> >>So far all I have seen was a bunch of specialized requirements and some not
> >>so obvious design decisions based on those requirements.
> >>
> >>So thanks a lot for finally summarizing the requirements from a top above
> >>view and I perfectly agree with your analysis of the current fence design
> >>and the downsides of that API.
> >>
> >>Apart from that I also have some comments / requirements that hopefully can
> >>be taken into account as well:
> >>
> >>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >>>                      point on that timeline can be a composite of several
> >>>                      different hardware pipeline point.
> >>>   pipeline: abstract object representing userspace application graphic pipeline
> >>>             of each of the application graphic operations.
> >>In the long term a requirement for the driver for AMD GFX hardware is that
> >>instead of a fixed pipeline timeline we need a bit more flexible model where
> >>concurrent execution on different hardware engines is possible as well.
> >>
> >>So the requirement is that you can do things like submitting a 3D job A, a
> >>DMA job B, a VCE job C and another 3D job D that are executed like this:
> >>     A
> >>    /  \
> >>   B  C
> >>    \  /
> >>     D
> >>
> >>(Let's just hope that looks as good on your mail client as it looked for
> >>me).
> >My thinking of hw timeline is that a gpu like amd or nvidia would have several
> >different hw timeline. They are per block/engine so one for dma ring, one for
> >gfx, one for vce, ....
> >
> >>My current thinking is that we avoid having a pipeline object in the kernel
> >>and instead letting userspace specify which fence we want to synchronize to
> >>explicitly as long as everything stays withing the same client. As soon as
> >>any buffer is shared between clients the kernel we would need to fall back
> >>to implicitly synchronization to allow backward compatibility with DRI2/3.
> >The whole issue is that today cs ioctl assume implied synchronization. So this
> >can not change, so for now anything that goes through cs ioctl would need to
> >use an implied timeline and have all ring that use common buffer synchronize
> >on it. As long as those ring use different buffer there is no need for sync.
> >
> >Buffer object are what links hw timeline.
> >
> >Of course there might be way to be more flexible if timeline are expose to
> >userspace and userspace can create several of them for a single process.
> >
> >>>   if (condition) execute_command_buffer else skip_command_buffer
> >>>
> >>>where condition is a simple expression (memory_address cop value)) with cop one
> >>>of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >>>that any gpu that slightly matter can do that. Those who can not should fix
> >>>there command ring processor.
> >>At least for some engines on AMD hardware that isn't possible (UVD, VCE and
> >>in some extends DMA as well), but I don't see any reason why we shouldn't be
> >>able to use software based scheduling on those engines by default. So this
> >>isn't really a problem, but just an additional comment to keep in mind.
> >Yes not everything can do that but as it's a simple memory access with simple
> >comparison then it's easy to do on cpu for limited hardware. But this really
> >sounds like something so easy to add to hw ring execution that it is a shame
> >hw designer do not already added such thing.
> >
> >>Regards,
> >>Christian.
> >>
> >>Am 13.08.2014 um 00:13 schrieb Jerome Glisse:
> >>>Hi,
> >>>
> >>>So i want over the whole fence and sync point stuff as it's becoming a pressing
> >>>issue. I think we first need to agree on what is the problem we want to solve
> >>>and what would be the requirements to solve it.
> >>>
> >>>Problem :
> >>>   Explicit synchronization btw different hardware block over a buffer object.
> >>>
> >>>Requirements :
> >>>   Share common infrastructure.
> >>>   Allow optimal hardware command stream scheduling accross hardware block.
> >>>   Allow android sync point to be implemented on top of it.
> >>>   Handle/acknowledge exception (like good old gpu lockup).
> >>>   Minimize driver changes.
> >>>
> >>>Glossary :
> >>>   hardware timeline: timeline bound to a specific hardware block.
> >>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >>>                      point on that timeline can be a composite of several
> >>>                      different hardware pipeline point.
> >>>   pipeline: abstract object representing userspace application graphic pipeline
> >>>             of each of the application graphic operations.
> >>>   fence: specific point in a timeline where synchronization needs to happen.
> >>>
> >>>
> >>>So now, current include/linux/fence.h implementation is i believe missing the
> >>>objective by confusing hardware and pipeline timeline and by bolting fence to
> >>>buffer object while what is really needed is true and proper timeline for both
> >>>hardware and pipeline. But before going further down that road let me look at
> >>>things and explain how i see them.
> >>>
> >>>Current ttm fence have one and a sole purpose, allow synchronization for buffer
> >>>object move even thought some driver like radeon slightly abuse it and use them
> >>>for things like lockup detection.
> >>>
> >>>The new fence want to expose an api that would allow some implementation of a
> >>>timeline. For that it introduces callback and some hard requirement on what the
> >>>driver have to expose :
> >>>   enable_signaling
> >>>   [signaled]
> >>>   wait
> >>>
> >>>Each of those have to do work inside the driver to which the fence belongs and
> >>>each of those can be call more or less from unexpected (with restriction like
> >>>outside irq) context. So we end up with thing like :
> >>>
> >>>  Process 1              Process 2                   Process 3
> >>>  I_A_schedule(fence0)
> >>>                         CI_A_F_B_signaled(fence0)
> >>>                                                     I_A_signal(fence0)
> >>>                                                     CI_B_F_A_callback(fence0)
> >>>                         CI_A_F_B_wait(fence0)
> >>>Lexique:
> >>>I_x  in driver x (I_A == in driver A)
> >>>CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> >>>
> >>>So this is an happy mess everyone call everyone and this bound to get messy.
> >>>Yes i know there is all kind of requirement on what happen once a fence is
> >>>signaled. But those requirement only looks like they are trying to atone any
> >>>mess that can happen from the whole callback dance.
> >>>
> >>>While i was too seduced by the whole callback idea long time ago, i think it is
> >>>a highly dangerous path to take where the combinatorial of what could happen
> >>>are bound to explode with the increase in the number of players.
> >>>
> >>>
> >>>So now back to how to solve the problem we are trying to address. First i want
> >>>to make an observation, almost all GPU that exist today have a command ring
> >>>on to which userspace command buffer are executed and inside the command ring
> >>>you can do something like :
> >>>
> >>>   if (condition) execute_command_buffer else skip_command_buffer
> >>>
> >>>where condition is a simple expression (memory_address cop value)) with cop one
> >>>of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >>>that any gpu that slightly matter can do that. Those who can not should fix
> >>>there command ring processor.
> >>>
> >>>
> >>>With that in mind, i think proper solution is implementing timeline and having
> >>>fence be a timeline object with a way simpler api. For each hardware timeline
> >>>driver provide a system memory address at which the lastest signaled fence
> >>>sequence number can be read. Each fence object is uniquely associated with
> >>>both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> >>>queue.
> >>>
> >>>When scheduling something that require synchronization on a hardware timeline
> >>>a fence is created and associated with the pipeline timeline and hardware
> >>>timeline. Other hardware block that need to wait on a fence can use there
> >>>command ring conditional execution to directly check the fence sequence from
> >>>the other hw block so you do optimistic scheduling. If optimistic scheduling
> >>>fails (which would be reported by hw block specific solution and hidden) then
> >>>things can fallback to software cpu wait inside what could be considered the
> >>>kernel thread of the pipeline timeline.
> >>>
> >>>
> >>> From api point of view there is no inter-driver call. All the driver needs to
> >>>do is wakeup the pipeline timeline wait_queue when things are signaled or
> >>>when things go sideway (gpu lockup).
> >>>
> >>>
> >>>So how to implement that with current driver ? Well easy. Currently we assume
> >>>implicit synchronization so all we need is an implicit pipeline timeline per
> >>>userspace process (note this do not prevent inter process synchronization).
> >>>Everytime a command buffer is submitted it is added to the implicit timeline
> >>>with the simple fence object :
> >>>
> >>>struct fence {
> >>>   struct list_head   list_hwtimeline;
> >>>   struct list_head   list_pipetimeline;
> >>>   struct hw_timeline *hw_timeline;
> >>>   uint64_t           seq_num;
> >>>   work_t             timedout_work;
> >>>   void               *csdata;
> >>>};
> >>>
> >>>So with set of helper function call by each of the driver command execution
> >>>ioctl you have the implicit timeline that is properly populated and each
> >>>dirver command execution get the dependency from the implicit timeline.
> >>>
> >>>
> >>>Of course to take full advantages of all flexibilities this could offer we
> >>>would need to allow userspace to create pipeline timeline and to schedule
> >>>against the pipeline timeline of there choice. We could create file for
> >>>each of the pipeline timeline and have file operation to wait/query
> >>>progress.
> >>>
> >>>Note that the gpu lockup are considered exceptional event, the implicit
> >>>timeline will probably want to continue on other job on other hardware
> >>>block but the explicit one probably will want to decide wether to continue
> >>>or abort or retry without the fault hw block.
> >>>
> >>>
> >>>I realize i am late to the party and that i should have taken a serious
> >>>look at all this long time ago. I apologize for that and if you consider
> >>>this is to late then just ignore me modulo the big warning the crazyness
> >>>that callback will introduce an how bad things bound to happen. I am not
> >>>saying that bad things can not happen with what i propose just that
> >>>because everything happen inside the process context that is the one
> >>>asking/requiring synchronization there will be not interprocess kernel
> >>>callback (a callback that was registered by one process and that is call
> >>>inside another process time slice because fence signaling is happening
> >>>inside this other process time slice).
> >>>
> >>>
> >>>Pseudo code for explicitness :
> >>>
> >>>drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
> >>>{
> >>>    struct fence *dependency[16], *fence;
> >>>    int m;
> >>>
> >>>    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
> >>>                          dependency, 16, &fence);
> >>>    if (m < 0)
> >>>      return m;
> >>>    if (m >= 16) {
> >>>        // alloc m and recall;
> >>>    }
> >>>    dev->cs_ioctl(dev, data, filp, dev->implicit_pipeline, dependency, fence);
> >>>}
> >>>
> >>>int timeline_schedule(ptimeline, hwtimeline, timeout,
> >>>                        dependency, mdep, **fence)
> >>>{
> >>>    // allocate fence set hw_timeline and init work
> >>>    // build up list of dependency by looking at list of pending fence in
> >>>    // timeline
> >>>}
> >>>
> >>>
> >>>
> >>>// If device driver schedule job hopping for all dependency to be signaled then
> >>>// it must also call this function with csdata being a copy of what needs to be
> >>>// executed once all dependency are signaled
> >>>void timeline_missed_schedule(timeline, fence, void *csdata)
> >>>{
> >>>    INITWORK(fence->work, timeline_missed_schedule_worker)
> >>>    fence->csdata = csdata;
> >>>    schedule_delayed_work(fence->work, default_timeout)
> >>>}
> >>>
> >>>void timeline_missed_schedule_worker(work)
> >>>{
> >>>    driver = driver_from_fence_hwtimeline(fence)
> >>>
> >>>    // Make sure that each of the hwtimeline dependency will fire irq by
> >>>    // calling a driver function.
> >>>    timeline_wait_for_fence_dependency(fence);
> >>>    driver->execute_cs(driver, fence);
> >>>}
> >>>
> >>>// This function is call by driver code that signal fence (could be call from
> >>>// interrupt context). It is responsabilities of device driver to call that
> >>>// function.
> >>>void timeline_signal(hwtimeline)
> >>>{
> >>>   for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
> >>>     wakeup(fence->pipetimeline->wait_queue);
> >>>   }
> >>>}
> >>>
> >>>
> >>>Cheers,
> >>>Jérôme
> 


* Re: Fence, timeline and android sync points
  2014-08-13 15:54     ` Daniel Vetter
@ 2014-08-13 17:07       ` Jerome Glisse
  2014-08-14  9:08         ` Daniel Vetter
                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-13 17:07 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: daniel.vetter, dri-devel, bskeggs

On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> > On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> > > On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> > > > Hi,
> > > > 
> > > > So i want over the whole fence and sync point stuff as it's becoming a pressing
> > > > issue. I think we first need to agree on what is the problem we want to solve
> > > > and what would be the requirements to solve it.
> > > > 
> > > > Problem :
> > > >   Explicit synchronization btw different hardware block over a buffer object.
> > > > 
> > > > Requirements :
> > > >   Share common infrastructure.
> > > >   Allow optimal hardware command stream scheduling accross hardware block.
> > > >   Allow android sync point to be implemented on top of it.
> > > >   Handle/acknowledge exception (like good old gpu lockup).
> > > >   Minimize driver changes.
> > > > 
> > > > Glossary :
> > > >   hardware timeline: timeline bound to a specific hardware block.
> > > >   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> > > >                      point on that timeline can be a composite of several
> > > >                      different hardware pipeline point.
> > > >   pipeline: abstract object representing userspace application graphic pipeline
> > > >             of each of the application graphic operations.
> > > >   fence: specific point in a timeline where synchronization needs to happen.
> > > > 
> > > > 
> > > > So now, current include/linux/fence.h implementation is i believe missing the
> > > > objective by confusing hardware and pipeline timeline and by bolting fence to
> > > > buffer object while what is really needed is true and proper timeline for both
> > > > hardware and pipeline. But before going further down that road let me look at
> > > > things and explain how i see them.
> > > 
> > > fences can be used free-standing and no one forces you to integrate them
> > > with buffers. We actually plan to go this way with the intel svm stuff.
> > > Ofc for dma-buf the plan is to synchronize using such fences, but that's
> > > somewhat orthogonal I think. At least you only talk about fences and
> > > timeline and not dma-buf here.
> > >  
> > > > Current ttm fence have one and a sole purpose, allow synchronization for buffer
> > > > object move even thought some driver like radeon slightly abuse it and use them
> > > > for things like lockup detection.
> > > > 
> > > > The new fence want to expose an api that would allow some implementation of a
> > > > timeline. For that it introduces callback and some hard requirement on what the
> > > > driver have to expose :
> > > >   enable_signaling
> > > >   [signaled]
> > > >   wait
> > > > 
> > > > Each of those have to do work inside the driver to which the fence belongs and
> > > > each of those can be call more or less from unexpected (with restriction like
> > > > outside irq) context. So we end up with thing like :
> > > > 
> > > >  Process 1              Process 2                   Process 3
> > > >  I_A_schedule(fence0)
> > > >                         CI_A_F_B_signaled(fence0)
> > > >                                                     I_A_signal(fence0)
> > > >                                                     CI_B_F_A_callback(fence0)
> > > >                         CI_A_F_B_wait(fence0)
> > > > Lexique:
> > > > I_x  in driver x (I_A == in driver A)
> > > > CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> > > > 
> > > > So this is an happy mess everyone call everyone and this bound to get messy.
> > > > Yes i know there is all kind of requirement on what happen once a fence is
> > > > signaled. But those requirement only looks like they are trying to atone any
> > > > mess that can happen from the whole callback dance.
> > > > 
> > > > While i was too seduced by the whole callback idea long time ago, i think it is
> > > > a highly dangerous path to take where the combinatorial of what could happen
> > > > are bound to explode with the increase in the number of players.
> > > > 
> > > > 
> > > > So now back to how to solve the problem we are trying to address. First i want
> > > > to make an observation, almost all GPU that exist today have a command ring
> > > > on to which userspace command buffer are executed and inside the command ring
> > > > you can do something like :
> > > > 
> > > >   if (condition) execute_command_buffer else skip_command_buffer
> > > > 
> > > > where condition is a simple expression (memory_address cop value)) with cop one
> > > > of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> > > > that any gpu that slightly matter can do that. Those who can not should fix
> > > > there command ring processor.
> > > > 
> > > > 
> > > > With that in mind, i think proper solution is implementing timeline and having
> > > > fence be a timeline object with a way simpler api. For each hardware timeline
> > > > driver provide a system memory address at which the lastest signaled fence
> > > > sequence number can be read. Each fence object is uniquely associated with
> > > > both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> > > > queue.
> > > > 
> > > > When scheduling something that require synchronization on a hardware timeline
> > > > a fence is created and associated with the pipeline timeline and hardware
> > > > timeline. Other hardware block that need to wait on a fence can use there
> > > > command ring conditional execution to directly check the fence sequence from
> > > > the other hw block so you do optimistic scheduling. If optimistic scheduling
> > > > fails (which would be reported by hw block specific solution and hidden) then
> > > > things can fallback to software cpu wait inside what could be considered the
> > > > kernel thread of the pipeline timeline.
> > > > 
> > > > 
> > > > From api point of view there is no inter-driver call. All the driver needs to
> > > > do is wakeup the pipeline timeline wait_queue when things are signaled or
> > > > when things go sideway (gpu lockup).
> > > > 
> > > > 
> > > > So how to implement that with current driver ? Well easy. Currently we assume
> > > > implicit synchronization so all we need is an implicit pipeline timeline per
> > > > userspace process (note this do not prevent inter process synchronization).
> > > > Everytime a command buffer is submitted it is added to the implicit timeline
> > > > with the simple fence object :
> > > > 
> > > > struct fence {
> > > >   struct list_head   list_hwtimeline;
> > > >   struct list_head   list_pipetimeline;
> > > >   struct hw_timeline *hw_timeline;
> > > >   uint64_t           seq_num;
> > > >   work_t             timedout_work;
> > > >   void               *csdata;
> > > > };
> > > > 
> > > > So with set of helper function call by each of the driver command execution
> > > > ioctl you have the implicit timeline that is properly populated and each
> > > > dirver command execution get the dependency from the implicit timeline.
> > > > 
> > > > 
> > > > Of course to take full advantages of all flexibilities this could offer we
> > > > would need to allow userspace to create pipeline timeline and to schedule
> > > > against the pipeline timeline of there choice. We could create file for
> > > > each of the pipeline timeline and have file operation to wait/query
> > > > progress.
> > > > 
> > > > Note that the gpu lockup are considered exceptional event, the implicit
> > > > timeline will probably want to continue on other job on other hardware
> > > > block but the explicit one probably will want to decide wether to continue
> > > > or abort or retry without the fault hw block.
> > > > 
> > > > 
> > > > I realize i am late to the party and that i should have taken a serious
> > > > look at all this long time ago. I apologize for that and if you consider
> > > > this is to late then just ignore me modulo the big warning the crazyness
> > > > that callback will introduce an how bad things bound to happen. I am not
> > > > saying that bad things can not happen with what i propose just that
> > > > because everything happen inside the process context that is the one
> > > > asking/requiring synchronization there will be not interprocess kernel
> > > > callback (a callback that was registered by one process and that is call
> > > > inside another process time slice because fence signaling is happening
> > > > inside this other process time slice).
> > > 
> > > So I read through it all and presuming I understand it correctly your
> > > proposal and what we currently have is about the same. The big difference
> > > is that you make a timeline a first-class object and move the callback
> > > queue from the fence to the timeline, which requires callers to check the
> > > fence/seqno/whatever themselves instead of pushing that responsibility to
> > > callers.
> > 
> > No, big difference is that there is no callback thus when waiting for a
> > fence you are either inside the process context that need to wait for it
> > or your inside a kernel thread process context. Which means in both case
> > you can do whatever you want. What i hate about the fence code as it is,
> > is the callback stuff, because you never know into which context fence
> > are signaled then you never know into which context callback are executed.
> 
> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
> The only difference is that you're allowed to have spurious wakeups and
> need to handle that somehow, so need a separate check function.

No, this is not how wait queues are implemented. A wait queue does not call back
into a random function from a random driver; it calls a limited set of functions
from the core linux kernel scheduler, so that the process thread that was waiting
and had been taken off the scheduler list is added back and marked as runnable.
Unless this part of the kernel drastically changed for the worse recently.

So this is fundamentally different: fences as they are now allow random driver
callbacks, and this is bound to get ugly. It is bound to lead to one driver
doing something that seems innocuous but turns out to wreak havoc when called
from some other driver's function.
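
To make the contrast concrete, here is a minimal sketch of the no-callback wait I
have in mind. Everything except the wait_event_timeout()/wake_up_all() primitives
is a made-up name, purely for illustration, not an existing api:

/*
 * Sketch only: hw_timeline, pipe_timeline and this fence layout are
 * illustrative. The waiter stays in its own process context and re-reads
 * the seqno page; the signaling driver only ever calls wake_up_all()
 * from its irq handler, never a driver-provided callback.
 */
#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/wait.h>

struct hw_timeline {
    uint64_t *signaled_seq;    /* kernel page the hw block writes to */
};

struct pipe_timeline {
    wait_queue_head_t wait_queue;
};

struct fence {
    struct hw_timeline *hw_timeline;
    struct pipe_timeline *pipe_timeline;
    uint64_t seq_num;
};

static bool fence_seq_passed(struct fence *f)
{
    /* lockless read of the latest signaled sequence number */
    return ACCESS_ONCE(*f->hw_timeline->signaled_seq) >= f->seq_num;
}

static long pipe_timeline_wait(struct fence *f, long timeout)
{
    /* spurious wakeups are harmless, the condition is simply re-checked */
    return wait_event_timeout(f->pipe_timeline->wait_queue,
                              fence_seq_passed(f), timeout);
}

/* called by the driver owning the hw block, e.g. from its irq handler */
static void hw_timeline_signal(struct pipe_timeline *pt)
{
    wake_up_all(&pt->wait_queue);
}

The waiting side never calls into another driver; the worst that can happen is a
spurious wakeup followed by another read of the seqno page.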

> 
> > > If you actually mandate that the fence is just a seqno or similar which
> > > can be read lockless then I could register my own special callback into
> > > that waitqueue (other stuff than waking up threads is allowed) and from
> > > hard-irq context check the seqno and readd my own callback if that's not
> > > yet happened (that needs to go through some other context for hilarity).
> > 
> > Yes mandating a simple number that can be read from anywhere without lock,
> > i am pretty sure all hw can write to system page and can write a value
> > alongside their command buffer. So either you hw support reading and testing
> > value either you can do it in atomic context right before scheduling.
> 
> Imo that's a step in the wrong direction since reading a bit of system
> memory or checking a bit of irq-safe spinlock protected data or a register
> shouldn't matter. You just arbitrarily disallow that. And allowing random
> other kernel subsystems to read mmio or page mappings not under their
> control is an idea that freaks /me/ out.

Let me make this crystal clear: this must be a valid kernel page that has a
valid kernel mapping for the lifetime of the device. Hence there is no access
to mmio space or anything, just a regular kernel page. If we can not rely on
that, this is a sad world.

That being said, yes I am aware that some devices are incapable of writing to
such a page. For that dumb hardware what you need to do is have the irq handler
write to this page on behalf of the hardware. But I would like to know of any
hardware that can not write a single dword from its ring buffer.

The only tricky part in this is when the device is unloaded and the driver is
removing itself; it obviously needs to synchronize itself with anyone possibly
waiting on it and possibly reading the page. But truly this is not that hard to
solve.

So tell me, once the above is clear, what kind of scary thing can happen when
the cpu or a device _reads_ a kernel page?
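
And to be explicit about what kind of page I am talking about, here is a sketch
of how a driver could back a hw timeline with such a page. Only
dma_alloc_coherent() and dma_free_coherent() are real api here; the
hw_timeline_page naming is mine, just for illustration:

#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/types.h>

/*
 * Sketch only: one coherent page per hardware timeline. The cpu reads it
 * through the returned kernel mapping, the device writes the latest
 * signaled seqno to the dma address, and the page stays mapped for the
 * lifetime of the device.
 */
struct hw_timeline_page {
    uint64_t  *cpu_addr;    /* kernel mapping, read locklessly */
    dma_addr_t dma_addr;    /* what the hw block (or its irq handler) writes to */
};

static int hw_timeline_page_init(struct device *dev, struct hw_timeline_page *p)
{
    p->cpu_addr = dma_alloc_coherent(dev, PAGE_SIZE, &p->dma_addr, GFP_KERNEL);
    return p->cpu_addr ? 0 : -ENOMEM;
}

static void hw_timeline_page_fini(struct device *dev, struct hw_timeline_page *p)
{
    /* only after synchronizing with anyone still waiting or reading */
    dma_free_coherent(dev, PAGE_SIZE, p->cpu_addr, p->dma_addr);
}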

> 
> > > So from that pov (presuming I didn't miss anything) your proposal is
> > > identical to what we have, minor some different color choices (like where
> > > to place the callback queue).
> > 
> > No callback is the mantra here, and instead of bolting free living fence
> > to buffer object, they are associated with timeline which means you do not
> > need to go over all buffer object to know what you need to wait for.
> 
> Ok, then I guess I didn't understand that part of your the proposal. Can
> you please elaborate a bit more how you want to synchronize mulitple
> drivers accessing a dma-buf object and what piece of state we need to
> associate to the dma-buf to make this happen?

The beauty of it is that you associate zilch with the buffer. So for the existing
cs ioctl, where implicit synchronization is the rule, it enforces mandatory
synchronization across all hw timelines on which a buffer shows up :
  for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
    if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
      continue;
    for_each_process_sharing_buffer(buffer, process) {
      schedule_fence(process->implicit_timeline, cmdbuf->fence)
    }
  }

Next time another process uses the current ioctl with implicit sync it will sync
with the last fence for any shared resource. This might sound bad but truly, as it
is right now, this is already how it happens (at least for radeon).

So now, why this is better than include/linux/fence.h: because there you can only
link a single fence to a buffer, and even if you are ready to waste memory and
allow linking several fences to each buffer, you would need userspace to start
tracking, for each buffer, which fence it cares about. Basically, implementing the
android sync point (at least as I understand android sync points) is hard and
error prone, if not just impossible, with the fence code.

On the contrary, with an explicit timeline (ie once you expose the timeline to
userspace and allow scheduling things against a user timeline) you can do crazy
stuff. For a single process, for instance :
  (htimeline1, seqno1) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1, utimeline_p0, buf0)
  (htimeline2, seqno2) = dev_A_hwblock_Y_cmdbuf_ioctl(cmdbuf2, utimeline_p0, buf0)
  (htimeline3, seqno3) = dev_B_hwblock_X_cmdbuf_ioctl(cmdbuf3, utimeline_p0, buf0)
  (htimeline4, seqno4) = dev_C_hwblock_Z_cmdbuf_ioctl(cmdbuf4, utimeline_p0, buf0)
  (htimeline1, seqno5) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf5, utimeline_p0, buf0)
  timeline_sync(utimeline, (htimeline1_p0, seqno5), (htimeline3, seqno3))
  (htimeline1, seqno6) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf6, utimeline_p0, buf0)

What this allows is for cmdbuf6 to wait on cmdbuf1, cmdbuf3 and cmdbuf5 and to
completely ignore whatever cmdbuf4 is doing, as a "do not care". Truly, only
userspace knows the kind of interdependency that can exist. And cmdbuf1, cmdbuf2,
cmdbuf3, cmdbuf4 and cmdbuf5 will/can all execute concurrently.

Now for inter-process sync you need the processes to communicate with each other,
and one process can do :
  timeline_inter_sync(utimeline1_p1, utimeline1_p0, (htimeline4, seqno4))
  (htimeline1, seqno1_p1) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1_p1, utimeline_p1, buf0)

Then cmdbuf1_p1 will not execute until cmdbuf4, scheduled on htimeline4 by the
user timeline of process 0, is done. This does not care at all about any of the
other cmdbufs using buf0 in process 0.


So now, truly, I do not see how you can pretend to allow such things, which is what
android sync points seem to allow, while associating fences with buffers.

Hope this makes it clear.

Cheers,
Jérôme


> 
> Thanks, Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-13 17:07       ` Jerome Glisse
@ 2014-08-14  9:08         ` Daniel Vetter
  2014-08-14 14:23           ` Jerome Glisse
  2014-08-14  9:15         ` Maarten Lankhorst
  2014-08-14 13:16         ` Rob Clark
  2 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14  9:08 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: daniel.vetter, dri-devel, bskeggs

On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> Let me make this crystal clear this must be a valid kernel page that have a
> valid kernel mapping for the lifetime of the device. Hence there is no access
> to mmio space or anything, just a regular kernel page. If can not rely on that
> this is a sad world.
> 
> That being said, yes i am aware that some device incapacity to write to such
> a page. For those dumb hardware what you need to do is have the irq handler
> write to this page on behalf of the hardware. But i would like to know any
> hardware than can not write a single dword from its ring buffer.
> 
> The only tricky part in this, is when device is unloaded and driver is removing
> itself, it obviously need to synchronize itself with anyone possibly waiting on
> it and possibly reading. But truly this is not that hard to solve.
> 
> So tell me once the above is clear what kind of scary thing can happen when cpu
> or a device _read_ a kernel page ?

It's not the reading that's the problem, it's making sense of what you read.
In i915 we had exactly the (timeline, seqno) value pair design for fences for
a long time, and we're switching away from it since it stops working when you
have preemption and a scheduler. Or at least it gets really interesting to
interpret the data written into the page.

So I don't want to expose that to other drivers if we decided that
exposing this internally is a stupid idea.

> > 
> > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > identical to what we have, minor some different color choices (like where
> > > > to place the callback queue).
> > > 
> > > No callback is the mantra here, and instead of bolting free living fence
> > > to buffer object, they are associated with timeline which means you do not
> > > need to go over all buffer object to know what you need to wait for.
> > 
> > Ok, then I guess I didn't understand that part of your the proposal. Can
> > you please elaborate a bit more how you want to synchronize mulitple
> > drivers accessing a dma-buf object and what piece of state we need to
> > associate to the dma-buf to make this happen?
> 
> Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where
> the implicit synchronization is the rule it enforce mandatory synchronization
> accross all hw timeline on which a buffer shows up :
>   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
>     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
>       continue;
>     for_each_process_sharing_buffer(buffer, process) {
>       schedule_fence(process->implicit_timeline, cmdbuf->fence)
>     }
>   }
> 
> Next time another process use current ioctl with implicit sync it will synch with
> the last fence for any shared resource. This might sounds bad but truely as it is
> right now this is already how it happens (at least for radeon).

Well i915 is a lot better than that. And I'm not going to implement some
special-case for dma-buf shared buffers just because radeon sucks and
wants to enforce that suckage on everyone else.

So let's cut this short: If you absolutely insist I guess we could ditch
the callback stuff from fences, but I really don't see the problem with
radeon just not using that and then being happy. We can easily implement a
bit of insulation code _just_ for radeon so that the only thing radeon
does is wake up a process (which then calls the callback if it's something
special).

Otoh I don't care about what ttm and radeon do, for i915 the important
stuff is integration with android syncpts and being able to do explicit
fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
I expect that those patches will land in 3.18, at least the internal
integration.

It would be cool if we could get tear-free optimus working on desktop
linux, but that flat out doesn't pay my bills here. So I think I'll let
you guys figure this out yourself.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-13 17:07       ` Jerome Glisse
  2014-08-14  9:08         ` Daniel Vetter
@ 2014-08-14  9:15         ` Maarten Lankhorst
  2014-08-14 11:53           ` Christian König
  2014-08-14 14:09           ` Jerome Glisse
  2014-08-14 13:16         ` Rob Clark
  2 siblings, 2 replies; 39+ messages in thread
From: Maarten Lankhorst @ 2014-08-14  9:15 UTC (permalink / raw)
  To: Jerome Glisse, Daniel Vetter; +Cc: daniel.vetter, dri-devel, bskeggs

Op 13-08-14 om 19:07 schreef Jerome Glisse:
> On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
>> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
>>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
>>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
>>>>> Hi,
>>>>>
>>>>> So i want over the whole fence and sync point stuff as it's becoming a pressing
>>>>> issue. I think we first need to agree on what is the problem we want to solve
>>>>> and what would be the requirements to solve it.
>>>>>
>>>>> Problem :
>>>>>   Explicit synchronization btw different hardware block over a buffer object.
>>>>>
>>>>> Requirements :
>>>>>   Share common infrastructure.
>>>>>   Allow optimal hardware command stream scheduling accross hardware block.
>>>>>   Allow android sync point to be implemented on top of it.
>>>>>   Handle/acknowledge exception (like good old gpu lockup).
>>>>>   Minimize driver changes.
>>>>>
>>>>> Glossary :
>>>>>   hardware timeline: timeline bound to a specific hardware block.
>>>>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>>>                      point on that timeline can be a composite of several
>>>>>                      different hardware pipeline point.
>>>>>   pipeline: abstract object representing userspace application graphic pipeline
>>>>>             of each of the application graphic operations.
>>>>>   fence: specific point in a timeline where synchronization needs to happen.
>>>>>
>>>>>
>>>>> So now, current include/linux/fence.h implementation is i believe missing the
>>>>> objective by confusing hardware and pipeline timeline and by bolting fence to
>>>>> buffer object while what is really needed is true and proper timeline for both
>>>>> hardware and pipeline. But before going further down that road let me look at
>>>>> things and explain how i see them.
>>>> fences can be used free-standing and no one forces you to integrate them
>>>> with buffers. We actually plan to go this way with the intel svm stuff.
>>>> Ofc for dma-buf the plan is to synchronize using such fences, but that's
>>>> somewhat orthogonal I think. At least you only talk about fences and
>>>> timeline and not dma-buf here.
>>>>  
>>>>> Current ttm fence have one and a sole purpose, allow synchronization for buffer
>>>>> object move even thought some driver like radeon slightly abuse it and use them
>>>>> for things like lockup detection.
>>>>>
>>>>> The new fence want to expose an api that would allow some implementation of a
>>>>> timeline. For that it introduces callback and some hard requirement on what the
>>>>> driver have to expose :
>>>>>   enable_signaling
>>>>>   [signaled]
>>>>>   wait
>>>>>
>>>>> Each of those have to do work inside the driver to which the fence belongs and
>>>>> each of those can be call more or less from unexpected (with restriction like
>>>>> outside irq) context. So we end up with thing like :
>>>>>
>>>>>  Process 1              Process 2                   Process 3
>>>>>  I_A_schedule(fence0)
>>>>>                         CI_A_F_B_signaled(fence0)
>>>>>                                                     I_A_signal(fence0)
>>>>>                                                     CI_B_F_A_callback(fence0)
>>>>>                         CI_A_F_B_wait(fence0)
>>>>> Lexique:
>>>>> I_x  in driver x (I_A == in driver A)
>>>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>>>>>
>>>>> So this is an happy mess everyone call everyone and this bound to get messy.
>>>>> Yes i know there is all kind of requirement on what happen once a fence is
>>>>> signaled. But those requirement only looks like they are trying to atone any
>>>>> mess that can happen from the whole callback dance.
>>>>>
>>>>> While i was too seduced by the whole callback idea long time ago, i think it is
>>>>> a highly dangerous path to take where the combinatorial of what could happen
>>>>> are bound to explode with the increase in the number of players.
>>>>>
>>>>>
>>>>> So now back to how to solve the problem we are trying to address. First i want
>>>>> to make an observation, almost all GPU that exist today have a command ring
>>>>> on to which userspace command buffer are executed and inside the command ring
>>>>> you can do something like :
>>>>>
>>>>>   if (condition) execute_command_buffer else skip_command_buffer
>>>>>
>>>>> where condition is a simple expression (memory_address cop value)) with cop one
>>>>> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
>>>>> that any gpu that slightly matter can do that. Those who can not should fix
>>>>> there command ring processor.
>>>>>
>>>>>
>>>>> With that in mind, i think proper solution is implementing timeline and having
>>>>> fence be a timeline object with a way simpler api. For each hardware timeline
>>>>> driver provide a system memory address at which the lastest signaled fence
>>>>> sequence number can be read. Each fence object is uniquely associated with
>>>>> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
>>>>> queue.
>>>>>
>>>>> When scheduling something that require synchronization on a hardware timeline
>>>>> a fence is created and associated with the pipeline timeline and hardware
>>>>> timeline. Other hardware block that need to wait on a fence can use there
>>>>> command ring conditional execution to directly check the fence sequence from
>>>>> the other hw block so you do optimistic scheduling. If optimistic scheduling
>>>>> fails (which would be reported by hw block specific solution and hidden) then
>>>>> things can fallback to software cpu wait inside what could be considered the
>>>>> kernel thread of the pipeline timeline.
>>>>>
>>>>>
>>>>> From api point of view there is no inter-driver call. All the driver needs to
>>>>> do is wakeup the pipeline timeline wait_queue when things are signaled or
>>>>> when things go sideway (gpu lockup).
>>>>>
>>>>>
>>>>> So how to implement that with current driver ? Well easy. Currently we assume
>>>>> implicit synchronization so all we need is an implicit pipeline timeline per
>>>>> userspace process (note this do not prevent inter process synchronization).
>>>>> Everytime a command buffer is submitted it is added to the implicit timeline
>>>>> with the simple fence object :
>>>>>
>>>>> struct fence {
>>>>>   struct list_head   list_hwtimeline;
>>>>>   struct list_head   list_pipetimeline;
>>>>>   struct hw_timeline *hw_timeline;
>>>>>   uint64_t           seq_num;
>>>>>   work_t             timedout_work;
>>>>>   void               *csdata;
>>>>> };
>>>>>
>>>>> So with set of helper function call by each of the driver command execution
>>>>> ioctl you have the implicit timeline that is properly populated and each
>>>>> dirver command execution get the dependency from the implicit timeline.
>>>>>
>>>>>
>>>>> Of course to take full advantages of all flexibilities this could offer we
>>>>> would need to allow userspace to create pipeline timeline and to schedule
>>>>> against the pipeline timeline of there choice. We could create file for
>>>>> each of the pipeline timeline and have file operation to wait/query
>>>>> progress.
>>>>>
>>>>> Note that the gpu lockup are considered exceptional event, the implicit
>>>>> timeline will probably want to continue on other job on other hardware
>>>>> block but the explicit one probably will want to decide wether to continue
>>>>> or abort or retry without the fault hw block.
>>>>>
>>>>>
>>>>> I realize i am late to the party and that i should have taken a serious
>>>>> look at all this long time ago. I apologize for that and if you consider
>>>>> this is to late then just ignore me modulo the big warning the crazyness
>>>>> that callback will introduce an how bad things bound to happen. I am not
>>>>> saying that bad things can not happen with what i propose just that
>>>>> because everything happen inside the process context that is the one
>>>>> asking/requiring synchronization there will be not interprocess kernel
>>>>> callback (a callback that was registered by one process and that is call
>>>>> inside another process time slice because fence signaling is happening
>>>>> inside this other process time slice).
>>>> So I read through it all and presuming I understand it correctly your
>>>> proposal and what we currently have is about the same. The big difference
>>>> is that you make a timeline a first-class object and move the callback
>>>> queue from the fence to the timeline, which requires callers to check the
>>>> fence/seqno/whatever themselves instead of pushing that responsibility to
>>>> callers.
>>> No, big difference is that there is no callback thus when waiting for a
>>> fence you are either inside the process context that need to wait for it
>>> or your inside a kernel thread process context. Which means in both case
>>> you can do whatever you want. What i hate about the fence code as it is,
>>> is the callback stuff, because you never know into which context fence
>>> are signaled then you never know into which context callback are executed.
>> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
>> The only difference is that you're allowed to have spurious wakeups and
>> need to handle that somehow, so need a separate check function.
> No this not how wait queue are implemented, ie wait queue do not callback a
> random function from a random driver, it callback a limited set of function
> from core linux kernel scheduler so that the process thread that was waiting
> and out of the scheduler list is added back and marked as something that
> should be schedule. Unless this part of the kernel drasticly changed for the
> worse recently.
>
> So this is fundamentaly different, fence as they are now allow random driver
> callback and this is bound to get ugly this is bound to lead to one driver
> doing something that seems innocuous but turn out to break heavoc when call
> from some other driver function.
No, really, look closer.

fence_default_wait adds fence_default_wait_cb as a callback, which wakes up the waiting thread when the fence gets signaled.
The callback calls wake_up_state, which calls try_to_wake_up.

default_wake_function, which is used by wait queues, does something similar: it calls try_to_wake_up.

Fence now has some additional checks, but originally it was implemented as a wait queue.
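
To illustrate, here is a rough, untested sketch of what that boils down to. The
sleeper_* names are mine; only the fence_* calls, wake_up_state() and the
scheduling primitives are the real api, so treat the details as an approximation:

#include <linux/fence.h>
#include <linux/kernel.h>
#include <linux/sched.h>

struct sleeper_cb {
    struct fence_cb base;
    struct task_struct *task;
};

/* the "callback" does nothing but wake the task, exactly like
 * default_wake_function() does for an ordinary wait queue */
static void sleeper_wake(struct fence *fence, struct fence_cb *cb)
{
    struct sleeper_cb *s = container_of(cb, struct sleeper_cb, base);

    wake_up_state(s->task, TASK_NORMAL);    /* ends up in try_to_wake_up() */
}

static long wait_like_a_waitqueue(struct fence *fence, long timeout)
{
    struct sleeper_cb s = { .task = current };
    long ret = timeout;

    /* returns -ENOENT if the fence already signaled */
    if (fence_add_callback(fence, &s.base, sleeper_wake))
        return ret;

    while (ret > 0) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (fence_is_signaled(fence))
            break;
        ret = schedule_timeout(ret);
    }
    __set_current_state(TASK_RUNNING);

    fence_remove_callback(fence, &s.base);
    return ret;
}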

But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function (qxl).
Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau
doesn't use wait queues, but rolls its own.

Fences also don't imply implicit sync, you can use explicit sync if you want.

I posted a patch for this, but if you want to create an android userspace fence, call

struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)


I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.
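
For completeness, roughly how I would expect a driver to hand such a fence to
userspace as an fd. This is a sketch, not code from the patch: sync_fence_create_dma
is the new function, the rest is the existing staging android sync api as far as I
recall it (including the header location), so double-check before copying:

#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/fence.h>
#include <linux/file.h>
#include "../staging/android/sync.h"    /* path is an assumption */

/* wrap a driver fence in an android sync_fence and return an fd that
 * userspace can poll/wait on; error handling kept minimal */
static int driver_fence_to_sync_fd(struct fence *f)
{
    struct sync_fence *sf;
    int fd;

    fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0)
        return fd;

    sf = sync_fence_create_dma("driver-fence", f);
    if (!sf) {
        put_unused_fd(fd);
        return -ENOMEM;
    }

    sync_fence_install(sf, fd);    /* the fd now owns the sync_fence */
    return fd;
}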

~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14  9:15         ` Maarten Lankhorst
@ 2014-08-14 11:53           ` Christian König
  2014-08-14 12:37             ` Maarten Lankhorst
  2014-08-14 14:09           ` Jerome Glisse
  1 sibling, 1 reply; 39+ messages in thread
From: Christian König @ 2014-08-14 11:53 UTC (permalink / raw)
  To: Maarten Lankhorst, Jerome Glisse, Daniel Vetter
  Cc: daniel.vetter, dri-devel, bskeggs

> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
> doesn't use wait queues, but rolls its own (nouveau).
But when all those drivers need a special wait function, how can you
still justify the common callback when a fence is signaled?

If I understood it right, the use case for this was waiting for any fence
of a list of fences from multiple drivers, but if each driver needs
special handling for its wait, how can that work reliably?

Christian.

Am 14.08.2014 um 11:15 schrieb Maarten Lankhorst:
> Op 13-08-14 om 19:07 schreef Jerome Glisse:
>> On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
>>> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
>>>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
>>>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
>>>>>> Hi,
>>>>>>
>>>>>> So i want over the whole fence and sync point stuff as it's becoming a pressing
>>>>>> issue. I think we first need to agree on what is the problem we want to solve
>>>>>> and what would be the requirements to solve it.
>>>>>>
>>>>>> Problem :
>>>>>>    Explicit synchronization btw different hardware block over a buffer object.
>>>>>>
>>>>>> Requirements :
>>>>>>    Share common infrastructure.
>>>>>>    Allow optimal hardware command stream scheduling accross hardware block.
>>>>>>    Allow android sync point to be implemented on top of it.
>>>>>>    Handle/acknowledge exception (like good old gpu lockup).
>>>>>>    Minimize driver changes.
>>>>>>
>>>>>> Glossary :
>>>>>>    hardware timeline: timeline bound to a specific hardware block.
>>>>>>    pipeline timeline: timeline bound to a userspace rendering pipeline, each
>>>>>>                       point on that timeline can be a composite of several
>>>>>>                       different hardware pipeline point.
>>>>>>    pipeline: abstract object representing userspace application graphic pipeline
>>>>>>              of each of the application graphic operations.
>>>>>>    fence: specific point in a timeline where synchronization needs to happen.
>>>>>>
>>>>>>
>>>>>> So now, current include/linux/fence.h implementation is i believe missing the
>>>>>> objective by confusing hardware and pipeline timeline and by bolting fence to
>>>>>> buffer object while what is really needed is true and proper timeline for both
>>>>>> hardware and pipeline. But before going further down that road let me look at
>>>>>> things and explain how i see them.
>>>>> fences can be used free-standing and no one forces you to integrate them
>>>>> with buffers. We actually plan to go this way with the intel svm stuff.
>>>>> Ofc for dma-buf the plan is to synchronize using such fences, but that's
>>>>> somewhat orthogonal I think. At least you only talk about fences and
>>>>> timeline and not dma-buf here.
>>>>>   
>>>>>> Current ttm fence have one and a sole purpose, allow synchronization for buffer
>>>>>> object move even thought some driver like radeon slightly abuse it and use them
>>>>>> for things like lockup detection.
>>>>>>
>>>>>> The new fence want to expose an api that would allow some implementation of a
>>>>>> timeline. For that it introduces callback and some hard requirement on what the
>>>>>> driver have to expose :
>>>>>>    enable_signaling
>>>>>>    [signaled]
>>>>>>    wait
>>>>>>
>>>>>> Each of those have to do work inside the driver to which the fence belongs and
>>>>>> each of those can be call more or less from unexpected (with restriction like
>>>>>> outside irq) context. So we end up with thing like :
>>>>>>
>>>>>>   Process 1              Process 2                   Process 3
>>>>>>   I_A_schedule(fence0)
>>>>>>                          CI_A_F_B_signaled(fence0)
>>>>>>                                                      I_A_signal(fence0)
>>>>>>                                                      CI_B_F_A_callback(fence0)
>>>>>>                          CI_A_F_B_wait(fence0)
>>>>>> Lexique:
>>>>>> I_x  in driver x (I_A == in driver A)
>>>>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
>>>>>>
>>>>>> So this is an happy mess everyone call everyone and this bound to get messy.
>>>>>> Yes i know there is all kind of requirement on what happen once a fence is
>>>>>> signaled. But those requirement only looks like they are trying to atone any
>>>>>> mess that can happen from the whole callback dance.
>>>>>>
>>>>>> While i was too seduced by the whole callback idea long time ago, i think it is
>>>>>> a highly dangerous path to take where the combinatorial of what could happen
>>>>>> are bound to explode with the increase in the number of players.
>>>>>>
>>>>>>
>>>>>> So now back to how to solve the problem we are trying to address. First i want
>>>>>> to make an observation, almost all GPU that exist today have a command ring
>>>>>> on to which userspace command buffer are executed and inside the command ring
>>>>>> you can do something like :
>>>>>>
>>>>>>    if (condition) execute_command_buffer else skip_command_buffer
>>>>>>
>>>>>> where condition is a simple expression (memory_address cop value)) with cop one
>>>>>> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
>>>>>> that any gpu that slightly matter can do that. Those who can not should fix
>>>>>> there command ring processor.
>>>>>>
>>>>>>
>>>>>> With that in mind, i think proper solution is implementing timeline and having
>>>>>> fence be a timeline object with a way simpler api. For each hardware timeline
>>>>>> driver provide a system memory address at which the lastest signaled fence
>>>>>> sequence number can be read. Each fence object is uniquely associated with
>>>>>> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
>>>>>> queue.
>>>>>>
>>>>>> When scheduling something that require synchronization on a hardware timeline
>>>>>> a fence is created and associated with the pipeline timeline and hardware
>>>>>> timeline. Other hardware block that need to wait on a fence can use there
>>>>>> command ring conditional execution to directly check the fence sequence from
>>>>>> the other hw block so you do optimistic scheduling. If optimistic scheduling
>>>>>> fails (which would be reported by hw block specific solution and hidden) then
>>>>>> things can fallback to software cpu wait inside what could be considered the
>>>>>> kernel thread of the pipeline timeline.
>>>>>>
>>>>>>
>>>>>>  From api point of view there is no inter-driver call. All the driver needs to
>>>>>> do is wakeup the pipeline timeline wait_queue when things are signaled or
>>>>>> when things go sideway (gpu lockup).
>>>>>>
>>>>>>
>>>>>> So how to implement that with current driver ? Well easy. Currently we assume
>>>>>> implicit synchronization so all we need is an implicit pipeline timeline per
>>>>>> userspace process (note this do not prevent inter process synchronization).
>>>>>> Everytime a command buffer is submitted it is added to the implicit timeline
>>>>>> with the simple fence object :
>>>>>>
>>>>>> struct fence {
>>>>>>    struct list_head   list_hwtimeline;
>>>>>>    struct list_head   list_pipetimeline;
>>>>>>    struct hw_timeline *hw_timeline;
>>>>>>    uint64_t           seq_num;
>>>>>>    work_t             timedout_work;
>>>>>>    void               *csdata;
>>>>>> };
>>>>>>
>>>>>> So with set of helper function call by each of the driver command execution
>>>>>> ioctl you have the implicit timeline that is properly populated and each
>>>>>> dirver command execution get the dependency from the implicit timeline.
>>>>>>
>>>>>>
>>>>>> Of course to take full advantages of all flexibilities this could offer we
>>>>>> would need to allow userspace to create pipeline timeline and to schedule
>>>>>> against the pipeline timeline of there choice. We could create file for
>>>>>> each of the pipeline timeline and have file operation to wait/query
>>>>>> progress.
>>>>>>
>>>>>> Note that the gpu lockup are considered exceptional event, the implicit
>>>>>> timeline will probably want to continue on other job on other hardware
>>>>>> block but the explicit one probably will want to decide wether to continue
>>>>>> or abort or retry without the fault hw block.
>>>>>>
>>>>>>
>>>>>> I realize i am late to the party and that i should have taken a serious
>>>>>> look at all this long time ago. I apologize for that and if you consider
>>>>>> this is to late then just ignore me modulo the big warning the crazyness
>>>>>> that callback will introduce an how bad things bound to happen. I am not
>>>>>> saying that bad things can not happen with what i propose just that
>>>>>> because everything happen inside the process context that is the one
>>>>>> asking/requiring synchronization there will be not interprocess kernel
>>>>>> callback (a callback that was registered by one process and that is call
>>>>>> inside another process time slice because fence signaling is happening
>>>>>> inside this other process time slice).
>>>>> So I read through it all and presuming I understand it correctly your
>>>>> proposal and what we currently have is about the same. The big difference
>>>>> is that you make a timeline a first-class object and move the callback
>>>>> queue from the fence to the timeline, which requires callers to check the
>>>>> fence/seqno/whatever themselves instead of pushing that responsibility to
>>>>> callers.
>>>> No, big difference is that there is no callback thus when waiting for a
>>>> fence you are either inside the process context that need to wait for it
>>>> or your inside a kernel thread process context. Which means in both case
>>>> you can do whatever you want. What i hate about the fence code as it is,
>>>> is the callback stuff, because you never know into which context fence
>>>> are signaled then you never know into which context callback are executed.
>>> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
>>> The only difference is that you're allowed to have spurious wakeups and
>>> need to handle that somehow, so need a separate check function.
>> No this not how wait queue are implemented, ie wait queue do not callback a
>> random function from a random driver, it callback a limited set of function
>> from core linux kernel scheduler so that the process thread that was waiting
>> and out of the scheduler list is added back and marked as something that
>> should be schedule. Unless this part of the kernel drasticly changed for the
>> worse recently.
>>
>> So this is fundamentaly different, fence as they are now allow random driver
>> callback and this is bound to get ugly this is bound to lead to one driver
>> doing something that seems innocuous but turn out to break heavoc when call
>> from some other driver function.
> No, really, look closer.
>
> fence_default_wait adds a callback to fence_default_wait_cb, which wakes up the waiting thread if the fence gets signaled.
> The callback calls wake_up_state, which calls try_to_wake up.
>
> default_wake_function, which is used by wait queues does something similar, it calls try_to_wake_up.
>
> Fence now has some additional checks, but originally it was implemented as a wait queue.
>
> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
> doesn't use wait queues, but rolls its own (nouveau).
>
> Fences also don't imply implicit sync, you can use explicit sync if you want.
>
> I posted a patch for this, but if you want to create an android userspace fence, call
>
> struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)
>
>
> I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.
>
> ~Maarten
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 11:53           ` Christian König
@ 2014-08-14 12:37             ` Maarten Lankhorst
  2014-08-14 14:31               ` Christian König
  0 siblings, 1 reply; 39+ messages in thread
From: Maarten Lankhorst @ 2014-08-14 12:37 UTC (permalink / raw)
  To: Christian König, Jerome Glisse, Daniel Vetter
  Cc: daniel.vetter, dri-devel, bskeggs

Op 14-08-14 om 13:53 schreef Christian König:
>> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
>> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
>> doesn't use wait queues, but rolls its own (nouveau).
> But when all those drivers need a special wait function how can you still justify the common callback when a fence is signaled?
>
> If I understood it right the use case for this was waiting for any fence of a list of fences from multiple drivers, but if each driver needs special handling how for it's wait how can that work reliable?
TTM doesn't rely on the callbacks. It will call .enable_signaling when .is_signaled is NULL, to make sure that fence_is_signaled returns true sooner.

QXL is completely awful; I've seen some patches to add dma-buf support, but I'll make sure it never supports importing from/exporting to other devices. This should reduce the insanity factor there.
If I understand QXL correctly, sometimes fences may never signal at all due to virt-hw bugs.

nouveau (pre nv84) has no interrupt for completed work, but it has a reliable is_signaled. So .enable_signaling only checks whether the fence is signaled here.
A custom waiting function makes sure things work correctly, and also signals all unsignaled fences for that context. I preserved the original wait from before the fence conversion.
Nouveau keeps a global list of unsignaled fences, so they will all signal eventually.
I may have to limit importing/exporting dma-bufs to other devices, or add delayed work that periodically checks all contexts for completed fences, for this to work cross-device.

nv84+ has a sane interrupt, so I use it. :-)

Radeon, with the fence conversion, has the delayed lockup-handling work, which also checks for completed fences.

~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-13 17:07       ` Jerome Glisse
  2014-08-14  9:08         ` Daniel Vetter
  2014-08-14  9:15         ` Maarten Lankhorst
@ 2014-08-14 13:16         ` Rob Clark
  2014-08-14 14:12           ` Jerome Glisse
  2 siblings, 1 reply; 39+ messages in thread
From: Rob Clark @ 2014-08-14 13:16 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel Vetter, dri-devel, Ben Skeggs

On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> So this is fundamentaly different, fence as they are now allow random driver
> callback and this is bound to get ugly this is bound to lead to one driver
> doing something that seems innocuous but turn out to break heavoc when call
> from some other driver function.


tbh, that seems solvable by some strict rules about what you can do in
the callback.. ie. don't do anything you couldn't do in atomic, and
don't signal another fence.. off the top of my head that seems
sufficient.

If the driver getting the callback needs to do more, then it can
always schedule a worker..

But I could certainly see the case where the driver waiting on fence
sets everything up before installing the cb and then just needs to
write one or a couple regs from the cb.
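
Something like this is what I have in mind (totally untested sketch, the my_*
names are made up; only fence_add_callback() and the usual workqueue calls are
real API):

#include <linux/fence.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_waiter {
    struct fence_cb cb;
    struct work_struct work;
};

static void my_heavy_work(struct work_struct *work)
{
    /* process context: can sleep, take mutexes, signal other fences, ... */
}

/* runs in whatever (possibly atomic) context the fence is signaled from,
 * so it only hands off to the worker */
static void my_fence_cb(struct fence *f, struct fence_cb *cb)
{
    struct my_waiter *w = container_of(cb, struct my_waiter, cb);

    schedule_work(&w->work);
}

static int my_wait_async(struct fence *f, struct my_waiter *w)
{
    INIT_WORK(&w->work, my_heavy_work);
    /* returns -ENOENT if the fence already signaled, in which case the
     * caller can just do the work synchronously */
    return fence_add_callback(f, &w->cb, my_fence_cb);
}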

BR,
-R

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14  9:15         ` Maarten Lankhorst
  2014-08-14 11:53           ` Christian König
@ 2014-08-14 14:09           ` Jerome Glisse
  1 sibling, 0 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 14:09 UTC (permalink / raw)
  To: Maarten Lankhorst; +Cc: daniel.vetter, dri-devel, bskeggs

On Thu, Aug 14, 2014 at 11:15:11AM +0200, Maarten Lankhorst wrote:
> Op 13-08-14 om 19:07 schreef Jerome Glisse:
> > On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
> >> On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
> >>> On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
> >>>> On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
> >>>>> Hi,
> >>>>>
> >>>>> So i want over the whole fence and sync point stuff as it's becoming a pressing
> >>>>> issue. I think we first need to agree on what is the problem we want to solve
> >>>>> and what would be the requirements to solve it.
> >>>>>
> >>>>> Problem :
> >>>>>   Explicit synchronization btw different hardware block over a buffer object.
> >>>>>
> >>>>> Requirements :
> >>>>>   Share common infrastructure.
> >>>>>   Allow optimal hardware command stream scheduling accross hardware block.
> >>>>>   Allow android sync point to be implemented on top of it.
> >>>>>   Handle/acknowledge exception (like good old gpu lockup).
> >>>>>   Minimize driver changes.
> >>>>>
> >>>>> Glossary :
> >>>>>   hardware timeline: timeline bound to a specific hardware block.
> >>>>>   pipeline timeline: timeline bound to a userspace rendering pipeline, each
> >>>>>                      point on that timeline can be a composite of several
> >>>>>                      different hardware pipeline point.
> >>>>>   pipeline: abstract object representing userspace application graphic pipeline
> >>>>>             of each of the application graphic operations.
> >>>>>   fence: specific point in a timeline where synchronization needs to happen.
> >>>>>
> >>>>>
> >>>>> So now, current include/linux/fence.h implementation is i believe missing the
> >>>>> objective by confusing hardware and pipeline timeline and by bolting fence to
> >>>>> buffer object while what is really needed is true and proper timeline for both
> >>>>> hardware and pipeline. But before going further down that road let me look at
> >>>>> things and explain how i see them.
> >>>> fences can be used free-standing and no one forces you to integrate them
> >>>> with buffers. We actually plan to go this way with the intel svm stuff.
> >>>> Ofc for dma-buf the plan is to synchronize using such fences, but that's
> >>>> somewhat orthogonal I think. At least you only talk about fences and
> >>>> timeline and not dma-buf here.
> >>>>  
> >>>>> Current ttm fence have one and a sole purpose, allow synchronization for buffer
> >>>>> object move even thought some driver like radeon slightly abuse it and use them
> >>>>> for things like lockup detection.
> >>>>>
> >>>>> The new fence want to expose an api that would allow some implementation of a
> >>>>> timeline. For that it introduces callback and some hard requirement on what the
> >>>>> driver have to expose :
> >>>>>   enable_signaling
> >>>>>   [signaled]
> >>>>>   wait
> >>>>>
> >>>>> Each of those have to do work inside the driver to which the fence belongs and
> >>>>> each of those can be call more or less from unexpected (with restriction like
> >>>>> outside irq) context. So we end up with thing like :
> >>>>>
> >>>>>  Process 1              Process 2                   Process 3
> >>>>>  I_A_schedule(fence0)
> >>>>>                         CI_A_F_B_signaled(fence0)
> >>>>>                                                     I_A_signal(fence0)
> >>>>>                                                     CI_B_F_A_callback(fence0)
> >>>>>                         CI_A_F_B_wait(fence0)
> >>>>> Lexique:
> >>>>> I_x  in driver x (I_A == in driver A)
> >>>>> CI_x_F_y call in driver X from driver Y (CI_A_F_B call in driver A from driver B)
> >>>>>
> >>>>> So this is an happy mess everyone call everyone and this bound to get messy.
> >>>>> Yes i know there is all kind of requirement on what happen once a fence is
> >>>>> signaled. But those requirement only looks like they are trying to atone any
> >>>>> mess that can happen from the whole callback dance.
> >>>>>
> >>>>> While i was too seduced by the whole callback idea long time ago, i think it is
> >>>>> a highly dangerous path to take where the combinatorial of what could happen
> >>>>> are bound to explode with the increase in the number of players.
> >>>>>
> >>>>>
> >>>>> So now back to how to solve the problem we are trying to address. First i want
> >>>>> to make an observation, almost all GPU that exist today have a command ring
> >>>>> on to which userspace command buffer are executed and inside the command ring
> >>>>> you can do something like :
> >>>>>
> >>>>>   if (condition) execute_command_buffer else skip_command_buffer
> >>>>>
> >>>>> where condition is a simple expression (memory_address cop value)) with cop one
> >>>>> of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption
> >>>>> that any gpu that slightly matter can do that. Those who can not should fix
> >>>>> there command ring processor.
> >>>>>
> >>>>>
> >>>>> With that in mind, i think proper solution is implementing timeline and having
> >>>>> fence be a timeline object with a way simpler api. For each hardware timeline
> >>>>> driver provide a system memory address at which the lastest signaled fence
> >>>>> sequence number can be read. Each fence object is uniquely associated with
> >>>>> both a hardware and a pipeline timeline. Each pipeline timeline have a wait
> >>>>> queue.
> >>>>>
> >>>>> When scheduling something that require synchronization on a hardware timeline
> >>>>> a fence is created and associated with the pipeline timeline and hardware
> >>>>> timeline. Other hardware block that need to wait on a fence can use there
> >>>>> command ring conditional execution to directly check the fence sequence from
> >>>>> the other hw block so you do optimistic scheduling. If optimistic scheduling
> >>>>> fails (which would be reported by hw block specific solution and hidden) then
> >>>>> things can fallback to software cpu wait inside what could be considered the
> >>>>> kernel thread of the pipeline timeline.
> >>>>>
> >>>>>
> >>>>> From api point of view there is no inter-driver call. All the driver needs to
> >>>>> do is wakeup the pipeline timeline wait_queue when things are signaled or
> >>>>> when things go sideway (gpu lockup).
> >>>>>
> >>>>>
> >>>>> So how to implement that with current driver ? Well easy. Currently we assume
> >>>>> implicit synchronization so all we need is an implicit pipeline timeline per
> >>>>> userspace process (note this do not prevent inter process synchronization).
> >>>>> Everytime a command buffer is submitted it is added to the implicit timeline
> >>>>> with the simple fence object :
> >>>>>
> >>>>> struct fence {
> >>>>>   struct list_head   list_hwtimeline;
> >>>>>   struct list_head   list_pipetimeline;
> >>>>>   struct hw_timeline *hw_timeline;
> >>>>>   uint64_t           seq_num;
> >>>>>   work_t             timedout_work;
> >>>>>   void               *csdata;
> >>>>> };
> >>>>>
> >>>>> So with set of helper function call by each of the driver command execution
> >>>>> ioctl you have the implicit timeline that is properly populated and each
> >>>>> dirver command execution get the dependency from the implicit timeline.
> >>>>>
> >>>>>
> >>>>> Of course to take full advantages of all flexibilities this could offer we
> >>>>> would need to allow userspace to create pipeline timeline and to schedule
> >>>>> against the pipeline timeline of there choice. We could create file for
> >>>>> each of the pipeline timeline and have file operation to wait/query
> >>>>> progress.
> >>>>>
> >>>>> Note that the gpu lockup are considered exceptional event, the implicit
> >>>>> timeline will probably want to continue on other job on other hardware
> >>>>> block but the explicit one probably will want to decide wether to continue
> >>>>> or abort or retry without the fault hw block.
> >>>>>
> >>>>>
> >>>>> I realize i am late to the party and that i should have taken a serious
> >>>>> look at all this long time ago. I apologize for that and if you consider
> >>>>> this is to late then just ignore me modulo the big warning the crazyness
> >>>>> that callback will introduce an how bad things bound to happen. I am not
> >>>>> saying that bad things can not happen with what i propose just that
> >>>>> because everything happen inside the process context that is the one
> >>>>> asking/requiring synchronization there will be not interprocess kernel
> >>>>> callback (a callback that was registered by one process and that is call
> >>>>> inside another process time slice because fence signaling is happening
> >>>>> inside this other process time slice).
> >>>> So I read through it all and presuming I understand it correctly your
> >>>> proposal and what we currently have is about the same. The big difference
> >>>> is that you make a timeline a first-class object and move the callback
> >>>> queue from the fence to the timeline, which requires callers to check the
> >>>> fence/seqno/whatever themselves instead of pushing that responsibility to
> >>>> callers.
> >>> No, big difference is that there is no callback thus when waiting for a
> >>> fence you are either inside the process context that need to wait for it
> >>> or your inside a kernel thread process context. Which means in both case
> >>> you can do whatever you want. What i hate about the fence code as it is,
> >>> is the callback stuff, because you never know into which context fence
> >>> are signaled then you never know into which context callback are executed.
> >> Look at waitqueues a bit closer. They're implemented with callbacks ;-)
> >> The only difference is that you're allowed to have spurious wakeups and
> >> need to handle that somehow, so need a separate check function.
> > No this not how wait queue are implemented, ie wait queue do not callback a
> > random function from a random driver, it callback a limited set of function
> > from core linux kernel scheduler so that the process thread that was waiting
> > and out of the scheduler list is added back and marked as something that
> > should be schedule. Unless this part of the kernel drasticly changed for the
> > worse recently.
> >
> > So this is fundamentaly different, fence as they are now allow random driver
> > callback and this is bound to get ugly this is bound to lead to one driver
> > doing something that seems innocuous but turn out to break heavoc when call
> > from some other driver function.
> No, really, look closer.
> 
> fence_default_wait adds a callback to fence_default_wait_cb, which wakes up the waiting thread if the fence gets signaled.
> The callback calls wake_up_state, which calls try_to_wake up.
> 
> default_wake_function, which is used by wait queues does something similar, it calls try_to_wake_up.
> 
> Fence now has some additional checks, but originally it was implemented as a wait queue.
> 
> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
> doesn't use wait queues, but rolls its own (nouveau).

This is not how I read the code. Yes, the default_wait helper uses a wait_queue
like implementation, but that is not what you encourage drivers to implement.
If I understand correctly, you want drivers to call fence_signal, which walks
the list of callbacks and calls each of them. That is the issue: the model you
encourage people to follow is callback based. Yes, a driver can work around it,
but then why have a common api if each driver ends up working around it in its
own way? That makes no sense.
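
To make the distinction concrete, here is a minimal sketch of the model I am
arguing for, where the only thing a driver ever does on completion is bump a
seqno and wake a wait queue. All names here are made up for illustration, this
is not an existing api:

  #include <linux/types.h>
  #include <linux/wait.h>
  #include <linux/atomic.h>

  struct hw_timeline {
      atomic64_t        signaled_seq; /* last seqno known to be done */
      wait_queue_head_t wait_queue;   /* waiters sleep here, no callbacks */
      /* init_waitqueue_head()/atomic64_set(0) at creation, omitted here */
  };

  /* Called from the driver irq handler (or a worker for dumb hw). */
  static void hw_timeline_signal(struct hw_timeline *tl, u64 seq)
  {
      atomic64_set(&tl->signaled_seq, seq);
      wake_up_all(&tl->wait_queue); /* scheduler wakeup only, nothing else */
  }

  /* Called from the process (or kernel thread) that actually needs to wait. */
  static int hw_timeline_wait(struct hw_timeline *tl, u64 seq)
  {
      return wait_event_interruptible(tl->wait_queue,
                 (u64)atomic64_read(&tl->signaled_seq) >= seq);
  }

The check runs in the waiter's own context, so whatever the waiter wants to do
afterwards happens there, never in some other driver's signaling path.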


> Fences also don't imply implicit sync, you can use explicit sync if you want.

I do understand that. Now understand this: associating fences with buffer objects
is wrong for explicit sync, it is broken for explicit sync, it is problematic for
explicit sync. It really is, think about it. You do not care about the last fence
on a given buffer; you care about some operation you sent to one hw block relative
to some other operation you want to execute on some other hw block.

That is the whole problem android sync points are trying to solve, at least as I
understand it.

So associating fences with buffer objects is wrong: it makes the code complex, it
adds the overhead of going over every buffer that needs to be fenced, and it brings
complex locking and fence lifetime management with it.

> 
> I posted a patch for this, but if you want to create an android userspace fence, call
> 
> struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)

What I said is that exposing fences to userspace is wrong; exposing a timeline,
with sync points on that timeline, is I believe the right way to do it.
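
Roughly, what I have in mind for the userspace side is the sketch below; every
name and ioctl number here is hypothetical, it is only meant to show the shape
of a timeline/sync-point uapi, not an existing interface:

  #include <linux/types.h>
  #include <linux/ioctl.h>

  /* A pipeline timeline is a file descriptor; a sync point is just a
   * position (u64) on that timeline.
   */
  struct timeline_create {
      char  name[32];
      __s32 fd;         /* out: fd of the new pipeline timeline */
      __u32 pad;
  };

  struct timeline_wait {
      __u64 point;      /* block until the timeline reaches this point */
      __u64 timeout_ns;
  };

  #define TIMELINE_IOC_CREATE _IOWR('T', 0x00, struct timeline_create)
  #define TIMELINE_IOC_WAIT   _IOW('T', 0x01, struct timeline_wait)

Command submission ioctls would then take (timeline fd, point) pairs as
dependencies and report back the point they schedule on the caller's timeline,
instead of handing out one fence object per submission.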

> 
> I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.

Like I said, feel free to ignore what I am saying here. I am just saying that
this is ugly, that it will lead to issues, and that it does not allow the full
flexibility you can get with explicit sync (I am not saying you cannot do
explicit sync, I am saying you can only do a small number of explicit sync
patterns).

> ~Maarten
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 13:16         ` Rob Clark
@ 2014-08-14 14:12           ` Jerome Glisse
  2014-08-14 15:58             ` Daniel Vetter
  2014-08-14 22:11             ` Rob Clark
  0 siblings, 2 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 14:12 UTC (permalink / raw)
  To: Rob Clark; +Cc: Daniel Vetter, dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
> On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > So this is fundamentaly different, fence as they are now allow random driver
> > callback and this is bound to get ugly this is bound to lead to one driver
> > doing something that seems innocuous but turn out to break heavoc when call
> > from some other driver function.
> 
> 
> tbh, that seems solvable by some strict rules about what you can do in
> the callback.. ie. don't do anything you couldn't do in atomic, and
> don't signal another fence.. off the top of my head that seems
> sufficient.
> 
> If the driver getting the callback needs to do more, then it can
> always schedule a worker..
> 
> But I could certainly see the case where the driver waiting on fence
> sets everything up before installing the cb and then just needs to
> write one or a couple regs from the cb.

Yes, sane code will do sane things; sadly I fear we cannot enforce sane code
everywhere, especially with out-of-tree drivers, and I would rather force their
hand by only allowing sane implementations. Providing a callback api obviously
allows them to do crazy stuff.

> 
> BR,
> -R

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14  9:08         ` Daniel Vetter
@ 2014-08-14 14:23           ` Jerome Glisse
  2014-08-14 15:55             ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 14:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: daniel.vetter, dri-devel, bskeggs

On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
> On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> > Let me make this crystal clear this must be a valid kernel page that have a
> > valid kernel mapping for the lifetime of the device. Hence there is no access
> > to mmio space or anything, just a regular kernel page. If can not rely on that
> > this is a sad world.
> > 
> > That being said, yes i am aware that some device incapacity to write to such
> > a page. For those dumb hardware what you need to do is have the irq handler
> > write to this page on behalf of the hardware. But i would like to know any
> > hardware than can not write a single dword from its ring buffer.
> > 
> > The only tricky part in this, is when device is unloaded and driver is removing
> > itself, it obviously need to synchronize itself with anyone possibly waiting on
> > it and possibly reading. But truly this is not that hard to solve.
> > 
> > So tell me once the above is clear what kind of scary thing can happen when cpu
> > or a device _read_ a kernel page ?
> 
> It's not reading it, it's making sense of what you read. In i915 we had
> exactly the (timeline, seqno) value pair design for fences for a long
> time, and we're switching away from it since it stops working when you
> have preemption and scheduler. Or at least it gets really interesting to
> interpret the data written into the page.
> 
> So I don't want to expose that to other drivers if we decided that
> exposing this internally is a stupid idea.

I am well aware of that, but context scheduling really is what the timeline I am
talking about is. The user timeline should be considered like a single cpu thread
onto which operations involving different hw blocks are scheduled. You can have
the hw switch from one timeline to another, and a seqno is only meaningful per
timeline.

Preemption and scheduling are bound to happen on the gpu, and we will want to
integrate with the core scheduler to manage the time slices allocated to each
process, but we need the concept of a thread in which operations on the same hw
block are processed in a linear fashion while still allowing concurrency with
other hw blocks.
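
As a small sketch of that model (structure names are hypothetical), the hw
block consumes work from many user timelines and completion is only ever
interpreted per timeline, never globally:

  #include <linux/types.h>
  #include <linux/list.h>

  struct user_timeline {
      u64              next_seq;     /* next point handed out at submission */
      u64              signaled_seq; /* last point retired on this timeline */
      struct list_head pending;      /* operations queued on this timeline */
  };

  /* The hw (or its scheduler) can switch between timelines at will; a
   * seqno only means something relative to the timeline it came from.
   */
  static bool user_timeline_point_done(struct user_timeline *tl, u64 seq)
  {
      return tl->signaled_seq >= seq;
  }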

> 
> > > 
> > > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > > identical to what we have, minor some different color choices (like where
> > > > > to place the callback queue).
> > > > 
> > > > No callback is the mantra here, and instead of bolting free living fence
> > > > to buffer object, they are associated with timeline which means you do not
> > > > need to go over all buffer object to know what you need to wait for.
> > > 
> > > Ok, then I guess I didn't understand that part of your the proposal. Can
> > > you please elaborate a bit more how you want to synchronize mulitple
> > > drivers accessing a dma-buf object and what piece of state we need to
> > > associate to the dma-buf to make this happen?
> > 
> > Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where
> > the implicit synchronization is the rule it enforce mandatory synchronization
> > accross all hw timeline on which a buffer shows up :
> >   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
> >     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
> >       continue;
> >     for_each_process_sharing_buffer(buffer, process) {
> >       schedule_fence(process->implicit_timeline, cmdbuf->fence)
> >     }
> >   }
> > 
> > Next time another process use current ioctl with implicit sync it will synch with
> > the last fence for any shared resource. This might sounds bad but truely as it is
> > right now this is already how it happens (at least for radeon).
> 
> Well i915 is a lot better than that. And I'm not going to implement some
> special-case for dma-buf shared buffers just because radeon sucks and
> wants to enforce that suckage on everyone else.

I guess I am having a hard time expressing myself. What I am saying here is that
implicit synchronization sucks because it has to assume the worst case, and that
is what the current code does; I am sure Intel is doing something similar with
today's code.

Explicit synchronization allows more flexibility, but the fence code as it is
designed does not allow going fully down that line, because it associates fences
with buffer objects, which is the biggest shortcoming of implicit sync.

> 
> So let's cut this short: If you absolutely insist I guess we could ditch
> the callback stuff from fences, but I really don't see the problem with
> radeon just not using that and then being happy. We can easily implement a
> bit of insulation code _just_ for radeon so that the only thing radeon
> does is wake up a process (which then calls the callback if it's something
> special).

Like I said, feel free to ignore me. I just genuinely want to have the best
solution inside the linux kernel, and I do think that fences with callbacks and
buffer association are not that solution. I have tried to explain why, but I
might be failing at it or missing something.

> Otoh I don't care about what ttm and radeon do, for i915 the important
> stuff is integration with android syncpts and being able to do explicit
> fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
> I expect that those patches will land in 3.18, at least the internal
> integration.
> 
> It would be cool if we could get tear-free optimus working on desktop
> linux, but that flat out doesn't pay my bills here. So I think I'll let
> you guys figure this out yourself.

Sad to learn that Intel no longer has any interest in the linux desktop.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 12:37             ` Maarten Lankhorst
@ 2014-08-14 14:31               ` Christian König
  0 siblings, 0 replies; 39+ messages in thread
From: Christian König @ 2014-08-14 14:31 UTC (permalink / raw)
  To: Maarten Lankhorst, Jerome Glisse, Daniel Vetter
  Cc: daniel.vetter, dri-devel, bskeggs

Am 14.08.2014 um 14:37 schrieb Maarten Lankhorst:
> Op 14-08-14 om 13:53 schreef Christian König:
>>> But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl)
>>> Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??).  And nouveau
>>> doesn't use wait queues, but rolls its own (nouveau).
>> But when all those drivers need a special wait function how can you still justify the common callback when a fence is signaled?
>>
>> If I understood it right the use case for this was waiting for any fence of a list of fences from multiple drivers, but if each driver needs special handling how for it's wait how can that work reliable?
> TTM doesn't rely on the callbacks. It will call .enable_signaling when .is_signaled is NULL, to make sure that fence_is_signaled returns true sooner.
>
> QXL is completely awful, I've seen some patches to add dma-buf support but I'll make sure it never supports importing from/exporting to other devices. This should reduce insanity factor there.
So if I understand you right, QXL doesn't really implement the whole
fence interface? It just implements enough to make TTM happy, and you
forbid DMA-buf support because the rest isn't really working?

Sorry, but that sounds like your fence design just isn't doing the
right thing here.

> If I understand QXL correctly, sometimes fences may never signal at all due to virt-hw bugs.
Radeon has the same problem: with the hardware scheduler each client
essentially has its own fence sequence number range. If you kill a
client, the remaining fences don't necessarily get signaled by the hardware.

>
> nouveau (pre nv84) has no interrupt for completed work, but it has a reliable is_signaled. So .enable_signaling only checks if fence is signaled here.
> A custom waiting function makes sure things work correctly, and also signals all unsignaled fences for that context. I preserved the original wait from before the fence conversion.
> Nouveau keeps a global list of unsignaled fences, so they will all signal eventually.
> I may have to limit importing/exporting dma-buf's to other devices, or add delayed work that periodically checks all contexts for completed fences for this to work cross-device.
>
> nv84+ has a sane interrupt, so I use it. :-)
>
> Radeon with fence conversion has the delayed work for handling lockups that also checks.

I still haven't had time to check that completely, but it sounds more and
more like those fallbacks for interrupts that never fired should be in the
common fence code and not in each individual driver.

Christian.

>
> ~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 14:23           ` Jerome Glisse
@ 2014-08-14 15:55             ` Daniel Vetter
  2014-08-14 18:18               ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 15:55 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: daniel.vetter, dri-devel, bskeggs

On Thu, Aug 14, 2014 at 10:23:30AM -0400, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> > > Let me make this crystal clear this must be a valid kernel page that have a
> > > valid kernel mapping for the lifetime of the device. Hence there is no access
> > > to mmio space or anything, just a regular kernel page. If can not rely on that
> > > this is a sad world.
> > > 
> > > That being said, yes i am aware that some device incapacity to write to such
> > > a page. For those dumb hardware what you need to do is have the irq handler
> > > write to this page on behalf of the hardware. But i would like to know any
> > > hardware than can not write a single dword from its ring buffer.
> > > 
> > > The only tricky part in this, is when device is unloaded and driver is removing
> > > itself, it obviously need to synchronize itself with anyone possibly waiting on
> > > it and possibly reading. But truly this is not that hard to solve.
> > > 
> > > So tell me once the above is clear what kind of scary thing can happen when cpu
> > > or a device _read_ a kernel page ?
> > 
> > It's not reading it, it's making sense of what you read. In i915 we had
> > exactly the (timeline, seqno) value pair design for fences for a long
> > time, and we're switching away from it since it stops working when you
> > have preemption and scheduler. Or at least it gets really interesting to
> > interpret the data written into the page.
> > 
> > So I don't want to expose that to other drivers if we decided that
> > exposing this internally is a stupid idea.
> 
> I am well aware of that, but context scheduling really is what the timeline i
> talk about is. The user timeline should be consider like a single cpu thread on
> to which operation involving different hw are scheduled. You can have the hw
> switch from one timeline to another and a seqno is only meaningful per timeline.
> 
> The whole preemption and scheduling is something bound to happen on gpu and we
> will want to integrate with core scheduler to manage time slice allocated to
> process, but the we need the concept of thread in which operation on same hw
> block are processed in a linear fashion but still allowing concurrency with
> other hw block.

Well yeah, it's just that a gpu thread doesn't have a lot to do with a cpu
process, at least in the current drm world. It would be nice to make that a
bit more cross-driver for management, but we didn't even manage to pull
that off for gpu memory. So I don't think it'll happen anytime soonish.

> > > > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > > > identical to what we have, minor some different color choices (like where
> > > > > > to place the callback queue).
> > > > > 
> > > > > No callback is the mantra here, and instead of bolting free living fence
> > > > > to buffer object, they are associated with timeline which means you do not
> > > > > need to go over all buffer object to know what you need to wait for.
> > > > 
> > > > Ok, then I guess I didn't understand that part of your the proposal. Can
> > > > you please elaborate a bit more how you want to synchronize mulitple
> > > > drivers accessing a dma-buf object and what piece of state we need to
> > > > associate to the dma-buf to make this happen?
> > > 
> > > Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where
> > > the implicit synchronization is the rule it enforce mandatory synchronization
> > > accross all hw timeline on which a buffer shows up :
> > >   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
> > >     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
> > >       continue;
> > >     for_each_process_sharing_buffer(buffer, process) {
> > >       schedule_fence(process->implicit_timeline, cmdbuf->fence)
> > >     }
> > >   }
> > > 
> > > Next time another process use current ioctl with implicit sync it will synch with
> > > the last fence for any shared resource. This might sounds bad but truely as it is
> > > right now this is already how it happens (at least for radeon).
> > 
> > Well i915 is a lot better than that. And I'm not going to implement some
> > special-case for dma-buf shared buffers just because radeon sucks and
> > wants to enforce that suckage on everyone else.
> 
> I guess i am having hard time to express myself, what i am saying here is that
> implicit synchronization sucks because it has to assume the worst case and this
> is what current code does, and i am sure intel is doing something similar with
> today code.
> 
> Explicit synchronization allow more flexibility but fence code as it is designed
> does not allow to fully do down that line. By associating fence to buffer object
> which is the biggest shortcoming of implicit sync.

I don't see how your statement that implicit sync sucks maps to the
reality of i915, where it doesn't. The only thing you can't do is shared
buffer objects that you suballocate. And userspace is always allowed to
lie about the access mode (if it lies too badly it will simply read the
zero page), so it can forgo implicit sync.

> > So let's cut this short: If you absolutely insist I guess we could ditch
> > the callback stuff from fences, but I really don't see the problem with
> > radeon just not using that and then being happy. We can easily implement a
> > bit of insulation code _just_ for radeon so that the only thing radeon
> > does is wake up a process (which then calls the callback if it's something
> > special).
> 
> Like i said feel free to ignore me. I am just genuinely want to have the best
> solution inside the linux kernel and i do think that fence and callback and
> buffer association is not that solution. I tried to explain why but i might
> be failing or missing something.
> 
> > Otoh I don't care about what ttm and radeon do, for i915 the important
> > stuff is integration with android syncpts and being able to do explicit
> > fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
> > I expect that those patches will land in 3.18, at least the internal
> > integration.
> > 
> > It would be cool if we could get tear-free optimus working on desktop
> > linux, but that flat out doesn't pay my bills here. So I think I'll let
> > you guys figure this out yourself.
> 
> Sad to learn that Intel no longer have any interest in the linux desktop.

Well the current Linux DRI stack works fairly well on Linux with Intel I
think, so I don't think anyone can claim we don't care. It only starts
to suck if you mix in prime.

And as a matter of fact the current Linux DRI model uses implicit sync, so
that's what dma-buf currently implements. Of course we can implement an
explicit sync gpu model (and it will come for at least opencl and similar
stuff), and probably also for interactions with v4l, but I don't see any
patches to rewrite the current X stack we have.

And yeah we kinda optimized that one a bit on Intel since we actually care
about X ;-)

What I'll stop caring about is how exactly radeon integrates into this,
not the linux desktop in general.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 14:12           ` Jerome Glisse
@ 2014-08-14 15:58             ` Daniel Vetter
  2014-08-14 18:26               ` Jerome Glisse
  2014-08-14 22:11             ` Rob Clark
  1 sibling, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 15:58 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel Vetter, dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 10:12:06AM -0400, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
> > On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > > So this is fundamentaly different, fence as they are now allow random driver
> > > callback and this is bound to get ugly this is bound to lead to one driver
> > > doing something that seems innocuous but turn out to break heavoc when call
> > > from some other driver function.
> > 
> > 
> > tbh, that seems solvable by some strict rules about what you can do in
> > the callback.. ie. don't do anything you couldn't do in atomic, and
> > don't signal another fence.. off the top of my head that seems
> > sufficient.
> > 
> > If the driver getting the callback needs to do more, then it can
> > always schedule a worker..
> > 
> > But I could certainly see the case where the driver waiting on fence
> > sets everything up before installing the cb and then just needs to
> > write one or a couple regs from the cb.
> 
> Yes sane code will do sane things, sadly i fear we can not enforce sane
> code everywhere especialy with out of tree driver and i would rather
> force there hand to only allow sane implementation. Providing call back
> api obviously allows them to do crazy stuff.

Well then don't support out of tree drivers. Fairly easy problem really,
and last time I checked "out of tree drivers suck" isn't a valid
objection for upstream code ... It's kinda assumed that they all do, it's
why we have staging after all.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 15:55             ` Daniel Vetter
@ 2014-08-14 18:18               ` Jerome Glisse
  2014-08-14 18:47                 ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 18:18 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: daniel.vetter, dri-devel, bskeggs

On Thu, Aug 14, 2014 at 05:55:51PM +0200, Daniel Vetter wrote:
> On Thu, Aug 14, 2014 at 10:23:30AM -0400, Jerome Glisse wrote:
> > On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
> > > > Let me make this crystal clear this must be a valid kernel page that have a
> > > > valid kernel mapping for the lifetime of the device. Hence there is no access
> > > > to mmio space or anything, just a regular kernel page. If can not rely on that
> > > > this is a sad world.
> > > > 
> > > > That being said, yes i am aware that some device incapacity to write to such
> > > > a page. For those dumb hardware what you need to do is have the irq handler
> > > > write to this page on behalf of the hardware. But i would like to know any
> > > > hardware than can not write a single dword from its ring buffer.
> > > > 
> > > > The only tricky part in this, is when device is unloaded and driver is removing
> > > > itself, it obviously need to synchronize itself with anyone possibly waiting on
> > > > it and possibly reading. But truly this is not that hard to solve.
> > > > 
> > > > So tell me once the above is clear what kind of scary thing can happen when cpu
> > > > or a device _read_ a kernel page ?
> > > 
> > > It's not reading it, it's making sense of what you read. In i915 we had
> > > exactly the (timeline, seqno) value pair design for fences for a long
> > > time, and we're switching away from it since it stops working when you
> > > have preemption and scheduler. Or at least it gets really interesting to
> > > interpret the data written into the page.
> > > 
> > > So I don't want to expose that to other drivers if we decided that
> > > exposing this internally is a stupid idea.
> > 
> > I am well aware of that, but context scheduling really is what the timeline i
> > talk about is. The user timeline should be consider like a single cpu thread on
> > to which operation involving different hw are scheduled. You can have the hw
> > switch from one timeline to another and a seqno is only meaningful per timeline.
> > 
> > The whole preemption and scheduling is something bound to happen on gpu and we
> > will want to integrate with core scheduler to manage time slice allocated to
> > process, but the we need the concept of thread in which operation on same hw
> > block are processed in a linear fashion but still allowing concurrency with
> > other hw block.
> 
> Well yeah, it's just that a gpu thread doesn't have a lot to do with a cpu
> process, at least in the current drm world. It would be nice make that a
> bit more cross-driver for management, but we didn't even manage to pull
> that of for gpu memory. So I don't think it'll happen anytime soonish.
> 
> > > > > > > So from that pov (presuming I didn't miss anything) your proposal is
> > > > > > > identical to what we have, minor some different color choices (like where
> > > > > > > to place the callback queue).
> > > > > > 
> > > > > > No callback is the mantra here, and instead of bolting free living fence
> > > > > > to buffer object, they are associated with timeline which means you do not
> > > > > > need to go over all buffer object to know what you need to wait for.
> > > > > 
> > > > > Ok, then I guess I didn't understand that part of your the proposal. Can
> > > > > you please elaborate a bit more how you want to synchronize mulitple
> > > > > drivers accessing a dma-buf object and what piece of state we need to
> > > > > associate to the dma-buf to make this happen?
> > > > 
> > > > Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where
> > > > the implicit synchronization is the rule it enforce mandatory synchronization
> > > > accross all hw timeline on which a buffer shows up :
> > > >   for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
> > > >     if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
> > > >       continue;
> > > >     for_each_process_sharing_buffer(buffer, process) {
> > > >       schedule_fence(process->implicit_timeline, cmdbuf->fence)
> > > >     }
> > > >   }
> > > > 
> > > > Next time another process use current ioctl with implicit sync it will synch with
> > > > the last fence for any shared resource. This might sounds bad but truely as it is
> > > > right now this is already how it happens (at least for radeon).
> > > 
> > > Well i915 is a lot better than that. And I'm not going to implement some
> > > special-case for dma-buf shared buffers just because radeon sucks and
> > > wants to enforce that suckage on everyone else.
> > 
> > I guess i am having hard time to express myself, what i am saying here is that
> > implicit synchronization sucks because it has to assume the worst case and this
> > is what current code does, and i am sure intel is doing something similar with
> > today code.
> > 
> > Explicit synchronization allow more flexibility but fence code as it is designed
> > does not allow to fully do down that line. By associating fence to buffer object
> > which is the biggest shortcoming of implicit sync.
> 
> I don't see how your statement that implicit sync sucks maps to the
> reality of i915 where it doesn't. The only thing you can't do is shared
> buffer objects that you suballocate. And userspace is always allowed to
> lie about the access mode (if it lies too badly it will simply read zero
> page) so can forgoe implicit sync.

It sucks because you cannot do the kind of weird synchronization I depicted in
another mail in this thread, and as long as cmdbuf_ioctl does not give you a
fence|syncpt you cannot do such a thing cleanly, in a non-hackish way.

It sucks because you have a fence object per buffer object, so the overhead grows
with the number of objects. Not even mentioning the fence lifetime issues.

It sucks because sub-buffer allocation is just one of many tricks that cannot be
achieved properly and cleanly with implicit sync.

...

Having code that works around or alleviates some of this is in no way a testimony
that it is the best solution. I do believe explicit sync to be superior in the use
cases it allows: way more freedom, while the only drawback I see is that you have
to put some trust into userspace.

So yes, implicit sync sucks, and that does map to the i915 reality as well.

> 
> > > So let's cut this short: If you absolutely insist I guess we could ditch
> > > the callback stuff from fences, but I really don't see the problem with
> > > radeon just not using that and then being happy. We can easily implement a
> > > bit of insulation code _just_ for radeon so that the only thing radeon
> > > does is wake up a process (which then calls the callback if it's something
> > > special).
> > 
> > Like i said feel free to ignore me. I am just genuinely want to have the best
> > solution inside the linux kernel and i do think that fence and callback and
> > buffer association is not that solution. I tried to explain why but i might
> > be failing or missing something.
> > 
> > > Otoh I don't care about what ttm and radeon do, for i915 the important
> > > stuff is integration with android syncpts and being able to do explicit
> > > fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and
> > > I expect that those patches will land in 3.18, at least the internal
> > > integration.
> > > 
> > > It would be cool if we could get tear-free optimus working on desktop
> > > linux, but that flat out doesn't pay my bills here. So I think I'll let
> > > you guys figure this out yourself.
> > 
> > Sad to learn that Intel no longer have any interest in the linux desktop.
> 
> Well the current Linux DRI stack works fairly well on Linux with Intel I
> think, so I don't think anyone can claim we don't care. It's only starts
> to suck if you mix in prime.
> 
> And as a matter of fact the current Linux DRI model uses implicit sync, so
> that's what dma-buf currently implements. Of course we can implement an
> explicit sync gpu model (and it will come for at least opencl and similar
> stuff), and probably also for interactions with v4l, but I don't see any
> patches to rewrite the current X stack we have.
> 
> And yeah we kinda optimized that one a bit on Intel since we actually care
> about X ;-)
> 
> What I'll stop caring about is how exactly radeon integrates into this,
> not the linux desktop in general.
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 15:58             ` Daniel Vetter
@ 2014-08-14 18:26               ` Jerome Glisse
  2014-08-14 19:16                 ` Maarten Lankhorst
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 18:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Daniel Vetter, dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 05:58:48PM +0200, Daniel Vetter wrote:
> On Thu, Aug 14, 2014 at 10:12:06AM -0400, Jerome Glisse wrote:
> > On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
> > > On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > > > So this is fundamentaly different, fence as they are now allow random driver
> > > > callback and this is bound to get ugly this is bound to lead to one driver
> > > > doing something that seems innocuous but turn out to break heavoc when call
> > > > from some other driver function.
> > > 
> > > 
> > > tbh, that seems solvable by some strict rules about what you can do in
> > > the callback.. ie. don't do anything you couldn't do in atomic, and
> > > don't signal another fence.. off the top of my head that seems
> > > sufficient.
> > > 
> > > If the driver getting the callback needs to do more, then it can
> > > always schedule a worker..
> > > 
> > > But I could certainly see the case where the driver waiting on fence
> > > sets everything up before installing the cb and then just needs to
> > > write one or a couple regs from the cb.
> > 
> > Yes sane code will do sane things, sadly i fear we can not enforce sane
> > code everywhere especialy with out of tree driver and i would rather
> > force there hand to only allow sane implementation. Providing call back
> > api obviously allows them to do crazy stuff.
> 
> Well then don't support out of tree drivers. Fairly easy problem really,
> and last time I checked "out of tree drivers suck" isn't a valid
> objections for upstream code ... It's kinda assumed that they all do, it's
> why we have staging after all.

As usual I fail at expressing my point. I am not saying do not merge this
because of out-of-tree drivers; I am saying that while we are designing an api,
let's make it sane, and try to make it enforce sanity on anything that lives
outside our realm.

And that is not even thinking outside the kernel tree: someone might come along,
start using fence, see the callback stuff and start doing crazy things with it
inside some different, obscure kernel subsystem. Then someone in graphics sees
that and uses it as justification to do crazy things too, because hey, if he is
doing it, why can't I?

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 18:18               ` Jerome Glisse
@ 2014-08-14 18:47                 ` Daniel Vetter
  2014-08-14 19:15                   ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 18:47 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> Sucks because you can not do weird synchronization like one i depicted in another
> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
> you can not such thing cleanly in non hackish way.

Actually i915 can, or soon will be able to, do that.

> Sucks because you have a fence object per buffer object and thus overhead grow
> with the number of objects. Not even mentioning fence lifetime issue.
>
> Sucks because sub-buffer allocation is just one of many tricks that can not be
> achieved properly and cleanly with implicit sync.
>
> ...

Well I heard all those reasons and I'm well aware of them. The
problem is that with current hardware the kernel needs to know for
each buffer how long it needs to be kept around since hw just can't do
page faulting. Yeah you can pin them but for an uma design that
doesn't go down well with folks.

The other problem is that the Linux Desktop I don't seem to care about
any more kinda relies on implicit syncing right now, so we better keep
that working fairly well. Of course we could dream up a shiny new
world where all of the Linux desktop does explicit syncing, but that
world doesn't exist right now. I mean really if you want to right away
throw implicit syncing overboard that doesn't bode well for the
current linux desktop.

So I don't understand all the implicit syncing bashing. It's here, and
it'll stay for a while longer whether you like it or not ...

Of course that doesn't mean we (intel here) won't support explicit
syncing too, and I really don't see a conflict here when mixing and
matching these two approaches.
-Daniel

> Having code that work around or alieviate some of this, is in no way a testimony
> that it's the best solution. I do believe explicit sync to be superior in use
> case it allows way more freedom while the only drawback i see is that you have to
> put some trust into userspace.
>
> So yes implicit sync sucks and it does map to i915 reality as well.




-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 18:47                 ` Daniel Vetter
@ 2014-08-14 19:15                   ` Jerome Glisse
  2014-08-14 19:40                     ` Maarten Lankhorst
                                       ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 19:15 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > Sucks because you can not do weird synchronization like one i depicted in another
> > mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
> > you can not such thing cleanly in non hackish way.
> 
> Actually i915 can soon will do that that.

So you will return fence|syncpoint with each cmdbuf_ioctl ?

> 
> > Sucks because you have a fence object per buffer object and thus overhead grow
> > with the number of objects. Not even mentioning fence lifetime issue.
> >
> > Sucks because sub-buffer allocation is just one of many tricks that can not be
> > achieved properly and cleanly with implicit sync.
> >
> > ...
> 
> Well I heard all those reasons and I'm well of aware of them. The
> problem is that with current hardware the kernel needs to know for
> each buffer how long it needs to be kept around since hw just can't do
> page faulting. Yeah you can pin them but for an uma design that
> doesn't go down well with folks.

I do not have fancy hw in mind; on the contrary, I thought about all of this
with the crappiest hw I could think of in mind.

Yes, you can get rid of fences and not have to pin memory with current hw.
What matters for unpinning is to know that all hw blocks are done using the
memory. This is easily achievable with your beloved seqno. Have one seqno
per driver (one driver can have different blocks: 3d, video decoding, crtc,
...); each time a buffer is used as part of a command on one block, increment
the common seqno and tag the buffer with that number. Have each hw block write
the latest seqno it has completed to a per-block location. Now, to determine
whether a buffer is done, compare the buffer seqno with the max of the signaled
seqnos of all blocks.

Cost: one uint32 per buffer and a simple compare, without locking, to check the
status of a buffer.
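
A sketch of that idea (names are made up; I am keeping one tag per hw block a
buffer was used on, which is slightly more bookkeeping than the single uint32
above but makes the idle check obviously safe even when blocks retire work at
different rates):

  #include <linux/types.h>

  #define NUM_HW_BLOCKS 4 /* 3d, dma, video decode, crtc, ... */

  /* One write-back slot per hw block; each block (or its irq handler for
   * dumb hw) stores the seqno of the last command it retired.
   */
  struct signal_page {
      u64 signaled[NUM_HW_BLOCKS];
  };

  struct buffer_tag {
      u64 last_use[NUM_HW_BLOCKS]; /* 0 == never used on that block */
  };

  /* Called at command submission; seq comes from the driver-wide counter. */
  static void buffer_tag_use(struct buffer_tag *tag, unsigned int block, u64 seq)
  {
      tag->last_use[block] = seq;
  }

  /* Lockless idle check: safe to unbind/evict once every block that ever
   * touched the buffer has retired past the buffer's last use on it.
   * (A real version would want proper memory barriers around the reads.)
   */
  static bool buffer_is_idle(const struct buffer_tag *tag,
                             const struct signal_page *sp)
  {
      unsigned int i;

      for (i = 0; i < NUM_HW_BLOCKS; i++)
          if (tag->last_use[i] > sp->signaled[i])
              return false;
      return true;
  }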

Yes, preemption and gpu scheduling would break such a scheme, but my point is
that when you have such a gpu you want to implement a proper solution, which
of course requires quite some work across the stack. So the past can live
on, but the future needs to get its act together.

> The other problem is that the Linux Desktop I don't seem to care about
> any more kinda relies on implicit syncing right now, so we better keep
> that working fairly well. Of course we could dream up a shiny new
> world where all of the Linux desktop does explicit syncing, but that
> world doesn't exist right now. I mean really if you want to right away
> throw implicit syncing overboard that doesn't bode well for the
> current linux desktop.

Again I fail at expressing myself. I am not saying throw things overboard;
I am well aware of the current reliance on implicit fencing. I am saying that
if fence wants to be this new thing that should allow explicit fencing in the
future, then it had better be done correctly in the first place.

> So I don't understand all the implicit syncing bashing. It's here, and
> it'll stay for a while longer whether you like it or not ...

I am saying this is where we are and it sucks for a number of reasons. Then,
looking at fence, I am saying it tries to go in the right direction but does
crazy things that I am convinced we will regret. In other words, if we ever get
to explicit fencing, we had better start on the right path with the right
tools. Moreover, I am saying that this can be done without breaking the
implicit sync we have today.

> Of course that doesn't mean we (intel here) won't support explicit
> syncing too, and I really don't see a conflict here when mixing and
> matching these two approaches.

Again I fail to express myself. I am not saying there is a conflict. I am
saying we had better take a path which allows going all the way with explicit
fencing while still allowing a less optimal use for an implicit sync model.

My point is that the fence code proposed here keeps the worst thing about the
implicit fencing we have today. This can be done differently, in what I believe
to be a better way, and that different approach still allows implicit sync for
existing userspace.

Cheers,
Jérôme

> -Daniel
> 
> > Having code that work around or alieviate some of this, is in no way a testimony
> > that it's the best solution. I do believe explicit sync to be superior in use
> > case it allows way more freedom while the only drawback i see is that you have to
> > put some trust into userspace.
> >
> > So yes implicit sync sucks and it does map to i915 reality as well.
> 
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 18:26               ` Jerome Glisse
@ 2014-08-14 19:16                 ` Maarten Lankhorst
  0 siblings, 0 replies; 39+ messages in thread
From: Maarten Lankhorst @ 2014-08-14 19:16 UTC (permalink / raw)
  To: Jerome Glisse, Daniel Vetter; +Cc: Daniel Vetter, dri-devel, Ben Skeggs



On 14-08-14 20:26, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 05:58:48PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 14, 2014 at 10:12:06AM -0400, Jerome Glisse wrote:
>>> On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
>>>> On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>>>> So this is fundamentaly different, fence as they are now allow random driver
>>>>> callback and this is bound to get ugly this is bound to lead to one driver
>>>>> doing something that seems innocuous but turn out to break heavoc when call
>>>>> from some other driver function.
>>>>
>>>>
>>>> tbh, that seems solvable by some strict rules about what you can do in
>>>> the callback.. ie. don't do anything you couldn't do in atomic, and
>>>> don't signal another fence.. off the top of my head that seems
>>>> sufficient.
>>>>
>>>> If the driver getting the callback needs to do more, then it can
>>>> always schedule a worker..
>>>>
>>>> But I could certainly see the case where the driver waiting on fence
>>>> sets everything up before installing the cb and then just needs to
>>>> write one or a couple regs from the cb.
>>>
>>> Yes sane code will do sane things, sadly i fear we can not enforce sane
>>> code everywhere especialy with out of tree driver and i would rather
>>> force there hand to only allow sane implementation. Providing call back
>>> api obviously allows them to do crazy stuff.
>>
>> Well then don't support out of tree drivers. Fairly easy problem really,
>> and last time I checked "out of tree drivers suck" isn't a valid
>> objections for upstream code ... It's kinda assumed that they all do, it's
>> why we have staging after all.
> 
> As usual i fail at expressing my point. I am not saying do not merge this
> because of out of tree drivers, i am saying while doing an api let make it
> sane and try to make it so that it enforce sanity to anything that lives
> outside our realm.
> 
> And not even thinking outside kernel tree, but someone might come along and
> start using fence, see the callback stuff and start doing crazy stuff with
> that all this inside a different obscur kernel subsystem. Then someone in
> graphic sees that and use that as justification to do crazy thing too,
> because hey, if he is doing why can't i ?
Since when has crazy been contagious?

And here's what stops you:
1. LOCKDEP
2. PROVE_RCU
3. rcu sparse annotations (kbuild test bot)
4. DEBUG_ATOMIC_SLEEP

~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:15                   ` Jerome Glisse
@ 2014-08-14 19:40                     ` Maarten Lankhorst
  2014-08-14 19:56                       ` Jerome Glisse
  2014-08-14 21:23                     ` Daniel Vetter
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Maarten Lankhorst @ 2014-08-14 19:40 UTC (permalink / raw)
  To: Jerome Glisse, Daniel Vetter; +Cc: dri-devel, Ben Skeggs



On 14-08-14 21:15, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> Sucks because you can not do weird synchronization like one i depicted in another
>>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
>>> you can not such thing cleanly in non hackish way.
>>
>> Actually i915 can soon will do that that.
> 
> So you will return fence|syncpoint with each cmdbuf_ioctl ?

It might, soon. There have been patches on the ml about it. It can create a userspace android fence backed by a kernel dma fence.
And it will be created like any other userspace android fence. ;-)

Yet even with that, it will continue to support the implicit sync model since they're not mutually exclusive.

I also fail to understand why you think a fence should be associated with a buffer object. It's the other way around. TTM currently requires buffer objects to be fenced off to protect against eviction.

reservation_object is used for this, and by sharing the reservation_object pointer with a dma-buf you get cross-device synchronization.

It has a bunch of helpers to make common operations easy, see include/linux/reservation.h and drivers/dma-buf/reservation.c
It also allows multiple readers simultaneously across any number of devices. I intend to use this in nouveau.

But there's no requirement to use reservation_objects apart from the fact that that's how ttm currently works, and for implicit syncing in dma-buf. If you don't need either, go crazy with fence and write your own mechanism on top of it. Although with android sync and TTM I think I handled all common usecases.

~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:40                     ` Maarten Lankhorst
@ 2014-08-14 19:56                       ` Jerome Glisse
  2014-08-14 21:20                         ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 19:56 UTC (permalink / raw)
  To: Maarten Lankhorst; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 09:40:08PM +0200, Maarten Lankhorst wrote:
> 
> 
> On 14-08-14 21:15, Jerome Glisse wrote:
> > On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
> >> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>> Sucks because you can not do weird synchronization like one i depicted in another
> >>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
> >>> you can not such thing cleanly in non hackish way.
> >>
> >> Actually i915 can soon will do that that.
> > 
> > So you will return fence|syncpoint with each cmdbuf_ioctl ?
> 
> It might, soon. There have been patches on the ml about it. It can create a
> userspace android fence backed by a kernel dma fence. And it will be created
> like any other userspace android fence. ;-)

Android fences are not, in my mind, a nice thing :)

> Yet even with that, it will continue to support the implicit sync model
> since they're not mutually exclusive.

Again I have failed to express myself. I have repeatedly said that what I
proposed was not mutually exclusive.

> I also fail to understand why you think a fence should be associated with
> a buffer object. It's the other way around. TTM currently requires buffer
> objects to be fenced off to protect against eviction.

Again I fail to properly explain myself. I said no fence should be associated
with a buffer or a dma-buf, whether shared or not. The only thing you really
need for a buffer is a seqno, and that is only useful for binding/unbinding the
object from the GART/GPU address space/... So no, I am not saying a fence should
be associated with a buffer. I am saying a fence should be associated with each
cmdbuf, and that there should be no link whatsoever between fences and buffer
objects, i.e. you do not need to store a fence pointer inside the dma-buf or any
buffer object structure.

I am well aware of how ttm works, and I am saying it can be replaced with a
simpler seqno.
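
In other words, the association I have in mind looks like this (all names
hypothetical, just to show where the fence lives and where it does not):

  #include <linux/types.h>

  struct fence;        /* the submission fence object */
  struct hw_timeline;  /* the ring/block timeline it is scheduled on */

  /* The fence belongs to the submission, not to any buffer it touches. */
  struct cmdbuf {
      struct fence       *fence;       /* one fence per cmdbuf */
      struct hw_timeline *hw_timeline; /* ring/block it is scheduled on */
      u64                 seq;         /* position on that hw timeline */
  };

  /* A buffer carries no fence pointer and no reservation object, only the
   * seqno of its last use, which is all that GART unbind/eviction needs.
   */
  struct buffer {
      u64 last_use_seq;
  };

The submission ioctl would take explicit (timeline, point) dependencies and
return the point it was scheduled at, so nothing ever has to walk the buffer
list just to figure out what to wait on.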

> 
> reservation_object is used for this, and by sharing the reservation_object
> pointer with a dma-buf you get cross-device synchronization.

My point is that dma-buf should not have a reservation_object pointer; it is
useless and counter-productive and only adds complexity. You do not care about
the buffer, you care about synchronizing cmdbufs/hw blocks with each other. A
buffer is just what those blocks consume. In other words, reservation_object
and per-buffer fences are useless and only complicate things for no added
value; on the contrary, they are not as flexible as fences associated with
cmdbufs.

> 
> It has a bunch of helpers to make common operations easy, see
> include/linux/reservation.h and drivers/dma-buf/reservation.c
> It also allows multiple readers simultaneously across any number of devices.
> I intend to use this in nouveau.

So does what I am talking about. The reservation stuff is a limiting factor
more than a helper in my eyes.

> 
> But there's no requirement to use reservation_object's apart from that's how
> ttm currently works, and for implicit syncing in dma-buf. If you don't need
> either go crazy with fence and write your own mechanism on top of fence.
> Although with android sync and TTM I think I handled all common usecases.

My point is that implicit syncing can be implemented in a saner way, one that
would also allow implementing explicit syncing with no restrictions and only
room for extra optimisation. By forcing the cpu to be involved in hw block
synchronization you are limiting what can be done, and clever hardware is
forced to use sub-optimal solutions.

> ~Maarten

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:56                       ` Jerome Glisse
@ 2014-08-14 21:20                         ` Daniel Vetter
  0 siblings, 0 replies; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 21:20 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 9:56 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> Android fence are not in my mind a nice thing :)

Well I'll have a very close look at the proposed ioctl interface on
top of these fence fds to make sure it's sane. But it will come, that's
pretty much for sure. And similar fence integration will very likely
happen in the modeset world with acquire/release fence/syncpts around
atomic updates and for vblank events. The modeset integration will btw
not require any driver changes, at least not the release fences.

And like Maarten says we'll keep that working together with implicit
syncing for the Linux Desktop ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:15                   ` Jerome Glisse
  2014-08-14 19:40                     ` Maarten Lankhorst
@ 2014-08-14 21:23                     ` Daniel Vetter
  2014-08-14 23:03                       ` Jerome Glisse
  2014-08-14 21:30                     ` Daniel Vetter
  2014-08-15  6:54                     ` Thomas Hellstrom
  3 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 21:23 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> Cost 1 uint32 per buffer and simple if without locking to check status of
> a buffer.

Yeah, well, except it doesn't, and that's why we're switching to full-blown
fence objects internally instead of smacking seqno values all over the
place. At least in i915, because I have seen what this "simple"
solution looks like if you throw all the complexities into the bin and
mix it well.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:15                   ` Jerome Glisse
  2014-08-14 19:40                     ` Maarten Lankhorst
  2014-08-14 21:23                     ` Daniel Vetter
@ 2014-08-14 21:30                     ` Daniel Vetter
  2014-08-15  6:54                     ` Thomas Hellstrom
  3 siblings, 0 replies; 39+ messages in thread
From: Daniel Vetter @ 2014-08-14 21:30 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> Yes preemption and gpu scheduling would break such scheme, but my point is
> that when you have such gpu you want to implement a proper solution. Which
> of course require quite some work accross the stack. So the past can live
> on but the future needs to get its acts together.
>
>> The other problem is that the Linux Desktop I don't seem to care about
>> any more kinda relies on implicit syncing right now, so we better keep
>> that working fairly well. Of course we could dream up a shiny new
>> world where all of the Linux desktop does explicit syncing, but that
>> world doesn't exist right now. I mean really if you want to right away
>> throw implicit syncing overboard that doesn't bode well for the
>> current linux desktop.
>
> Again i fail at expressing myself. I am saying throw things over board,
> i am well aware of the current reliance on implicit fencing. I am saying
> if fence wants to be this new thing that should allow to do explicit
> fencing in the future than it better be done correctly in the first place.
>
>> So I don't understand all the implicit syncing bashing. It's here, and
>> it'll stay for a while longer whether you like it or not ...
>
> I am saying this is where we are and it sucks for a number of reasons,
> then looking at fence and by looking at fence i am saying this try to
> go in the right direction but do crazy things that i am convince we
> will regret. In other word if we ever get to the explicit fence better
> starts on the right path with the right tool. Moreover i am saying that
> this can be done without breaking implicit sync we have today.
>
>> Of course that doesn't mean we (intel here) won't support explicit
>> syncing too, and I really don't see a conflict here when mixing and
>> matching these two approaches.
>
> Again i fail to express myself. I am not saying there is conflict. I
> am saying better take a path which allow to go full way with explicit
> fencing while still allowing a less optimal use for an implicit sync
> model.
>
> My point is the fence code proposed here, keeps the worst thing about
> implicit fencing we have today. This can be done differently, in what
> i believe to be better way. And this different approach stills allow
> to have have implicit sync for existing userspace.

Well, the problem is I can't wait for that shiny new world to finally
arrive; I kinda have to work with what's there. Which is ugly and messy,
but that's how it is.

So I'm not going to throw away piles of code just because it's ugly
and combining it with the new world is a major pain. Instead I'll
gobble down the ugly, throw testcases at the problem to make sure it
works, and ship this Frankenstein monster.

If you actually look at development stats, we've completely rewritten
i915.ko in the past 2 years (again) and will likely continue for a
while. But we do that with small (tiny) steps and not with big jumps,
because with the current pace and ongoing projects that's plainly too
disruptive. So I envy you if you can do this with radeon, but there's
no way I can do that with i915; I just have to live with it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 14:12           ` Jerome Glisse
  2014-08-14 15:58             ` Daniel Vetter
@ 2014-08-14 22:11             ` Rob Clark
  1 sibling, 0 replies; 39+ messages in thread
From: Rob Clark @ 2014-08-14 22:11 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Daniel Vetter, dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 10:12 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
>> On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > So this is fundamentaly different, fence as they are now allow random driver
>> > callback and this is bound to get ugly this is bound to lead to one driver
>> > doing something that seems innocuous but turn out to break heavoc when call
>> > from some other driver function.
>>
>>
>> tbh, that seems solvable by some strict rules about what you can do in
>> the callback.. ie. don't do anything you couldn't do in atomic, and
>> don't signal another fence.. off the top of my head that seems
>> sufficient.
>>
>> If the driver getting the callback needs to do more, then it can
>> always schedule a worker..
>>
>> But I could certainly see the case where the driver waiting on fence
>> sets everything up before installing the cb and then just needs to
>> write one or a couple regs from the cb.
>
> Yes sane code will do sane things, sadly i fear we can not enforce sane
> code everywhere especialy with out of tree driver and i would rather
> force there hand to only allow sane implementation. Providing call back
> api obviously allows them to do crazy stuff.

Callbacks are hard, let's go shopping..

But seriously, we've solved problems like this w/ various kernel debug
features before.  I am pretty sure we could make lock debugging, and
maybe some new fence debugging option, catch a lot of things.  There
is probably a better way, but a dummy spinlock around the callback, or
maybe preempt_disable()/preempt_enable() around the callback?  And we
should be able to come up with a way to catch signalling a fence from
a callback..

A bit of extra rope with a warning sign not to hang yourself, and debug
features to tell people when they are hanging themselves, seems like
the better option to me.
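
Something like the sketch below is what I have in mind; purely
illustrative, assuming the struct fence / fence_cb / fence_func_t types
from the proposed fence patches, with a made-up helper name:

/* Debug shim: run a fence callback under a dummy spinlock with irqs
 * off, so that sleeping, taking a mutex or waiting on another fence
 * from inside the callback trips might_sleep()/lockdep right away,
 * instead of only blowing up when some other driver happens to call
 * the cb from a restricted context. */
static DEFINE_SPINLOCK(fence_cb_debug_lock);

static void fence_debug_call_cb(struct fence *fence, struct fence_cb *cb,
				fence_func_t func)
{
	unsigned long flags;

	spin_lock_irqsave(&fence_cb_debug_lock, flags);
	func(fence, cb);
	spin_unlock_irqrestore(&fence_cb_debug_lock, flags);
}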

BR,
-R

>>
>> BR,
>> -R

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 21:23                     ` Daniel Vetter
@ 2014-08-14 23:03                       ` Jerome Glisse
  2014-08-15  8:07                         ` Daniel Vetter
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-14 23:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
> On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > Cost 1 uint32 per buffer and simple if without locking to check status of
> > a buffer.
> 
> Yeah well except it doesn't and that's why we switch to full blown
> fence objects internally instead of smacking seqno values all over the
> place. At least in i915 because I have seen what this "simple"
> solution looks like if you throw all the complexities into the bin and
> mix it well.

I am kind of curious what can go wrong here? Unless you have thousands
of different hw blocks inside your gpu, all of them with their own cmd
queue, I do not see any issue.

Note that this global seqno I am talking about is only useful for
bind/unbind of buffers in an ideal world of explicit sync, not for
synchronization between command buffers. So the pseudo code would be:

// If buffer_can_unbind(buf, dev) returns true then the buffer is no longer
// in use and can be unbound. If false, you can wait on the device wait queue.
bool buffer_can_unbind(buf, dev)
{
  // Is the last gseqno associated with the buffer smaller than the current
  // smallest signaled seqno?
  if (buf->gseqno <= dev->gseqno_cmin)
    return true;
  hw_update_gseqno_min(dev);
  if (buf->gseqno <= dev->gseqno_cmin)
    return true;
  for_each_hw_block(hwblock, dev) {
    // If that hwblock has not signaled a gseqno as big as the buffer's,
    // and also has work scheduled that might be using the buffer (ie the
    // last scheduled gseqno is greater than or equal to the buffer's
    // gseqno), then the buffer might still be in use, so assume the worst.
    if (hwblock->gseqno < buf->gseqno &&
        hwblock->gseqno_last_scheduled >= buf->gseqno)
      return false;
    // Otherwise this block is either already past buf->gseqno, or it has
    // nothing in its pipeline that will ever use buf.
    // As an extra optimization one might add a hwblock mask to buf and
    // clear this hwblock's bit, so further waits will not consider this
    // block as we know for sure it's not involved.
  }
  // None of the hwblocks is still using that buffer.
  return true;
}

hw_update_gseqno_min(dev)
{
   unsigned nmin = -1;

   for_each_hw_block(hwblock, dev) {
     nmin = min(nmin, hwblock->gseqno);
   }
   // Atomic compare-and-exchange: only move dev->gseqno_cmin forward,
   // ie set the new min only if it is bigger than the current one.
   atomic_cmp_xchg(dev->gseqno_cmin, nmin);
}


So this is how it looks in my mind, each hwblock write to there dedicated
>gseqno and for each hwblock you store the last gseqno that was scheduled.
There is no need for any lock and this is outside any other sync mecanism.
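
To spell out the bookkeeping the pseudo code above assumes (struct and
field names are just the ones I use above, nothing that exists today):

struct hwblock {
  u32 gseqno;                 /* last gseqno signaled by this block, written by the hw */
  u32 gseqno_last_scheduled;  /* last gseqno queued to this block, written at submit */
};

struct gdev {
  u32 gseqno;                 /* global counter, bumped for each command submission */
  u32 gseqno_cmin;            /* cached min of all hwblock->gseqno, only moved forward */
  /* plus whatever list of hwblocks for_each_hw_block() iterates */
};

struct buffer {
  u32 gseqno;                 /* gseqno of the last submission using this buffer */
};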

For hw with preemption, I assume you then have a hwcontext; you use the
same code except that now you have a gseqno per context, which means you
need to store a seqno per context per buffer. With some trickery this
might be avoided, especially if the hw can do atomic cmp_xchg.

Cheers,
Jérôme


> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 19:15                   ` Jerome Glisse
                                       ` (2 preceding siblings ...)
  2014-08-14 21:30                     ` Daniel Vetter
@ 2014-08-15  6:54                     ` Thomas Hellstrom
  2014-08-15 14:52                       ` Jerome Glisse
  3 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellstrom @ 2014-08-15  6:54 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On 08/14/2014 09:15 PM, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
>> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> Sucks because you can not do weird synchronization like one i depicted in another
>>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
>>> you can not such thing cleanly in non hackish way.
>> Actually i915 can soon will do that that.
> So you will return fence|syncpoint with each cmdbuf_ioctl ?
>
>>> Sucks because you have a fence object per buffer object and thus overhead grow
>>> with the number of objects. Not even mentioning fence lifetime issue.
>>>
>>> Sucks because sub-buffer allocation is just one of many tricks that can not be
>>> achieved properly and cleanly with implicit sync.
>>>
>>> ...
>> Well I heard all those reasons and I'm well of aware of them. The
>> problem is that with current hardware the kernel needs to know for
>> each buffer how long it needs to be kept around since hw just can't do
>> page faulting. Yeah you can pin them but for an uma design that
>> doesn't go down well with folks.
> I am not thinking with fancy hw in mind, on contrary i thought about all
> this with the crappiest hw i could think of, in mind.
>
> Yes you can get rid of fence and not have to pin memory with current hw.
> What matter for unpinning is to know that all hw block are done using the
> memory. This is easily achievable with your beloved seqno. Have one seqno
> per driver (one driver can have different block 3d, video decoding, crtc,
> ...) each time a buffer is use as part of a command on one block inc the
> common seqno and tag the buffer with that number. Have each hw block write
> the lastest seqno that is done to a per block location. Now to determine
> is buffer is done compare the buffer seqno with the max of all the signaled
> seqno of all blocks.
>
> Cost 1 uint32 per buffer and simple if without locking to check status of
> a buffer.

Hmm?
The trivial and first use of fence objects in the linux DRM was triggered
by the fact that a 32-bit seqno wraps pretty quickly, and a 32-bit
solution just can't be made robust.
Now a 64-bit seqno will probably be robust for the foreseeable future,
but once you have to implement that on 32-bit hardware I'm not sure it
compares that favourably to a simple fence object approach.

/Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-14 23:03                       ` Jerome Glisse
@ 2014-08-15  8:07                         ` Daniel Vetter
  2014-08-15 14:53                           ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Daniel Vetter @ 2014-08-15  8:07 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On Thu, Aug 14, 2014 at 07:03:44PM -0400, Jerome Glisse wrote:
> On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > > Cost 1 uint32 per buffer and simple if without locking to check status of
> > > a buffer.
> > 
> > Yeah well except it doesn't and that's why we switch to full blown
> > fence objects internally instead of smacking seqno values all over the
> > place. At least in i915 because I have seen what this "simple"
> > solution looks like if you throw all the complexities into the bin and
> > mix it well.
> 
> I am kind of curious on what can go wrong here ? Unless you have thousands
> of different hw block inside your gpu all of them with different cmd queue
> then i do not see any issue.

See my other reply, but it's because we'll schedule software contexts
here, and we need to do it with implicit fencing since we can't just
break the existing world. The below is pretty much the design we
currently have.
-Daniel

> 
> Note that this global seqno i am talking is only useful for bind/unbind of
> buffer in ideal world of explicit sync, not for synchronization btw command
> buffer. So pseudo code would be :
> 
> // if buffer_can_unbind(buf, dev) return true then buffer is no longer in
> // use and can be unbind. if false you can wait on the device wait queue.
> bool buffer_can_unbind(buf, dev)
> {
>   // Is the last gseqno associated with buffer is smaller than current
>   // smallest signaled seqno ?
>   if (buf->gseqno <= dev->gseqno_cmin)
>     return true;
>   hw_update_gseqno_min(dev);
>   if (buf->gseqno <= dev->gseqno_cmin)
>     return true;
>   for_each_hw_block(hwblock, dev) {
>     // If that hwblock has not signaled a gseqno bigger than the
>     // buffer one's and also has work scheduled that might be using
>     // the buffer (ie the last scheduled gseqno is greater than
>     // buffer gseqno). If that's true than buffer might still be
>     // in use so assume the worst.
>     if (hwblock->gseqno < buf->gseqno &&
>         hwblock->gseqno_last_scheduled >= buf->gseqno)
>       return false;
>     // This means either this block is already past the buf->gseqno
>     // or that this block have nothing in its pipeline that will ever
>     // use buf.
>     // As an extra optimization one might add a hwblock mask to buf
>     // and unmask wait for that hwblock so further wait will not wait
>     // on this block as we know for sure it's not involve.
>   }
>   // Well none of the hwblock is still using that buffer.
>   return true;
> }
> 
> hw_update_gseqno_min(dev)
> {
>    unsigned nmin = -1;
> 
>    for_each_hw_block(hwblock, dev) {
>      nmin = min(nmin, hwblock->gseqno);
>    }
>    // atomic exchange and compage set new min only if it's bigger then
>    // current min
>    atomic_cmp_xchg(dev->gseqno_cmin, nmin)
> }
> 
> 
> So this is how it looks in my mind, each hwblock write to there dedicated
> >gseqno and for each hwblock you store the last gseqno that was scheduled.
> There is no need for any lock and this is outside any other sync mecanism.
> 
> For hw with preemption, i assume you have then have a hwcontext, well you
> use same code except that now you have a gseqno per context which means
> you need to store a seqno per context per buffer. With some trickery this
> might be avoided especialy if hw can do atomic cmp_xchg.
> 
> Cheers,
> Jérôme
> 
> 
> > -Daniel
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-15  6:54                     ` Thomas Hellstrom
@ 2014-08-15 14:52                       ` Jerome Glisse
  2014-08-16  7:01                         ` Thomas Hellstrom
  0 siblings, 1 reply; 39+ messages in thread
From: Jerome Glisse @ 2014-08-15 14:52 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, Ben Skeggs

On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
> On 08/14/2014 09:15 PM, Jerome Glisse wrote:
> > On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
> >> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>> Sucks because you can not do weird synchronization like one i depicted in another
> >>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
> >>> you can not such thing cleanly in non hackish way.
> >> Actually i915 can soon will do that that.
> > So you will return fence|syncpoint with each cmdbuf_ioctl ?
> >
> >>> Sucks because you have a fence object per buffer object and thus overhead grow
> >>> with the number of objects. Not even mentioning fence lifetime issue.
> >>>
> >>> Sucks because sub-buffer allocation is just one of many tricks that can not be
> >>> achieved properly and cleanly with implicit sync.
> >>>
> >>> ...
> >> Well I heard all those reasons and I'm well of aware of them. The
> >> problem is that with current hardware the kernel needs to know for
> >> each buffer how long it needs to be kept around since hw just can't do
> >> page faulting. Yeah you can pin them but for an uma design that
> >> doesn't go down well with folks.
> > I am not thinking with fancy hw in mind, on contrary i thought about all
> > this with the crappiest hw i could think of, in mind.
> >
> > Yes you can get rid of fence and not have to pin memory with current hw.
> > What matter for unpinning is to know that all hw block are done using the
> > memory. This is easily achievable with your beloved seqno. Have one seqno
> > per driver (one driver can have different block 3d, video decoding, crtc,
> > ...) each time a buffer is use as part of a command on one block inc the
> > common seqno and tag the buffer with that number. Have each hw block write
> > the lastest seqno that is done to a per block location. Now to determine
> > is buffer is done compare the buffer seqno with the max of all the signaled
> > seqno of all blocks.
> >
> > Cost 1 uint32 per buffer and simple if without locking to check status of
> > a buffer.
> 
> Hmm?
> The trivial and first use of fence objects in the linux DRM was
> triggered by the fact that a
> 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be
> made robust.
> Now a 64-bit seqno will probably be robust for forseeable future, but
> when it comes to implement that on 32-bit hardware and compare it to a
> simple fence object approach,

Using the same kind of arithmetic as is used for jiffies would do it,
provided there is a check that we never let an object pass above a
certain age.
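
A minimal sketch of such wrap-safe comparison, mirroring the
time_after() trick used for jiffies (helper names are illustrative,
not an existing API):

/* True if seqno a is after (newer than) seqno b, tolerating u32 wrap,
 * as long as the two values are less than 2^31 apart (ie no object is
 * allowed to grow older than half the seqno space). */
static inline bool gseqno_after(u32 a, u32 b)
{
	return (s32)(a - b) > 0;
}

#define gseqno_before(a, b)	gseqno_after(b, a)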

> 
> /Thomas
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-15  8:07                         ` Daniel Vetter
@ 2014-08-15 14:53                           ` Jerome Glisse
  0 siblings, 0 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-15 14:53 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, Ben Skeggs

On Fri, Aug 15, 2014 at 10:07:39AM +0200, Daniel Vetter wrote:
> On Thu, Aug 14, 2014 at 07:03:44PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > > > Cost 1 uint32 per buffer and simple if without locking to check status of
> > > > a buffer.
> > > 
> > > Yeah well except it doesn't and that's why we switch to full blown
> > > fence objects internally instead of smacking seqno values all over the
> > > place. At least in i915 because I have seen what this "simple"
> > > solution looks like if you throw all the complexities into the bin and
> > > mix it well.
> > 
> > I am kind of curious on what can go wrong here ? Unless you have thousands
> > of different hw block inside your gpu all of them with different cmd queue
> > then i do not see any issue.
> 
> See my other reply, but because we'll schedule software contexts here and
> we need to do it with implicit fencing because we can't just break the
> existing world. The below is pretty much the design we currently have.

Software needs the object to be bound to the gpu gart on intel? That's
bad. But even with such a thing, you could do it as a secondary lock
only for user space.

> -Daniel
> 
> > 
> > Note that this global seqno i am talking is only useful for bind/unbind of
> > buffer in ideal world of explicit sync, not for synchronization btw command
> > buffer. So pseudo code would be :
> > 
> > // if buffer_can_unbind(buf, dev) return true then buffer is no longer in
> > // use and can be unbind. if false you can wait on the device wait queue.
> > bool buffer_can_unbind(buf, dev)
> > {
> >   // Is the last gseqno associated with buffer is smaller than current
> >   // smallest signaled seqno ?
> >   if (buf->gseqno <= dev->gseqno_cmin)
> >     return true;
> >   hw_update_gseqno_min(dev);
> >   if (buf->gseqno <= dev->gseqno_cmin)
> >     return true;
> >   for_each_hw_block(hwblock, dev) {
> >     // If that hwblock has not signaled a gseqno bigger than the
> >     // buffer one's and also has work scheduled that might be using
> >     // the buffer (ie the last scheduled gseqno is greater than
> >     // buffer gseqno). If that's true than buffer might still be
> >     // in use so assume the worst.
> >     if (hwblock->gseqno < buf->gseqno &&
> >         hwblock->gseqno_last_scheduled >= buf->gseqno)
> >       return false;
> >     // This means either this block is already past the buf->gseqno
> >     // or that this block have nothing in its pipeline that will ever
> >     // use buf.
> >     // As an extra optimization one might add a hwblock mask to buf
> >     // and unmask wait for that hwblock so further wait will not wait
> >     // on this block as we know for sure it's not involve.
> >   }
> >   // Well none of the hwblock is still using that buffer.
> >   return true;
> > }
> > 
> > hw_update_gseqno_min(dev)
> > {
> >    unsigned nmin = -1;
> > 
> >    for_each_hw_block(hwblock, dev) {
> >      nmin = min(nmin, hwblock->gseqno);
> >    }
> >    // atomic exchange and compage set new min only if it's bigger then
> >    // current min
> >    atomic_cmp_xchg(dev->gseqno_cmin, nmin)
> > }
> > 
> > 
> > So this is how it looks in my mind, each hwblock write to there dedicated
> > >gseqno and for each hwblock you store the last gseqno that was scheduled.
> > There is no need for any lock and this is outside any other sync mecanism.
> > 
> > For hw with preemption, i assume you have then have a hwcontext, well you
> > use same code except that now you have a gseqno per context which means
> > you need to store a seqno per context per buffer. With some trickery this
> > might be avoided especialy if hw can do atomic cmp_xchg.
> > 
> > Cheers,
> > Jérôme
> > 
> > 
> > > -Daniel
> > > -- 
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-15 14:52                       ` Jerome Glisse
@ 2014-08-16  7:01                         ` Thomas Hellstrom
  2014-08-16 15:30                           ` Jerome Glisse
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Hellstrom @ 2014-08-16  7:01 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: dri-devel, Ben Skeggs

On 08/15/2014 04:52 PM, Jerome Glisse wrote:
> On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
>> On 08/14/2014 09:15 PM, Jerome Glisse wrote:
>>> On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
>>>> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>>>> Sucks because you can not do weird synchronization like one i depicted in another
>>>>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
>>>>> you can not such thing cleanly in non hackish way.
>>>> Actually i915 can soon will do that that.
>>> So you will return fence|syncpoint with each cmdbuf_ioctl ?
>>>
>>>>> Sucks because you have a fence object per buffer object and thus overhead grow
>>>>> with the number of objects. Not even mentioning fence lifetime issue.
>>>>>
>>>>> Sucks because sub-buffer allocation is just one of many tricks that can not be
>>>>> achieved properly and cleanly with implicit sync.
>>>>>
>>>>> ...
>>>> Well I heard all those reasons and I'm well of aware of them. The
>>>> problem is that with current hardware the kernel needs to know for
>>>> each buffer how long it needs to be kept around since hw just can't do
>>>> page faulting. Yeah you can pin them but for an uma design that
>>>> doesn't go down well with folks.
>>> I am not thinking with fancy hw in mind, on contrary i thought about all
>>> this with the crappiest hw i could think of, in mind.
>>>
>>> Yes you can get rid of fence and not have to pin memory with current hw.
>>> What matter for unpinning is to know that all hw block are done using the
>>> memory. This is easily achievable with your beloved seqno. Have one seqno
>>> per driver (one driver can have different block 3d, video decoding, crtc,
>>> ...) each time a buffer is use as part of a command on one block inc the
>>> common seqno and tag the buffer with that number. Have each hw block write
>>> the lastest seqno that is done to a per block location. Now to determine
>>> is buffer is done compare the buffer seqno with the max of all the signaled
>>> seqno of all blocks.
>>>
>>> Cost 1 uint32 per buffer and simple if without locking to check status of
>>> a buffer.
>> Hmm?
>> The trivial and first use of fence objects in the linux DRM was
>> triggered by the fact that a
>> 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be
>> made robust.
>> Now a 64-bit seqno will probably be robust for forseeable future, but
>> when it comes to implement that on 32-bit hardware and compare it to a
>> simple fence object approach,
> Using same kind of arithemic as use for jiffies would do it provided that
> there is a checking that we never let someobject pass above a certain age.

But wouldn't the search-for-max scheme break if blocks complete out of
order?

/Thomas

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Fence, timeline and android sync points
  2014-08-16  7:01                         ` Thomas Hellstrom
@ 2014-08-16 15:30                           ` Jerome Glisse
  0 siblings, 0 replies; 39+ messages in thread
From: Jerome Glisse @ 2014-08-16 15:30 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, Ben Skeggs

On Sat, Aug 16, 2014 at 09:01:27AM +0200, Thomas Hellstrom wrote:
> On 08/15/2014 04:52 PM, Jerome Glisse wrote:
> > On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
> >> On 08/14/2014 09:15 PM, Jerome Glisse wrote:
> >>> On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
> >>>> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>>>> Sucks because you can not do weird synchronization like one i depicted in another
> >>>>> mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt
> >>>>> you can not such thing cleanly in non hackish way.
> >>>> Actually i915 can soon will do that that.
> >>> So you will return fence|syncpoint with each cmdbuf_ioctl ?
> >>>
> >>>>> Sucks because you have a fence object per buffer object and thus overhead grow
> >>>>> with the number of objects. Not even mentioning fence lifetime issue.
> >>>>>
> >>>>> Sucks because sub-buffer allocation is just one of many tricks that can not be
> >>>>> achieved properly and cleanly with implicit sync.
> >>>>>
> >>>>> ...
> >>>> Well I heard all those reasons and I'm well of aware of them. The
> >>>> problem is that with current hardware the kernel needs to know for
> >>>> each buffer how long it needs to be kept around since hw just can't do
> >>>> page faulting. Yeah you can pin them but for an uma design that
> >>>> doesn't go down well with folks.
> >>> I am not thinking with fancy hw in mind, on contrary i thought about all
> >>> this with the crappiest hw i could think of, in mind.
> >>>
> >>> Yes you can get rid of fence and not have to pin memory with current hw.
> >>> What matter for unpinning is to know that all hw block are done using the
> >>> memory. This is easily achievable with your beloved seqno. Have one seqno
> >>> per driver (one driver can have different block 3d, video decoding, crtc,
> >>> ...) each time a buffer is use as part of a command on one block inc the
> >>> common seqno and tag the buffer with that number. Have each hw block write
> >>> the lastest seqno that is done to a per block location. Now to determine
> >>> is buffer is done compare the buffer seqno with the max of all the signaled
> >>> seqno of all blocks.
> >>>
> >>> Cost 1 uint32 per buffer and simple if without locking to check status of
> >>> a buffer.
> >> Hmm?
> >> The trivial and first use of fence objects in the linux DRM was
> >> triggered by the fact that a
> >> 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be
> >> made robust.
> >> Now a 64-bit seqno will probably be robust for forseeable future, but
> >> when it comes to implement that on 32-bit hardware and compare it to a
> >> simple fence object approach,
> > Using same kind of arithemic as use for jiffies would do it provided that
> > there is a checking that we never let someobject pass above a certain age.
> 
> But wouldn't the search-for-max scheme break if blocks complete out of
> order?

Seems I used the wrong word in my email; the pseudo code is correct
however, it's not max, it's min. By knowing the minimum global seqno
across all hw blocks you know that, at the very least, every hw block is
past that point, which means the order in which blocks complete is
irrelevant. But it also means that you might delay the unbinding more
than needed, though the pseudo code shows a way to alleviate and
minimize this delay.
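
For illustration, the conservative floor computation could look like
this, folding in the wrap-safe comparison sketched earlier in the
thread (all names come from those sketches, nothing existing):

/* Return a gseqno that every hw block is guaranteed to have passed,
 * regardless of the order in which the blocks complete. */
static u32 gseqno_signaled_floor(dev)
{
	u32 floor = 0;
	bool first = true;

	for_each_hw_block(hwblock, dev) {
		u32 s = hwblock->gseqno;

		if (first || gseqno_before(s, floor)) {
			floor = s;
			first = false;
		}
	}
	return floor;
}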

It's doable; I had it kind of working a couple of years ago when we were
reworking fences in the radeon driver, but I never got around to
finishing something clean, because implicit sync needs the reservation
scheme, which means you kind of need a fence per buffer (well, getting
rid of that is where things get complicated and painful).

Explicit sync is just so much nicer, but we made the decision long ago
to go with implicit sync, another thing I am deeply regretting now.

Cheers,
Jérôme

> 
> /Thomas
> 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2014-08-16 15:32 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-12 22:13 Fence, timeline and android sync points Jerome Glisse
2014-08-13  1:23 ` Jerome Glisse
2014-08-13  7:59 ` Christian König
2014-08-13 13:41   ` Jerome Glisse
2014-08-13 14:08     ` Christian König
2014-08-13 15:56       ` Jerome Glisse
2014-08-13  8:28 ` Daniel Vetter
2014-08-13 13:36   ` Jerome Glisse
2014-08-13 15:54     ` Daniel Vetter
2014-08-13 17:07       ` Jerome Glisse
2014-08-14  9:08         ` Daniel Vetter
2014-08-14 14:23           ` Jerome Glisse
2014-08-14 15:55             ` Daniel Vetter
2014-08-14 18:18               ` Jerome Glisse
2014-08-14 18:47                 ` Daniel Vetter
2014-08-14 19:15                   ` Jerome Glisse
2014-08-14 19:40                     ` Maarten Lankhorst
2014-08-14 19:56                       ` Jerome Glisse
2014-08-14 21:20                         ` Daniel Vetter
2014-08-14 21:23                     ` Daniel Vetter
2014-08-14 23:03                       ` Jerome Glisse
2014-08-15  8:07                         ` Daniel Vetter
2014-08-15 14:53                           ` Jerome Glisse
2014-08-14 21:30                     ` Daniel Vetter
2014-08-15  6:54                     ` Thomas Hellstrom
2014-08-15 14:52                       ` Jerome Glisse
2014-08-16  7:01                         ` Thomas Hellstrom
2014-08-16 15:30                           ` Jerome Glisse
2014-08-14  9:15         ` Maarten Lankhorst
2014-08-14 11:53           ` Christian König
2014-08-14 12:37             ` Maarten Lankhorst
2014-08-14 14:31               ` Christian König
2014-08-14 14:09           ` Jerome Glisse
2014-08-14 13:16         ` Rob Clark
2014-08-14 14:12           ` Jerome Glisse
2014-08-14 15:58             ` Daniel Vetter
2014-08-14 18:26               ` Jerome Glisse
2014-08-14 19:16                 ` Maarten Lankhorst
2014-08-14 22:11             ` Rob Clark
