* Discuss the multi-core media scheduler
@ 2024-04-28 18:26 Daniel Almeida
  2024-04-30 16:46 ` Nicolas Dufresne
  2024-05-03  3:25 ` Laurent Pinchart
  0 siblings, 2 replies; 6+ messages in thread
From: Daniel Almeida @ 2024-04-28 18:26 UTC (permalink / raw)
  To: Hans Verkuil, Laurent Pinchart, Nicolas Dufresne, Mauro Carvalho Chehab
  Cc: Linux Media Mailing List

Hi everyone,

There seem to be a few unsolved problems in the mem2mem framework, one of
which is the lack of support for architectures with multiple heterogeneous
cores. For example, it is currently impossible to describe Mediatek's LAT and
CORE cores to the framework as two independent units to be scheduled. This means
that, at all times, one unit is idle while the other one is working.

I know that this is not the only problem with m2m, but it is where I'd like to
start the discussion. Feel free to add your own requirements to the thread.

My proposed solution is to add a new iteration of mem2mem, which I have named
the Multi-core Media Scheduler, for lack of a better term.

Please note that I will use the terms input/output queues in place of
output/capture for the sake of readability.

-------------------------------------------------------------------------------

The basic idea is to have a core as the basic entity to be scheduled, with its
own input and output VB2 queues. This default will be identical to what we have
today in m2m.

 input        output
<----- core ----->

In all cases, this will be the only interface that the framework will expose to
the outside world. The complexity to handle multiple cores will be hidden from
callers. This will also allow us to keep the implementation compatible with
the current mem2mem interfaces, which expose only two queues.

To support multiple cores, each core can connect to another core to establish a
data dependency, in which case they will communicate through a new type of
queue, here described as "shared".

 input           shared         output
<----- core0 -------> core1 ------>

This arrangement is basically an extension of the mem2mem idea, like so:

mem2mem2mem2mem

...with as many links as there are cores.

The key idea is that now, cores can be scheduled independently through a call
to schedule(core_number, work) to indicate that they should start processing
the work. They can also be marked as idle independently through a
job_done(core_number) call.

It will be the driver's responsibility to describe the pipeline to the
framework, indicating how cores are connected. The driver will also have to
implement the logic for schedule() and job_done() for a given core.

Queuing buffers into the framework's input queue will push the work into the
pipeline. Whenever a job is done, the framework will push the job into the
queue that is shared with the downstream core and attempt to schedule it. It
will also attempt to pull a work item from the upstream queue.

When the job is processed by the last core in the pipeline, it will be marked
as done and pushed into the framework's output queue.

At all times, a buffer should have an owner, and the framework will ensure that
cores cannot touch buffers belonging to other cores.

This workflow can be expanded to account for a group of identical cores, here
denoted as a "cluster". In such a case, each core will have its own input and
output queues:

 input       output          input       output
<---- core0 ----->          <---- core1 ---->     output
                            <---- core2 ---->    ------->
                             input       output
Ideally, the framework will dispatch work from the output queue with the most
items to the input queue with the fewest items, to balance the load. This way,
clusters and cores can compose to describe complex
architectures.

Of course, this is a rough sketch, and there are lots of unexplained minutiae to
sort out, but I hope that the general idea is enough to get a discussion going.

-- Daniel



* Re: Discuss the multi-core media scheduler
  2024-04-28 18:26 Discuss the multi-core media scheduler Daniel Almeida
@ 2024-04-30 16:46 ` Nicolas Dufresne
  2024-04-30 21:39   ` Daniel Almeida
  2024-05-03  3:25 ` Laurent Pinchart
  1 sibling, 1 reply; 6+ messages in thread
From: Nicolas Dufresne @ 2024-04-30 16:46 UTC (permalink / raw)
  To: Daniel Almeida, Hans Verkuil, Laurent Pinchart, Mauro Carvalho Chehab
  Cc: Linux Media Mailing List

Hi Daniel,

Le dimanche 28 avril 2024 à 15:26 -0300, Daniel Almeida a écrit :
> Hi everyone,
> 
> There seem to be a few unsolved problems in the mem2mem framework, one of
> which is the lack of support for architectures with multiple heterogeneous
> cores. For example, it is currently impossible to describe Mediatek's LAT and
> CORE cores to the framework as two independent units to be scheduled. This means
> that, at all times, one unit is idle while the other one is working.
> 
> I know that this is not the only problem with m2m, but it is where I'd like to
> start the discussion. Feel free to add your own requirements to the thread.
> 
> My proposed solution is to add a new iteration of mem2mem, which I have named
> the Multi-core Media Scheduler for the lack of a better term.
> 
> Please note that I will use the terms input/output queues in place of
> output/capture for the sake of readability.

There is one use case that isn't covered here and that we really need in order
to move forward on the RPi4/5: cores that can execute multiple tasks at once.

In the case of the Argon HEVC decoder on the Pi, the entropy decoder and the
reconstruction run in parallel, but the two functions use the same
trigger/IRQ pair.

In short, we need to be able (if there is enough data in the vb2 queue) to
schedule two consecutive jobs at once. On a timeline:

----------------------------------------------------->
[entropy0][no decoder]
                      [entropy1][decode0]
                                         [entropy2][decode1]

Perhaps it already fits in the RFC, but it wasn't expressed clearly as a use
case. For real-time reasons, it is not really the driver's responsibility to
wait for buffers to be queued, and a no-op can happen in either of the two
functions. Also, I believe you can mix entropy decoding from one stream with
decoding a frame from another stream (another video session / m2m ctx).

Nicolas
          



* Re: Discuss the multi-core media scheduler
  2024-04-30 16:46 ` Nicolas Dufresne
@ 2024-04-30 21:39   ` Daniel Almeida
  2024-04-30 21:47     ` Daniel Almeida
  2024-05-01 18:18     ` Nicolas Dufresne
  0 siblings, 2 replies; 6+ messages in thread
From: Daniel Almeida @ 2024-04-30 21:39 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Hans Verkuil, Laurent Pinchart, Mauro Carvalho Chehab,
	Linux Media Mailing List


Hi Nicolas,


> 
> There is one use case that isn't covered here and that we really need in order
> to move forward on the RPi4/5: cores that can execute multiple tasks at once.
> 
> In the case of the Argon HEVC decoder on the Pi, the entropy decoder and the
> reconstruction run in parallel, but the two functions use the same
> trigger/IRQ pair.
> 
> In short, we need to be able (if there is enough data in the vb2 queue) to
> schedule two consecutive jobs at once. On a timeline:
> 
> ----------------------------------------------------->
> [entropy0][no decoder]
>                      [entropy1][decode0]
>                                         [entropy2][decode1]
> 
> Perhaps it already fits in the RFC, but it wasn't expressed clearly as a use
> case. For real-time reasons, it is not really the driver's responsibility to
> wait for buffers to be queued, and a no-op can happen in either of the two
> functions. Also, I believe you can mix entropy decoding from one stream with
> decoding a frame from another stream (another video session / m2m ctx).
> 
> Nicolas
> 

I assume that the cores can be programmed separately, and that you can find
out which of the two cores is now idle when processing the interrupt? I.e.,
this is effectively the same scenario we have with the Mediatek vcodec?

If so, this is already covered.

Basically, whenever a core is done with a job, that will signal the pipeline
to try to make progress.

I.e., you push `entropy0` and `entropy1` into the beginning of the pipeline,
which will cause the entropy decoder to start running. Whenever the entropy
decoder is done, it will try to schedule the reconstruction core with
`decode0` and start working on `entropy1`.

When the reconstruction core is done, it will push `decode0` to the
pipeline’s output queue and grab `decode1` (from the queue it shares with the
upstream core) to work on.

That way, all cores run concurrently, so long as there is work to do.

— Daniel


* Re: Discuss the multi-core media scheduler
  2024-04-30 21:39   ` Daniel Almeida
@ 2024-04-30 21:47     ` Daniel Almeida
  2024-05-01 18:18     ` Nicolas Dufresne
  1 sibling, 0 replies; 6+ messages in thread
From: Daniel Almeida @ 2024-04-30 21:47 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Hans Verkuil, Laurent Pinchart, Mauro Carvalho Chehab,
	Linux Media Mailing List

Ah, forgot to comment on this:

> Also,
> I believe you can mix entropy decoding from one stream, while decoding a frame
> from another stream (another video session / m2m ctx).

I don’t see how this is a problem. The current framework can already serve
multiple sessions, as you pointed out.


* Re: Discuss the multi-core media scheduler
  2024-04-30 21:39   ` Daniel Almeida
  2024-04-30 21:47     ` Daniel Almeida
@ 2024-05-01 18:18     ` Nicolas Dufresne
  1 sibling, 0 replies; 6+ messages in thread
From: Nicolas Dufresne @ 2024-05-01 18:18 UTC (permalink / raw)
  To: Daniel Almeida
  Cc: Hans Verkuil, Laurent Pinchart, Mauro Carvalho Chehab,
	Linux Media Mailing List

Le mardi 30 avril 2024 à 18:39 -0300, Daniel Almeida a écrit :
> Hi Nicolas,
> 
> 
> > 
> > There is one use case that isn't covered here and that we really need in order
> > to move forward on the RPi4/5: cores that can execute multiple tasks at once.
> > 
> > In the case of the Argon HEVC decoder on the Pi, the entropy decoder and the
> > reconstruction run in parallel, but the two functions use the same
> > trigger/IRQ pair.
> > 
> > In short, we need to be able (if there is enough data in the vb2 queue) to
> > schedule two consecutive jobs at once. On a timeline:
> > 
> > ----------------------------------------------------->
> > [entropy0][no decoder]
> >                      [entropy1][decode0]
> >                                         [entropy2][decode1]
> > 
> > Perhaps it already fits in the RFC, but it wasn't expressed clearly as a use
> > case. For real-time reasons, it is not really the driver's responsibility to
> > wait for buffers to be queued, and a no-op can happen in either of the two
> > functions. Also, I believe you can mix entropy decoding from one stream with
> > decoding a frame from another stream (another video session / m2m ctx).
> > 
> > Nicolas
> > 
> 
> I assume that the cores can be programmed separately, and that you can find which of the two
> cores is now idle when processing the interrupt? i.e.: this is effectively the same scenario we have
> with Mediatek vcodec?

No, there is only one core, which implements two features. Scheduling that
one core is still complex in this case, since it should, if possible, be fed
multiple jobs.

> 
> If so, this is already covered.
> 
> Basically, whenever a core is done with a job, that will signal the pipeline to try and make progress.  

In the current model, a job represents the execution of a task on a single
core, and that task is limited to one mem2mem ctx. In the MTK case, to fill
the pipeline, you'd need to pick work from possibly multiple mem2mem ctxs.




* Re: Discuss the multi-core media scheduler
  2024-04-28 18:26 Discuss the multi-core media scheduler Daniel Almeida
  2024-04-30 16:46 ` Nicolas Dufresne
@ 2024-05-03  3:25 ` Laurent Pinchart
  1 sibling, 0 replies; 6+ messages in thread
From: Laurent Pinchart @ 2024-05-03  3:25 UTC (permalink / raw)
  To: Daniel Almeida
  Cc: Hans Verkuil, Nicolas Dufresne, Mauro Carvalho Chehab,
	Linux Media Mailing List

Hi Daniel,

On Sun, Apr 28, 2024 at 03:26:35PM -0300, Daniel Almeida wrote:
> Hi everyone,
> 
> There seem to be a few unsolved problems in the mem2mem framework, one of
> which is the lack of support for architectures with multiple heterogeneous
> cores. For example, it is currently impossible to describe Mediatek's LAT and
> CORE cores to the framework as two independent units to be scheduled. This means
> that, at all times, one unit is idle while the other one is working.
> 
> I know that this is not the only problem with m2m, but it is where I'd like to
> start the discussion. Feel free to add your own requirements to the thread.

I'll add a comment, which doesn't solve your problem, but is possibly
still relevant.

We have a need to serve multiple clients and schedule them with
memory-to-memory ISPs. Those devices don't use the M2M framework, as
they have more than just one input and one output queue, and need to
handle formats and selection rectangles in addition to controls and
buffer queues. A few out-of-tree drivers currently create multiple
"virtual" device instances to address this need. I don't like this
solution much, as it creates a lot of video devices and sets an
arbitrary bound on the number of clients.

We're instead considering solving the issue by exposing the ability to
submit a job through the media controller device. Similarly to the M2M
framework, we would use multiple opens with one file handle per client.
This is similar to the request API, but instead of setting per-request
parameters through video devices and subdevs, we would pass them all in
one go through the media controller device.

At this point we don't foresee the need to support multi-core ISPs, but
there's clearly a need for scheduling multiple clients.

-- 
Regards,

Laurent Pinchart

