From: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
To: Nicolas Dufresne <nicolas@ndufresne.ca>,
	Alexandre Courbot <acourbot@chromium.org>,
	Tomasz Figa <tfiga@chromium.org>,
	Maxime Ripard <maxime.ripard@bootlin.com>,
	Hans Verkuil <hverkuil@xs4all.nl>,
	Dafna Hirschfeld <dafna3@gmail.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	linux-media@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4] media: docs-rst: Document m2m stateless video decoder interface
Date: Wed, 17 Apr 2019 17:40:43 +0200
Message-ID: <2de76e9b61c79cefbcd89ce60f115d0fb98e89e4.camel@bootlin.com>
In-Reply-To: <c1d9ca7ed7f64b3255cb66918c70b752e5a7c3ba.camel@ndufresne.ca>

Hi,

On Wed, 2019-04-17 at 11:30 -0400, Nicolas Dufresne wrote:
> Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > Documents the protocol that user-space should follow when
> > > > communicating with stateless video decoders.
> > > > 
> > > > The stateless video decoding API makes use of the new request and tags
> > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > should probably still be considered staging for a short while.
> > 
> > [...]
> > 
> > > From an IRC discussion with Paul and some more digging, I have found a
> > > design problem in the decoding process.
> > > 
> > > In H264 and HEVC you can have multiple decoding units per frame
> > > (slices). This type of encoding is increasingly popular, especially
> > > for low-latency streaming use cases. The wording of this spec does
> > > allow for the notion of decoding units, and in practice it has been
> > > proven to work through some RFC FFmpeg patches and the Cedrus driver.
> > > But something important to know is that the FFmpeg RFC implements
> > > decoding in lock-step, which means:
> > > 
> > >   1. It queues a single free capture buffer
> > >   2. It queues an output buffer, sets controls, queues the request
> > >   3. It waits for a capture buffer to reach the done state
> > >   4. It dequeues that capture buffer, and queues it back again
> > >   5. It then runs steps 2, 4, 3 again with the following slices,
> > >      until we have a complete frame, after which it restarts at step 1
> > > 
> > > So the implementation makes no use of the queues. There is no batch
> > > processing, so we might not be able to reach the maximum hardware
> > > throughput.
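
To make that lock-step loop concrete, here is a minimal sketch in C,
using the single-planar API and MMAP memory for brevity. Note that
"struct slice" and queue_slice_request() are hypothetical helpers
standing in for the per-slice output buffer, control and request setup
(sketched further down against the quoted spec), and error handling is
omitted:

  #include <string.h>
  #include <poll.h>
  #include <sys/ioctl.h>
  #include <linux/videodev2.h>

  /* Decode one frame made of nr_slices slices, in lock-step. */
  static void decode_frame_lockstep(int video_fd, struct slice *slices,
                                    int nr_slices)
  {
      struct v4l2_buffer cap;
      struct pollfd pfd = { .fd = video_fd, .events = POLLIN };
      int i;

      memset(&cap, 0, sizeof(cap));
      cap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
      cap.memory = V4L2_MEMORY_MMAP;

      /* 1. Queue a single free capture buffer. */
      ioctl(video_fd, VIDIOC_QBUF, &cap);

      for (i = 0; i < nr_slices; i++) {
          /* 2. Queue the output buffer for this slice, set the
           *    controls and queue the request. */
          queue_slice_request(video_fd, &slices[i]);

          /* 3. Wait for the capture buffer to reach the done state. */
          poll(&pfd, 1, -1);

          /* 4. Dequeue the capture buffer, and queue it back for the
           *    next slice (the last dequeue returns the full frame). */
          ioctl(video_fd, VIDIOC_DQBUF, &cap);
          if (i < nr_slices - 1)
              ioctl(video_fd, VIDIOC_QBUF, &cap);
      }
  }
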
> > > 
> > > So the optimal method would look like the following, but here comes
> > > the design issue.
> > > 
> > >   1. Queue a single free capture buffer
> > >   2. Queue output buffer for slice 1, set controls, queue the request
> > >   3. Queue output buffer for slice 2, set controls, queue the request
> > >   4. Wait for completion
> > > 
> > > The problem is in step 4. Completion means that the capture buffer is
> > > done decoding a single unit. So assuming the driver supports matching
> > > the timestamp against the queued buffer, instead of waiting for a new
> > > buffer, the driver would have to mark the same buffer done twice,
> > > which just does not work to inform userspace that all slices are
> > > decoded into the one capture buffer they share.
> > 
> > Interestingly, I'm experiencing the exact same problem dealing with a
> > 2D graphics blitter that has limited output scaling abilities, which
> > implies handling a large scaling operation as multiple clipped smaller
> > scaling operations. The issue is basically that multiple jobs have to
> > be submitted to complete a single frame and relying on an indication
> > from the destination buffer (such as a fence) doesn't work to indicate
> > that all the operations were completed, since we get the indication at
> > each step instead of at the end of the batch.
> 
> That looks similar to the i.MX6 IPU m2m driver. It splits the image
> into tiles of 1024x1024 and processes each tile separately. This driver
> has been around for a long time, so I guess they have a solution to
> that. They don't need requests, because there is nothing to be bundled
> with the input image. I know that Renesas folks have started working on
> a de-interlacer. Again, this kind of driver may process and reuse input
> buffers for motion compensation, but I don't think they need a special
> userspace API for that.

Thanks for the reference! I hope it's not a blitter that was
contributed as a V4L2 driver instead of DRM, as it probably would be
more useful in DRM (but that's way beside the point).

> > One idea I see to solve this is to have a notion of batch in the driver
> > (for our situation, that would be in v4l2) and provide means to get a
> > done indication for that entity.
> 
> Can't you just make this part of your driver state machine?

Yes definitely, and I forgot to mention that's in DRM, not V4L2.

Anyway from that point on, I was back to talking about our codec
situation, not my 2D blitter anymore :)

> > I think we could extend the request API to allow this. We already
> > represent requests as individual file descriptors; we could totally
> > group requests in batches and get a sync fd for the batch to poll on
> > when we need to return the frames. It would be good if we could expose
> > this in a way that makes it work with DRM as an in-fence for display.
> > Then we can pretty much schedule our flip + decoding together (which is
> > quite nice to have when we're running late on the decoding side).
> > 
> > What do you think?
> 
> I'm not sure why this specific thing needs to be exposed to userspace.

Indeed, I'll handle it in my 2D driver.

Cheers,

Paul

> > It feels to me like the request API was designed to open up the way for
> > these kinds of improvements, so I'm sure we can find an agreeable
> > solution that extends the API.
> > 
> > > To me, multi-slice encoded streams are just too common, and they
> > > will also exist for AV1. So we really need a solution to this that
> > > does not require operating in lock-step. Especially since some HW can
> > > decode multiple slices in parallel (multi-core), we would not want to
> > > prevent that HW from being used efficiently. On top of this, we need
> > > a solution so that we can also keep queuing slices of the following
> > > frames if they arrive before decoding is done.
> > 
> > Agreed.
> > 
> > > I don't have a solution yet myself, but it would be nice to come up
> > > with something before we freeze this API.
> > 
> > I think it's rather independent of the codec used and this is
> > something that should be handled at the request API level.
> > 
> > I'm not sure we can always expect the hardware to be able to operate on
> > a per-slice basis. I think it would be useful to reflect this in the
> > pixel format, so that we also have a possibility for a gathered slice
> > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > hardware that will need to decode one frame in one go from a contiguous
> > buffer with all the slice data appended.
> > 
> > This updates my pixel format proposition from IRC to the following:
> > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > (appended buffer), slice params as v4l2 control (legacy);
> > 
> > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > 
> > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > slice params encoded in the buffer;
> > 
> > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > slice params encoded in the buffer and in slice params control;
> > 
> > Also, we need to make sure to have a per-slice bit offset to the
> > encoded data in the slice params control so that the same slice buffer
> > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > an annex-b slice and use any of the formats with it).
> > 
> > For the legacy format, we need to specify that the appended slices
> > don't repeat the annex-b start code and NAL header.
> > 
> > What do you think?
> > 
> > > By the way, if we could queue the same buffer twice, that would in
> > > principle work, but internally there is only one state per buffer. If
> > > you do external allocation, then in theory you could work around
> > > that, but then it's ugly, because you'll have two buffers with the
> > > same timestamp.
> > 
> > One advantage of the request API is that buffers are actually queued
> > when the request is processed, so this might not be too problematic.
> > 
> > I think what we need boils down to:
> > - Being able to queue the same output buffer to multiple requests,
> > which the request API should already allow;
> > - Being able to grab the right capture buffer based on the output
> > timestamp so that the different requests for the slices are rendered to
> > the same destination buffer.
> > 
> > For the second point, I don't really have a clear idea of whether we
> > can already expect v4l2 to allow picking a buffer that was marked done
> > but was not dequeued by userspace yet. It might already be allowed and
> > we could just implement something to look up the buffer to grab by
> > timestamp.
> > 
> > > An argument that was made early on was that we don't need to support
> > > this right away because userspace can combine all the slices into one
> > > buffer. But for the H264_SLICE_RAW format it's inconvenient: you'd
> > > need an extra control to tell the driver the offset to each slice,
> > > because raw H264 does not have enough information to be parsed. RAW
> > > slices are also, I believe, de-emulated, which means the escape code
> > > used to prevent patterns that look like a start code has been
> > > removed, so you cannot just prepend start codes. De-emulation seems
> > > better placed in userspace if the HW does not take care of it.
> > 
> > Mhh, I'd like to avoid having to specify the offset to each slice for
> > the legacy case. Just appending the encoded data (excluding slice
> > header and start code) works for cedrus and I think it makes sense
> > more generally. The idea is to only expose a single slice params
> > control and act as if it was just one big slice buffer.
> > 
> > Come to think of it, maybe we need annex-b and mixed fashions of that
> > legacy pixfmt too...
> > 
> > > I also very much dislike the idea that we would enforce merging all
> > > slices into the same buffer. The entire purpose of slices, and the
> > > reason they are used in practice, is that you can start decoding
> > > slices before you have all the slices of a frame. This drastically
> > > reduces the latency for streaming use cases, like video conferencing.
> > > So forcing the merging of slices is basically like pretending slices
> > > have no benefits.
> > 
> > Of course, we don't want things to stay like this and this rework is
> > definitely needed to get serious performance and latency going.
> > 
> > One thing you should also be aware of: we're currently using a
> > workqueue between the job done irq and scheduling the next frame (in
> > v4l2 m2m).
> > 
> > Maybe we could manage to fit that into an atomic path to schedule the
> > next request in the previous job done irq context.
> > 
> > > I have just exposed the problem I see for now, to see what comes up.
> > > But I hope we'll be able to propose a solution too in the short term
> > > (if no one beats me to it).
> > 
> > Seems that we have good grounds for a discussion!
> > 
> > Cheers,
> > 
> > Paul
> > 
> > > > +
> > > > +A typical frame would thus be decoded using the following sequence:
> > > > +
> > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``index``
> > > > +          index of the buffer being queued.
> > > > +
> > > > +      ``type``
> > > > +          type of the buffer.
> > > > +
> > > > +      ``bytesused``
> > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > +
> > > > +      ``flags``
> > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      ``timestamp``
> > > > +          must be set to a unique value per frame. This value will be propagated
> > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > +          as the reference of another.
> > > > +
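
As an illustration, a minimal sketch of the step above in C, using the
single-planar API for brevity; req_fd is assumed to come from
MEDIA_IOC_REQUEST_ALLOC on the media device, and frame_num is a
hypothetical per-frame counter used to build the unique timestamp:

  struct v4l2_buffer buf;

  memset(&buf, 0, sizeof(buf));
  buf.index = 0;
  buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
  buf.memory = V4L2_MEMORY_MMAP;
  buf.bytesused = slice_size;          /* bytes of encoded data */
  buf.flags = V4L2_BUF_FLAG_REQUEST_FD;
  buf.request_fd = req_fd;             /* from MEDIA_IOC_REQUEST_ALLOC */
  buf.timestamp.tv_usec = frame_num;   /* unique value per frame */
  ioctl(video_fd, VIDIOC_QBUF, &buf);
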
> > > > +2. Set the codec-specific controls for the decoding request, using
> > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``which``
> > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      other fields
> > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > +          array must contain all the codec-specific controls required to decode
> > > > +          a frame.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      It is possible to specify the controls in different invocations of
> > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > +      at the moment of request submission is the one that will be considered.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > +
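
A sketch of the control step above for H.264; the control ID and
payload structure follow the staging H.264 controls under discussion at
the time and are illustrative only, since the exact set of required
controls is codec- and driver-specific. fill_decode_params() is a
hypothetical bitstream parser helper:

  struct v4l2_ctrl_h264_decode_params decode_params;
  struct v4l2_ext_control ctrl;
  struct v4l2_ext_controls ctrls;

  /* Filled from the parsed bitstream headers. */
  fill_decode_params(&decode_params);

  memset(&ctrl, 0, sizeof(ctrl));
  ctrl.id = V4L2_CID_MPEG_VIDEO_H264_DECODE_PARAMS;
  ctrl.ptr = &decode_params;
  ctrl.size = sizeof(decode_params);

  memset(&ctrls, 0, sizeof(ctrls));
  ctrls.which = V4L2_CTRL_WHICH_REQUEST_VAL;
  ctrls.request_fd = req_fd;
  ctrls.count = 1;
  ctrls.controls = &ctrl;
  ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls);
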
> > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > +   request FD.
> > > > +
> > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > +    required controls are missing from the request, then
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > +
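
Submitting the request and waiting for it could then look like this: a
queued request fd signals POLLPRI once all of its buffers are done, and
MEDIA_REQUEST_IOC_REINIT (both ioctls come from <linux/media.h>) lets
the same fd be recycled for the next request:

  struct pollfd req_pfd = { .fd = req_fd, .events = POLLPRI };

  ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
  poll(&req_pfd, 1, -1);                    /* request completed */
  ioctl(req_fd, MEDIA_REQUEST_IOC_REINIT);  /* recycle for next frame */
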
> > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > +
> > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > +error, then all following decoded frames that refer to it also have the
> > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > +produce a (likely corrupted) frame.
> > > > +
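
On the CAPTURE side, collecting a decoded frame and checking the error
flag reduces to the following sketch (handle_corrupt_frame() is a
hypothetical placeholder):

  struct v4l2_buffer cap;

  memset(&cap, 0, sizeof(cap));
  cap.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
  cap.memory = V4L2_MEMORY_MMAP;
  ioctl(video_fd, VIDIOC_DQBUF, &cap);

  if (cap.flags & V4L2_BUF_FLAG_ERROR)
      handle_corrupt_frame(&cap);   /* frame produced, likely corrupted */
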
> > > > +Buffer management while decoding
> > > > +================================
> > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > +used by the client for as long as they are not queued again. "Used" here
> > > > +encompasses using the buffer for compositing or display.
> > > > +
> > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > +buffer.
> > > > +
> > > > +A frame is specified as a reference by converting its timestamp into nanoseconds,
> > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > +
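
v4l2_timeval_to_ns() is a kernel-side helper; in userspace the same
conversion is just a scale to nanoseconds. As an illustration with the
MPEG-2 staging controls of the time (field names may differ in the
final API):

  /* Userspace equivalent of v4l2_timeval_to_ns(). */
  __u64 ref_ts = (__u64)ref.timestamp.tv_sec * 1000000000ULL +
                 (__u64)ref.timestamp.tv_usec * 1000ULL;

  struct v4l2_ctrl_mpeg2_slice_params slice_params;

  memset(&slice_params, 0, sizeof(slice_params));
  slice_params.backward_ref_ts = ref_ts;   /* reference by timestamp */
  /* ... remaining slice parameters parsed from the bitstream ... */
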
> > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > +guarantee that buffers queued to the ``CAPTURE`` queue will be used in queued
> > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > +reference buffer when the following conditions are met:
> > > > +
> > > > +1. All the requests for frames affected by the reference frame have been
> > > > +   queued, and
> > > > +
> > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > +   referencing frames have been queued.
> > > > +
> > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > +all the resources associated with reference frames. This means that the client
> > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > +won't need them afterwards.
> > > > +
> > > > +Seeking
> > > > +=======
> > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > +corresponding to the new stream position. It must however be aware that
> > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > +valid state is sent to the decoder.
> > > > +
> > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > +from the pre-seek position.
> > > > +
> > > > +Pause
> > > > +=====
> > > > +
> > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > +will remain idle.
> > > > +
> > > > +Dynamic resolution change
> > > > +=========================
> > > > +
> > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > +the initialization sequence again with the new resolution:
> > > > +
> > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > +   corresponding output buffers.
> > > > +
> > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > +   queues.
> > > > +
> > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > +
> > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > +   Note that due to resolution constraints, a different format may need to be
> > > > +   picked on the ``CAPTURE`` queue.
> > > > +
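
Steps 2 and 3 of this sequence map to plain V4L2 calls, for example:

  int type;
  struct v4l2_requestbuffers reqbufs;

  /* Step 2: stop both queues. */
  type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
  ioctl(video_fd, VIDIOC_STREAMOFF, &type);
  type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
  ioctl(video_fd, VIDIOC_STREAMOFF, &type);

  /* Step 3: free all CAPTURE buffers. */
  memset(&reqbufs, 0, sizeof(reqbufs));
  reqbufs.count = 0;
  reqbufs.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
  reqbufs.memory = V4L2_MEMORY_MMAP;
  ioctl(video_fd, VIDIOC_REQBUFS, &reqbufs);
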
> > > > +Drain
> > > > +=====
> > > > +
> > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > +decoder.
-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

