From: Thierry Reding <thierry.reding@gmail.com>
To: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>,
	Tomasz Figa <tfiga@chromium.org>,
	Nicolas Dufresne <nicolas@ndufresne.ca>,
	Linux Media Mailing List <linux-media@vger.kernel.org>,
	Hans Verkuil <hverkuil-cisco@xs4all.nl>,
	Alexandre Courbot <acourbot@chromium.org>,
	Maxime Ripard <maxime.ripard@bootlin.com>,
	Jernej Skrabec <jernej.skrabec@siol.net>,
	Ezequiel Garcia <ezequiel@collabora.com>,
	Jonas Karlman <jonas@kwiboo.se>
Subject: Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
Date: Wed, 22 May 2019 13:39:24 +0200	[thread overview]
Message-ID: <20190522113924.GE30938@ulmo> (raw)
In-Reply-To: <0961169f7abbecbc0f8382f946bd37dc9bea8507.camel@bootlin.com>

On Wed, May 22, 2019 at 11:29:13AM +0200, Paul Kocialkowski wrote:
> On Wednesday, 22 May 2019 at 10:32 +0200, Thierry Reding wrote:
> > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > On Wed, 22 May 2019 15:39:37 +0900
> > > Tomasz Figa <tfiga@chromium.org> wrote:
> > > 
> > > > > It would be premature to state that we are excluding anything. We are
> > > > > just trying to find one format to get things upstream, and make sure we
> > > > > have a plan for how to extend it. Trying to support everything on the
> > > > > first try is not going to work so well.
> > > > > 
> > > > > What would be interesting to know is how your IP achieves multi-slice
> > > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > > Typical questions are:
> > > > > 
> > > > >   1. Do all slices have to be contiguous in memory
> > > > >   2. If 1., do you place a start code or AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > > > >   3. Does the HW support a single interrupt per frame (RK3288, as an example, does not, but RK3399 does)?
> > > > 
> > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > > driver that was created mostly by RK themselves, we've been assuming
> > > > that the interrupt is for the complete frame, without any problems.
> > > 
> > > I confirm that's what happens when all slices forming a frame are packed
> > > in a single output buffer: you only get one interrupt at the end of the
> > > decoding process (in that case, when the frame is decoded). Of course,
> > > if you split things up and do per-slice decoding instead (one slice per
> > > buffer) you get an interrupt per slice, though I didn't manage to make
> > > that work.
> > > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > > interrupt on the second slice, which makes me think some state is
> > > reset between the two operations, leading the engine to think that the
> > > second slice is part of a new frame.
> > 
> > That sounds a lot like how this works on Tegra. My understanding is that
> > for slice decoding you'd also get an interrupt every time a full slice
> > has been decoded perhaps coupled with another "frame done" interrupt
> > when the full frame has been decoded after the last slice.
> > 
> > In frame-level decode mode you don't get interrupts in between and
> > instead only get the "frame done" interrupt. Unless something went wrong
> > during decoding, in which case you also get an interrupt but with error
> > flags and status registers that help determine what exactly happened.
> > 
> > > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > > and per-frame decoding and maybe have a way to expose what a
> > > specific codec can do (through an extra cap mechanism).
> > 
> > Yeah, I think it makes sense to support both for devices that can do
> > both. From what Nicolas said it may make sense for an application to
> > want to do slice-level decoding if receiving a stream from the network
> > and frame-level decoding if playing back from a local file. If a driver
> > supports both, the application could detect that and choose the
> > appropriate format.
> > 
> > It sounds to me like using different input formats for that would be a
> > very natural way to describe it. Applications can already detect the set
> > of supported input formats and set the format when they allocate buffers
> > so that should work very nicely.
> 
> Pixel formats are indeed the natural way to go about this, but I have
> some reservations in this case. Slices are the natural unit of video
> streams, just like frames are to display hardware. Part of the pipeline
> configuration is slice-specific, so in theory, the pipeline needs to be
> reconfigured with each slice.
> 
> What we currently do in Cedrus is gather all the slices and use the
> last slice's specific configuration for the pipeline, which sort of
> works but is very likely not a good idea.

To be honest, my testing has been very minimal, so it's quite possible
that I've only ever run into examples with either a single slice or
multiple slices with the same configuration, or perhaps with differing
configurations whose differences were insignificant (or not
noticeable).

> You mentioned that the Tegra VPU currently always operates in frame
> mode (even when the stream actually has multiple slices, which I assume
> are gathered at some point). I wonder how it goes about configuring
> different slice parameters (which are specific to each slice, not to
> the frame) for the different slices.

That's part of the beauty of the frame-level decoding mode (I think
it's called SXE-P). The syntax engine has access to the complete
bitstream and can parse all the information that it needs. There's some
data that we pass into the decoder from the SPS and PPS, but other than
that the VDE will do everything by itself.

> I believe we should at least always expose per-slice granularity in the
> pixel format and requests. Maybe we could have a way to allow multiple
> slices to be gathered in the source buffer and have a control slice
> array for each request. In that case, we'd have a single request queued
> for the series of slices, with a bit offset in each control to the
> matching slice.
> 
> Then we could specify that such slices must be appended in a way that
> suits most decoders that would have to operate per-frame (so we need to
> figure this out) and, in the worst case, we'll always have offsets in
> the controls if we need to set up a bounce buffer in the driver because
> things are not laid out the way we specified.
> 
> Then we introduce a specific cap to indicate which mode is supported
> (per-slice and/or per-frame) and adapt our ffmpeg reference to be able
> to operate in both modes.
> 
> That adds some complexity for userspace, but I don't think we can avoid
> it at this point and it feels better than having two different pixel
> formats (which would probably be even more complex to manage for
> userspace).
> 
> What do you think?

I'm not sure I understand why this would be simpler than exposing two
different pixel formats. It sounds like essentially the same thing, just
with a different method.

One advantage I see with your approach is that it more formally defines
how slices are passed, which might be a good thing to do anyway. I'm not
sure whether software stacks provide that information, though. If they
do, this would be trivial to achieve. If they don't, this could be an
extra burden on userspace for decoders that don't need it.

Would it perhaps be possible to make this slice meta data optional? For
example, could we just provide an H.264 slice pixel format and let
userspace fill in buffers in whatever way it wants, provided that it
follows some rules (Annex B or something else, concatenated slices,
...)? If there's an extra control specifying the offsets of individual
slices, drivers can use that; if not, and if frame-level decoding is
supported, they just pass the bitstream buffer to the hardware and let
it do its thing.

Hardware that has requirements different from that could require the
meta data to be present and fail otherwise.

On the other hand, userspace would have to be prepared to deal with this
type of hardware anyway, so it basically needs to provide the meta data
in any case. Perhaps the meta data could be optional if a buffer
contains a single slice.
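
Just to illustrate what I have in mind, here's a purely hypothetical
sketch (none of these names exist in the uAPI today) of what such an
optional per-buffer slice layout control could look like, using the
__u32 types from linux/types.h:

	/* Hypothetical control payload describing how slices are laid out
	 * inside a single OUTPUT buffer. Not part of the V4L2 uAPI, just an
	 * illustration for the sake of this discussion. */
	#define EXAMPLE_MAX_SLICES	16	/* arbitrary for this example */

	struct example_h264_slice_layout {
		__u32 num_slices;			/* slices in the buffer */
		__u32 offset[EXAMPLE_MAX_SLICES];	/* byte offset of each slice */
		__u32 size[EXAMPLE_MAX_SLICES];		/* byte size of each slice */
	};

If the control isn't set, a frame-level decoder would consume the
concatenated Annex B bitstream as-is, while a slice-level decoder could
either reject the buffer or parse the start codes itself.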

One other thing that occurred to me is that the meta data could perhaps
contain a more elaborate description of the data in the slice. But that
has the problem that it can't be detected upfront, so userspace can't
discover whether the decoder can handle that data until an error is
returned from the decoder upon receiving the meta data.

To answer your question: I don't feel strongly one way or the other. The
above is really just discussing the specifics of how the data is passed,
but we don't really know what exactly the data is that we need to pass.

> > > The other option would be to support only per-slice decoding with a
> > > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > > only support per-frame decoding know when they should trigger the
> > > decoding operation. The downside is that it implies having a bounce
> > > buffer where the driver can pack slices to be decoded on the END_FRAME
> > > event.
> > 
> > I vaguely remember that that's what the video codec abstraction does in
> > Mesa/Gallium. 
> 
> Well, if it's exposed through VDPAU or VAAPI, the interface already
> operates per-slice and it would certainly not be a big issue to change
> that.

The pipe video codec interface provides a ->decode_bitstream() callback
that gets a number of buffer/size pairs along with a picture description
(which corresponds roughly to the SPS/PPS). The buffer/size pairs are
exactly what's passed in from VDPAU or VAAPI. It looks like VDPAU can
pass multiple slices, one per VdpBitstreamBuffer, whereas VAAPI passes
only a single buffer at a time at the driver level.
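
From memory, that callback in p_video_codec.h looks roughly like this
(the exact prototype in current Mesa may differ slightly):

	void (*decode_bitstream)(struct pipe_video_codec *codec,
				 struct pipe_video_buffer *target,
				 struct pipe_picture_desc *picture,
				 unsigned num_buffers,
				 const void * const *buffers,
				 const unsigned *sizes);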

(Interesting side-note: VDPAU seems to require the start code to be part
of the bitstream, whereas the VAAPI state tracker in Mesa will go and
check whether a buffer contains the start code and prepend it via SG if
not. So at the pipe_video_codec level it seems the decision was made to
use Annex B as the lowest common denominator).
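
The check itself is cheap since Annex B only allows the three- or
four-byte start code prefixes. A simplified sketch of the idea (not the
actual Mesa code):

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Does the buffer already begin with an Annex B start code prefix
	 * (00 00 01 or 00 00 00 01)? */
	static bool has_annexb_start_code(const uint8_t *buf, size_t size)
	{
		if (size >= 3 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0x01)
			return true;
		if (size >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
		    buf[2] == 0x00 && buf[3] == 0x01)
			return true;
		return false;
	}

If this returns false, the state tracker prepends the prefix before
handing the buffer to the driver.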

> Talking about the mesa/gallium video decoding stuff, I think it would
> be worth having V4L2 interfaces for that now that we have the Request
> API.

Yeah, I think that'd be nice, but I'm not sure that you're going to find
someone to redo all the work...

> Basically, Nvidia GPUs have video decoding blocks (which could be
> similar to the ones present on Tegra) that are accessed through a
> firmware running on a Falcon MCU on the GPU side.

Yeah, the video decoding blocks on GPUs are very similar to the ones
found on more recent Tegra. The big difference, of course, is that on
Tegra they are separate (platform) devices, whereas on the GPU they are
part of the PCI device's register space. It'd be nice if we could
somehow share drivers between the two, but I'm not sure that that's
possible. Besides the different bus, there are also differences in how
memory is managed (video RAM on the GPU vs. system memory on Tegra) and so
on.

> Having a standardized firmware interface for these and a V4L2 M2M
> driver for the interface would certainly make it easier for everyone to
> handle that. I don't really see why this video decoding hardware has
> to be exposed through the display stack anyway, and one could want to
> use the GPU's video decoder without bringing up the shading cores.

Are you saying that it might be possible to structure this as basically
two "backend" drivers that each expose the command stream interface and
then build a "frontend" driver that could talk to either backend? That
sounds like a really nice idea, but I'm not sure that it'd work.

> > I'm not very familiar with V4L2, but this seems like it
> > could be problematic to integrate with the way that V4L2 works in
> > general. Perhaps sending a special buffer (0 length or whatever) to mark
> > the end of a frame would work. But this is probably something that
> > others have already thought about, since slice-level decoding is what
> > most people are using, hence there must already be a way for userspace
> > to somehow synchronize input vs. output buffers. Or does this currently
> > just work by queueing bitstream buffers as fast as possible and then
> > dequeueing frame buffers as they become available?
> 
> We have a Request API mechanism where we group controls (parsed
> bitstream meta-data) and source (OUTPUT) buffers together and submit
> them tied. When each request gets processed its buffer enters the
> OUTPUT queue, which gets picked up by the driver and associated with
> the first destination (CAPTURE) buffer available. Then the driver grabs
> the buffers and applies the controls matching the source buffer's
> request before starting decoding with M2M.
> 
> We have already worked on handling the case of requiring a single
> destination buffer for the different slices, by having a flag to
> indicate whether the destination buffer should be held.

Right. So it sounds like the request is the natural boundary here. I guess
that would allow drivers to manually concatenate accumulated bitstream
buffers into a single one.
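
For the record, as far as I understand it, the userspace side of that
flow looks roughly like the sketch below (error handling omitted and
the actual control payloads elided, so treat it as an illustration
rather than reference code):

	#include <sys/ioctl.h>
	#include <linux/media.h>
	#include <linux/videodev2.h>

	/* Queue one bitstream buffer plus its parsed meta-data as a single
	 * request. */
	static void queue_bitstream(int media_fd, int video_fd,
				    struct v4l2_ext_controls *ctrls,
				    struct v4l2_buffer *buf)
	{
		int req_fd;

		/* Allocate a request on the media device. */
		ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd);

		/* Attach the parsed bitstream meta-data to the request. */
		ctrls->which = V4L2_CTRL_WHICH_REQUEST_VAL;
		ctrls->request_fd = req_fd;
		ioctl(video_fd, VIDIOC_S_EXT_CTRLS, ctrls);

		/* Queue the OUTPUT (bitstream) buffer as part of the request. */
		buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
		buf->request_fd = req_fd;
		ioctl(video_fd, VIDIOC_QBUF, buf);

		/* Submit the request; the driver applies the controls and
		 * decodes once a CAPTURE buffer becomes available. */
		ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
	}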

Thierry

Thread overview: 55+ messages
2019-05-15 10:09 Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support Paul Kocialkowski
2019-05-15 14:42 ` Nicolas Dufresne
2019-05-15 17:42   ` Paul Kocialkowski
2019-05-15 18:54     ` Nicolas Dufresne
2019-05-15 20:59       ` Paul Kocialkowski
2019-05-16 18:24         ` Nicolas Dufresne
2019-05-16 18:45           ` Paul Kocialkowski
2019-05-17 20:43             ` Nicolas Dufresne
2019-05-18  9:50               ` Paul Kocialkowski
2019-05-18 10:04                 ` Jernej Škrabec
2019-05-18 10:29                   ` Paul Kocialkowski
2019-05-18 14:09                     ` Nicolas Dufresne
2019-05-22  6:48                       ` Tomasz Figa
2019-05-22  8:26                         ` Paul Kocialkowski
2019-05-22 10:42                           ` Thierry Reding
2019-05-22 10:55                             ` Hans Verkuil
2019-05-22 11:55                               ` Thierry Reding
2019-06-07  6:11                               ` Tomasz Figa
2019-06-07  6:45                                 ` Hans Verkuil
2019-06-07  8:23                                   ` Hans Verkuil
2019-05-21 10:27     ` Tomasz Figa
2019-05-21 11:44       ` Paul Kocialkowski
2019-05-21 15:09         ` Thierry Reding
2019-05-21 16:07           ` Nicolas Dufresne
2019-05-22  8:08             ` Thierry Reding
2019-05-22  6:01         ` Tomasz Figa
2019-05-22 18:15           ` Nicolas Dufresne
2019-05-21 15:43     ` Thierry Reding
2019-05-21 16:23       ` Nicolas Dufresne
2019-05-22  6:39         ` Tomasz Figa
2019-05-22  7:29           ` Boris Brezillon
2019-05-22  8:20             ` Boris Brezillon
2019-05-22 18:18               ` Nicolas Dufresne
2019-05-22  8:32             ` Thierry Reding
2019-05-22  9:29               ` Paul Kocialkowski
2019-05-22 11:39                 ` Thierry Reding [this message]
2019-05-22 18:31                   ` Nicolas Dufresne
2019-05-22 18:26                 ` Nicolas Dufresne
2019-05-22 10:08         ` Thierry Reding
2019-05-22 18:37           ` Nicolas Dufresne
2019-05-23 21:04 ` Jonas Karlman
2019-06-03 11:24 ` Thierry Reding
2019-06-03 18:52   ` Nicolas Dufresne
2019-06-03 19:41     ` Boris Brezillon
2019-06-04  8:31       ` Thierry Reding
2019-06-04  8:49         ` Boris Brezillon
2019-06-04  9:06           ` Thierry Reding
2019-06-04  9:15             ` Jonas Karlman
2019-06-04  9:28               ` Paul Kocialkowski
2019-06-04  9:38               ` Boris Brezillon
2019-06-04 10:49                 ` Jonas Karlman
2019-06-04  8:50     ` Thierry Reding
2019-06-04  8:55     ` Thierry Reding
2019-06-04  9:05       ` Boris Brezillon
2019-06-04  9:09         ` Paul Kocialkowski
