On Wednesday 22 May 2019 at 11:29 +0200, Paul Kocialkowski wrote:
> On Wednesday 22 May 2019 at 10:32 +0200, Thierry Reding wrote:
> > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > On Wed, 22 May 2019 15:39:37 +0900, Tomasz Figa wrote:
> > > > 
> > > > > It would be premature to state that we are excluding. We are just
> > > > > trying to find one format to get things upstream, and make sure we
> > > > > have a plan for how to extend it. Trying to support everything on
> > > > > the first try is not going to work so well.
> > > > > 
> > > > > What is interesting to provide is how your IP achieves multi-slice
> > > > > decoding per frame. That's what we are studying on the RK/Hantro
> > > > > chip. Typical questions are:
> > > > > 
> > > > > 1. Do all slices have to be contiguous in memory?
> > > > > 2. If 1., do you place a start code, an AVC header, or pass a
> > > > >    separate index to let the HW locate the start of each NAL?
> > > > > 3. Does the HW support a single interrupt per frame (RK3288, as an
> > > > >    example, does not, but RK3399 does)?
> > > > 
> > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > > driver, which was created mostly by RK themselves, we've been
> > > > assuming that the interrupt is for the complete frame, without any
> > > > problems.
> > > 
> > > I confirm that's what happens when all slices forming a frame are
> > > packed in a single output buffer: you only get one interrupt at the
> > > end of the decoding process (in that case, when the frame is decoded).
> > > Of course, if you split things up and do per-slice decoding instead
> > > (one slice per buffer) you get an interrupt per slice, though I didn't
> > > manage to make that work.
> > > I get a DEC_BUFFER interrupt (AKA "buffer is empty but frame is not
> > > fully decoded") on the first slice and an ASO (Arbitrary Slice
> > > Ordering) interrupt on the second slice, which makes me think some
> > > state is reset between the two operations, leading the engine to think
> > > that the second slice is part of a new frame.
> > 
> > That sounds a lot like how this works on Tegra. My understanding is
> > that for slice decoding you'd also get an interrupt every time a full
> > slice has been decoded, perhaps coupled with another "frame done"
> > interrupt when the full frame has been decoded after the last slice.
> > 
> > In frame-level decode mode you don't get interrupts in between and
> > instead only get the "frame done" interrupt. Unless something went
> > wrong during decoding, in which case you also get an interrupt, but
> > with error flags and status registers that help determine what exactly
> > happened.
> > 
> > > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > > and per-frame decoding and maybe have a way to expose what a specific
> > > codec can do (through an extra cap mechanism).
> > 
> > Yeah, I think it makes sense to support both for devices that can do
> > both. From what Nicolas said, it may make sense for an application to
> > want to do slice-level decoding if receiving a stream from the network
> > and frame-level decoding if playing back from a local file. If a driver
> > supports both, the application could detect that and choose the
> > appropriate format.
> > 
> > It sounds to me like using different input formats for that would be a
> > very natural way to describe it. Applications can already detect the
> > set of supported input formats and set the format when they allocate
> > buffers, so that should work very nicely.
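
If we go that way, the detection from userspace could look roughly like
this. Only a sketch: it assumes the _SLICE pixel format from the current
series is available in the headers, and the per-frame fourcc is made up
for the example.

#include <linux/videodev2.h>
#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical per-frame bitstream format, for illustration only. */
#define V4L2_PIX_FMT_H264_FRAME v4l2_fourcc('H', '2', '6', 'F')

static bool has_output_format(int video_fd, __u32 fourcc)
{
	struct v4l2_fmtdesc desc;
	int i;

	for (i = 0; ; i++) {
		memset(&desc, 0, sizeof(desc));
		desc.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
		desc.index = i;
		/* Enumeration ends when the ioctl returns an error. */
		if (ioctl(video_fd, VIDIOC_ENUM_FMT, &desc) < 0)
			return false;
		if (desc.pixelformat == fourcc)
			return true;
	}
}

/*
 * A streaming client would prefer per-slice (lower latency), a local
 * file player could pick per-frame when the driver offers it.
 */
static __u32 pick_bitstream_format(int video_fd, bool streaming)
{
	if (streaming && has_output_format(video_fd, V4L2_PIX_FMT_H264_SLICE))
		return V4L2_PIX_FMT_H264_SLICE;
	if (has_output_format(video_fd, V4L2_PIX_FMT_H264_FRAME))
		return V4L2_PIX_FMT_H264_FRAME;
	return V4L2_PIX_FMT_H264_SLICE;
}
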
> 
> Pixel formats are indeed the natural way to go about this, but I have
> some reservations in this case. Slices are the natural unit of video
> streams, just like frames are to display hardware. Part of the pipeline
> configuration is slice-specific, so in theory, the pipeline needs to be
> reconfigured with each slice.
> 
> What we are currently doing in Cedrus is to gather all the slices and
> use the last slice's specific configuration for the pipeline, which
> sort of works, but is very likely not a good idea.
> 
> You mentioned that the Tegra VPU currently always operates in frame
> mode (even when the stream actually has multiple slices, which I assume
> are gathered at some point). I wonder how it goes about configuring
> different slice parameters (which are specific to each slice, not the
> frame) for the different slices.

A per-frame CODEC won't ask for the l0/l1 lists, which are slice-specific.
This is the case for the RK3288: we don't pass that information. Instead
we build a list from the DPB entries; this is the list before applying
the modifications found in the slice header. The HW will do the rest.

> 
> I believe we should at least always expose per-slice granularity in the
> pixel format and requests. Maybe we could have a way to allow multiple
> slices to be gathered in the source buffer and have a control slice
> array for each request. In that case, we'd have a single request queued
> for the series of slices, with a bit offset in each control to the
> matching slice.
> 
> Then we could specify that such slices must be appended in a way that
> suits most decoders that would have to operate per-frame (so we need to
> figure this out) and worst case, we'll always have offsets in the
> controls if we need to set up a bounce buffer in the driver because
> things are not laid out the way we specified.
> 
> Then we introduce a specific cap to indicate which mode is supported
> (per-slice and/or per-frame) and adapt our ffmpeg reference to be able
> to operate in both modes.
> 
> That adds some complexity for userspace, but I don't think we can avoid
> it at this point and it feels better than having two different pixel
> formats (which would probably be even more complex to manage for
> userspace).
> 
> What do you think?
> 
> > > The other option would be to support only per-slice decoding with a
> > > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > > only supports per-frame decoding know when they should trigger the
> > > decoding operation. The downside is that it implies having a bounce
> > > buffer where the driver can pack slices to be decoded on the
> > > END_FRAME event.
> > 
> > I vaguely remember that that's what the video codec abstraction does
> > in Mesa/Gallium.
> 
> Well, if it's exposed through VDPAU or VAAPI, the interface already
> operates per-slice and it would certainly not be a big issue to change
> that.

VDPAU seems to be per-frame to me (I have only read that API once,
recently). I believe this is the main difference between the two. But
most VDPAU drivers need to do their own final bit of parsing. VAAPI has
a start/end call, so with a bounce buffer you can implement the other
way too in your driver. But then the downside is that userspace may be
doing parsing that won't be used by the driver.

> 
> Talking about the mesa/gallium video decoding stuff, I think it would
> be worth having V4L2 interfaces for that now that we have the Request
> API.
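
Coming back to the slice array idea above, the shape I would imagine for
it is roughly this. The structure names are made up, it is not an
existing uAPI, just an illustration of the offsets we would need:

#include <linux/types.h>

/*
 * One request, one OUTPUT buffer with all slices of a frame appended,
 * plus an array control describing where each slice starts so per-frame
 * drivers can locate each NAL (or pack a bounce buffer if needed).
 */
struct example_h264_slice_entry {
	__u32 bit_offset;	/* position of the slice in the OUTPUT buffer, in bits */
	__u32 bit_size;		/* size of the slice, in bits */
	/* truly per-slice fields (e.g. reference list modifications)
	 * would go here */
};

#define EXAMPLE_MAX_SLICES 16	/* arbitrary cap for the sketch */

struct example_ctrl_h264_slice_array {
	__u32 num_slices;
	struct example_h264_slice_entry slices[EXAMPLE_MAX_SLICES];
};
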
> 
> Basically, Nvidia GPUs have video decoding blocks (which could be
> similar to the ones present on Tegra) that are accessed through a
> firmware running on a Falcon MCU on the GPU side.
> 
> Having a standardized firmware interface for these and a V4L2 M2M
> driver for the interface would certainly make it easier for everyone to
> handle that. I don't really see why this video decoding hardware has to
> be exposed through the display stack anyway, and one could want to use
> the GPU's video decoder without bringing up the shading cores.
> 
> > I'm not very familiar with V4L2, but this seems like it could be
> > problematic to integrate with the way that V4L2 works in general.
> > Perhaps sending a special buffer (0 length or whatever) to mark the
> > end of a frame would work. But this is probably something that others
> > have already thought about, since slice-level decoding is what most
> > people are using, hence there must already be a way for userspace to
> > somehow synchronize input vs. output buffers. Or does this currently
> > just work by queueing bitstream buffers as fast as possible and then
> > dequeueing frame buffers as they become available?
> 
> We have a Request API mechanism where we group controls (parsed
> bitstream meta-data) and source (OUTPUT) buffers together and submit
> them tied. When each request gets processed, its buffer enters the
> OUTPUT queue, which gets picked up by the driver and associated with
> the first destination (CAPTURE) buffer available. Then the driver grabs
> the buffers and applies the controls matching the source buffer's
> request before starting decoding with M2M.
> 
> We have already worked on handling the case of requiring a single
> destination buffer for the different slices, by having a flag to
> indicate whether the destination buffer should be held.
> 
> Cheers,
> 
> Paul
> 
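
For completeness, the flow Paul describes looks roughly like this from
userspace when doing one request per slice. Only a sketch; the hold flag
comes from the proposal under discussion, so its name and value may
still change, and error/cleanup handling is omitted.

#include <linux/media.h>
#include <linux/videodev2.h>
#include <string.h>
#include <sys/ioctl.h>

/* From the "hold capture buffer" proposal, may not be in your headers yet. */
#ifndef V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF
#define V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF 0x00000200
#endif

/*
 * Queue one slice: allocate a request, attach the parsed slice controls
 * and the OUTPUT (bitstream) buffer to it, then submit the request.
 */
static int queue_slice(int media_fd, int video_fd, unsigned int out_index,
		       unsigned int slice_size,
		       struct v4l2_ext_controls *slice_ctrls, int last_slice)
{
	struct v4l2_buffer buf;
	struct v4l2_plane plane;
	int req_fd;

	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd) < 0)
		return -1;

	/* The controls only take effect once the request is queued. */
	slice_ctrls->which = V4L2_CTRL_WHICH_REQUEST_VAL;
	slice_ctrls->request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, slice_ctrls) < 0)
		return -1;

	memset(&buf, 0, sizeof(buf));
	memset(&plane, 0, sizeof(plane));
	plane.bytesused = slice_size;
	buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
	buf.memory = V4L2_MEMORY_MMAP;
	buf.index = out_index;
	buf.m.planes = &plane;
	buf.length = 1;
	buf.request_fd = req_fd;
	buf.flags = V4L2_BUF_FLAG_REQUEST_FD;
	/* Keep the same CAPTURE buffer until the frame's last slice. */
	if (!last_slice)
		buf.flags |= V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF;
	if (ioctl(video_fd, VIDIOC_QBUF, &buf) < 0)
		return -1;

	/* Submit; the driver applies the controls and decodes the slice.
	 * Reusing or closing req_fd afterwards is left out for brevity. */
	return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}
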