On Wednesday 22 May 2019 at 13:39 +0200, Thierry Reding wrote:
> On Wed, May 22, 2019 at 11:29:13AM +0200, Paul Kocialkowski wrote:
> > On Wednesday 22 May 2019 at 10:32 +0200, Thierry Reding wrote:
> > > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > > On Wed, 22 May 2019 15:39:37 +0900, Tomasz Figa wrote:

> > > > > > It would be premature to state that we are excluding. We are just trying to find one format to get things upstream, and make sure we have a plan for how to extend it. Trying to support everything on the first try is not going to work so well.

> > > > > > What is interesting to provide is how your IP achieves multi-slice decoding per frame. That's what we are studying on the RK/Hantro chip. Typical questions are:

> > > > > > 1. Do all slices have to be contiguous in memory?
> > > > > > 2. If 1., do you place a start-code, an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > > > > > 3. Does the HW support a single interrupt per frame? (RK3288, as an example, does not, but RK3399 does.)

> > > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream driver that was created mostly by RK themselves, we've been assuming that the interrupt is for the complete frame, without any problems.

> > > > I confirm that's what happens when all slices forming a frame are packed in a single output buffer: you only get one interrupt at the end of the decoding process (in that case, when the frame is decoded). Of course, if you split things up and do per-slice decoding instead (one slice per buffer) you get an interrupt per slice, though I didn't manage to make that work. I get a DEC_BUFFER interrupt (AKA "buffer is empty but frame is not fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering) interrupt on the second slice, which makes me think some states are reset between the two operations, leading the engine to think that the second slice is part of a new frame.

> > > That sounds a lot like how this works on Tegra. My understanding is that for slice decoding you'd also get an interrupt every time a full slice has been decoded, perhaps coupled with another "frame done" interrupt when the full frame has been decoded after the last slice.

> > > In frame-level decode mode you don't get interrupts in between and instead only get the "frame done" interrupt. Unless something went wrong during decoding, in which case you also get an interrupt but with error flags and status registers that help determine what exactly happened.

> > > > Anyway, it doesn't sound like a crazy idea to support both per-slice and per-frame decoding and maybe have a way to expose what a specific codec can do (through an extra cap mechanism).

> > > Yeah, I think it makes sense to support both for devices that can do both. From what Nicolas said it may make sense for an application to want to do slice-level decoding if receiving a stream from the network and frame-level decoding if playing back from a local file. If a driver supports both, the application could detect that and choose the appropriate format.

> > > It sounds to me like using different input formats for that would be a very natural way to describe it.
> > > Applications can already detect the set of supported input formats and set the format when they allocate buffers, so that should work very nicely.

> > Pixel formats are indeed the natural way to go about this, but I have some reservations in this case. Slices are the natural unit of video streams, just like frames are to display hardware. Part of the pipeline configuration is slice-specific, so in theory, the pipeline needs to be reconfigured with each slice.

> > What we have been doing in Cedrus is to currently gather all the slices and use the last slice's specific configuration for the pipeline, which sort of works, but is very likely not a good idea.

> To be honest, my testing has been very minimal, so it's quite possible that I've always only run into examples with either only a single slice or multiple slices with the same configuration. Or perhaps with differing configurations but non-significant (or non-noticeable) differences.

> > You mentioned that the Tegra VPU currently always operates in frame mode (even when the stream actually has multiple slices, which I assume are gathered at some point). I wonder how it goes about configuring different slice parameters (which are specific to each slice, not the frame) for the different slices.

> That's part of the beauty of the frame-level decoding mode (I think that's called SXE-P). The syntax engine has access to the complete bitstream and can parse all the information that it needs. There's some data that we pass into the decoder from the SPS and PPS, but other than that the VDE will do everything by itself.

> > I believe we should at least always expose per-slice granularity in the pixel format and requests. Maybe we could have a way to allow multiple slices to be gathered in the source buffer and have a control slice array for each request. In that case, we'd have a single request queued for the series of slices, with a bit offset in each control to the matching slice.

> > Then we could specify that such slices must be appended in a way that suits most decoders that would have to operate per-frame (so we need to figure this out) and, worst case, we'll always have offsets in the controls if we need to set up a bounce buffer in the driver because things are not laid out the way we specified.

> > Then we introduce a specific cap to indicate which mode is supported (per-slice and/or per-frame) and adapt our ffmpeg reference to be able to operate in both modes.

> > That adds some complexity for userspace, but I don't think we can avoid it at this point and it feels better than having two different pixel formats (which would probably be even more complex to manage for userspace).

> > What do you think?

> I'm not sure I understand why this would be simpler than exposing two different pixel formats. It sounds like essentially the same thing, just with a different method.

> One advantage I see with your approach is that it more formally defines how slices are passed. This might be a good thing to do anyway. I'm not sure if software stacks provide that information at all. If they do, this would be trivial to achieve. If they don't, this could be an extra burden on userspace for decoders that don't need it.
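To make the slice-array idea above a bit more concrete, the per-request meta-data could look roughly like the sketch below. None of these structure or field names exist in the uAPI today; they are purely hypothetical and only meant to illustrate the shape of the proposal:

#include <linux/types.h>

/* Purely hypothetical sketch: one OUTPUT buffer carries all the slices of a
 * frame appended back to back, and a per-request array control tells the
 * driver where each slice starts. A per-frame decoder can ignore the offsets
 * and consume the whole buffer; a per-slice decoder (or a driver using a
 * bounce buffer) can use them to locate each NAL. */
struct hypothetical_h264_slice_entry {
	__u32 bit_offset;	/* start of the slice in the buffer, in bits */
	__u32 bit_size;		/* size of the slice, in bits */
};

struct hypothetical_h264_slice_array {
	__u32 num_slices;
	struct hypothetical_h264_slice_entry slices[16];	/* arbitrary cap */
};

The cap mentioned above would then tell userspace whether it must submit one slice per request or may pack a whole frame like this.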
Just to feed the discussion, in GStreamer it would be exposed like this (except that this is the full bitstream, not just slices):

/* Full frame */
video/x-h264,stream-format=byte-stream,alignment=au

/* One or more NALs per memory buffer */
video/x-h264,stream-format=byte-stream,alignment=nal

"stream-format=byte-stream" means with start-codes; it could be an AVC or AVC3 bitstream too. We do it that way so that you have a common format with variants. I'm worried that having too many formats will not scale in the long term, that's all; I still think this solution works too. But note that we already have _H264 and _H264_NOSC formats. And then, what do you call a stream that only has slice NALs, but with all the slices of a frame in each buffer...

p.s. In Tegra OMX, there is a control to pick between AU/NAL, so I'm pretty sure the HW supports both ways.

> Would it perhaps be possible to make this slice meta data optional? For example, could we just provide an H.264 slice pixel format and then let userspace fill in buffers in whatever way they want, provided that they follow some rules (must be annex B or something else, concatenated slices, ...), and then, if there's an extra control specifying the offsets of individual slices, drivers can use that; if not, they just pass the bitstream buffer to the hardware (if frame-level decoding is supported) and let the hardware do its thing?

> Hardware that has requirements different from that could require the meta data to be present and fail otherwise.

> On the other hand, userspace would have to be prepared to deal with this type of hardware anyway, so it basically needs to provide the meta data in any case. Perhaps the meta data could be optional if a buffer contains a single slice.

> One other thing that occurred to me is that the meta data could perhaps contain a more elaborate description of the data in the slice. But that has the problem that it can't be detected upfront, so userspace can't discover whether the decoder can handle that data until an error is returned from the decoder upon receiving the meta data.

> To answer your question: I don't feel strongly one way or the other. The above is really just discussing the specifics of how the data is passed, but we don't really know what exactly the data is that we need to pass.

> > > > The other option would be to support only per-slice decoding with a mandatory START_FRAME/END_FRAME sequence to let drivers for HW that only supports per-frame decoding know when they should trigger the decoding operation. The downside is that it implies having a bounce buffer where the driver can pack slices to be decoded on the END_FRAME event.

> > > I vaguely remember that that's what the video codec abstraction does in Mesa/Gallium.

> > Well, if it's exposed through VDPAU or VAAPI, the interface already operates per-slice and it would certainly not be a big issue to change that.

> The video pipe callbacks can implement a ->decode_bitstream() callback that gets a number of buffer/size pairs along with a picture description (which corresponds roughly to the SPS/PPS). The buffer/size pairs are exactly what's passed in from VDPAU or VAAPI. It looks like VDPAU can pass multiple slices, one per VdpBitstreamBuffer, whereas VAAPI passes only a single buffer at a time at the driver level.
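For reference, that Gallium callback looks roughly like this (paraphrased from memory from Mesa's p_video_codec.h, so take the exact names and argument order with a grain of salt):

/* Rough paraphrase of the Mesa/Gallium decode entry point: the state tracker
 * hands the driver N bitstream chunks (one per slice for VDPAU, a single
 * buffer at a time for VAAPI) plus an out-of-band picture description, which
 * is roughly what a V4L2 request would carry as OUTPUT buffer(s) plus
 * controls. */
struct pipe_video_codec {
	/* ... */
	void (*decode_bitstream)(struct pipe_video_codec *codec,
				 struct pipe_video_buffer *target,
				 struct pipe_picture_desc *picture,
				 unsigned num_buffers,
				 const void * const *buffers,
				 const unsigned *sizes);
	/* ... */
};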
> (Interesting side-note: VDPAU seems to require the start code to be part of the bitstream, whereas the VAAPI state tracker in Mesa will go and check whether a buffer contains the start code and prepend it via SG if not. So at the pipe_video_codec level it seems the decision was made to use annex B as the lowest common denominator.)

> > Talking about the mesa/gallium video decoding stuff, I think it would be worth having V4L2 interfaces for that now that we have the Request API.

> Yeah, I think that'd be nice, but I'm not sure that you're going to find someone to redo all the work...

> > Basically, Nvidia GPUs have video decoding blocks (which could be similar to the ones present on Tegra) that are accessed through a firmware running on a Falcon MCU on the GPU side.

> Yeah, the video decoding blocks on GPUs are very similar to the ones found on more recent Tegra. The big difference, of course, is that on Tegra they are separate (platform) devices, whereas on the GPU they are part of the PCI device's register space. It'd be nice if we could somehow share drivers between the two, but I'm not sure that that's possible. Besides the different bus, there are also differences in how memory is managed (video RAM on the GPU vs. system memory on Tegra) and so on.

> > Having a standardized firmware interface for these and a V4L2 M2M driver for the interface would certainly make it easier for everyone to handle that. I don't really see why this video decoding hardware has to be exposed through the display stack anyway, and one could want to use the GPU's video decoder without bringing up the shading cores.

> Are you saying that it might be possible to structure this as basically two "backend" drivers that each expose the command stream interface and then build a "frontend" driver that could talk to either backend? That sounds like a really nice idea, but I'm not sure that it'd work.

> > > I'm not very familiar with V4L2, but this seems like it could be problematic to integrate with the way that V4L2 works in general. Perhaps sending a special buffer (0 length or whatever) to mark the end of a frame would work. But this is probably something that others have already thought about, since slice-level decoding is what most people are using, hence there must already be a way for userspace to somehow synchronize input vs. output buffers. Or does this currently just work by queueing bitstream buffers as fast as possible and then dequeueing frame buffers as they become available?

> > We have a Request API mechanism where we group controls (parsed bitstream meta-data) and source (OUTPUT) buffers together and submit them tied. When each request gets processed, its buffer enters the OUTPUT queue, which gets picked up by the driver and associated with the first destination (CAPTURE) buffer available. Then the driver grabs the buffers and applies the controls matching the source buffer's request before starting decoding with M2M.

> > We have already worked on handling the case of requiring a single destination buffer for the different slices, by having a flag to indicate whether the destination buffer should be held.

> Right. So it sounds like the request is the natural boundary here. I guess that would allow drivers to manually concatenate accumulated bitstream buffers into a single one.

> Thierry
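For anyone following the thread who hasn't played with the Request API yet, the submission flow Paul describes boils down to roughly the sketch below, one request per slice (or per frame, depending on what we settle on). It is only a sketch: error handling is omitted and the parsed-bitstream control IDs are still in staging, so filling the ctrls array is left to the caller:

#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Queue one slice (or frame) worth of bitstream plus its parsed meta-data
 * as a single request. */
static int queue_one_request(int media_fd, int video_fd,
			     struct v4l2_buffer *out_buf,
			     struct v4l2_ext_control *ctrls,
			     unsigned int nctrls)
{
	int req_fd;

	/* Allocate a request on the media device. */
	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd) < 0)
		return -1;

	/* Attach the parsed bitstream meta-data to the request instead of
	 * applying it immediately. */
	struct v4l2_ext_controls ext = {
		.which = V4L2_CTRL_WHICH_REQUEST_VAL,
		.request_fd = req_fd,
		.count = nctrls,
		.controls = ctrls,
	};
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext) < 0)
		return -1;

	/* Queue the OUTPUT (bitstream) buffer as part of the same request. */
	out_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
	out_buf->request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_QBUF, out_buf) < 0)
		return -1;

	/* Fire the request: the driver applies the controls and starts the
	 * decode once a CAPTURE buffer is available. */
	return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}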