On Tue, May 21, 2019 at 12:07:47PM -0400, Nicolas Dufresne wrote:
> Le mardi 21 mai 2019 à 17:09 +0200, Thierry Reding a écrit :
> > On Tue, May 21, 2019 at 01:44:50PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > >
> > > On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > > > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski wrote:
> > > > > Hi,
> > > > >
> > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > >
> > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > have a better idea of what the situation is like on platforms
> > > > > > > other than Allwinner. This email shares my conclusions about
> > > > > > > the situation and how we should update the MPEG-2, H.264 and
> > > > > > > H.265 controls accordingly.
> > > > > > >
> > > > > > > - Per-slice decoding
> > > > > > >
> > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > patch[1] to implement the required core bits. Once we agree it
> > > > > > > looks good, we should lift the restriction that all slices
> > > > > > > must be concatenated and have them submitted as individual
> > > > > > > requests.
> > > > > > >
> > > > > > > One question is what to do about other controls. I feel like
> > > > > > > it would make sense to always pass all the controls required
> > > > > > > for decoding the slice, including the ones that don't change
> > > > > > > across slices. But there may be no particular advantage to
> > > > > > > this and only downsides. Not doing it and relying on the
> > > > > > > "control cache" can work, but we need to specify that only a
> > > > > > > single stream can be decoded per opened instance of the v4l2
> > > > > > > device. This is the assumption we're going with for handling
> > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > >
> > > > > > My opinion on this is that the m2m instance is a state, and the
> > > > > > driver should be responsible for doing time-division
> > > > > > multiplexing across multiple m2m instance jobs. Doing the
> > > > > > time-division multiplexing in userspace would require some sort
> > > > > > of daemon to work properly across processes. I also think the
> > > > > > kernel is a better place for doing resource access scheduling
> > > > > > in general.
> > > > >
> > > > > I agree with that, yes. We always have a single m2m context and
> > > > > specific controls per opened device, so keeping cached values
> > > > > works out well.
> > > > >
> > > > > So maybe we shall explicitly require that the request with the
> > > > > first slice for a frame also contains the per-frame controls.
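For what it's worth, on the userspace side I'd expect that rule to
translate to something like the sketch below, using the media request
API (untested, error handling simplified; the v4l2_ctrl_h264_* types
and control IDs follow the current h264-ctrls.h proposal and may still
change):

#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>
/* v4l2_ctrl_h264_* and V4L2_CID_MPEG_VIDEO_H264_* are assumed to come
 * from the proposed h264-ctrls.h uAPI header. */

int submit_first_slice(int video_fd, int media_fd,
		       struct v4l2_ctrl_h264_sps *sps,
		       struct v4l2_ctrl_h264_pps *pps,
		       struct v4l2_ctrl_h264_slice_params *slice,
		       struct v4l2_buffer *buf)
{
	int request_fd;

	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &request_fd) < 0)
		return -1;

	/* Per-frame controls (SPS/PPS) go into the same request as the
	 * per-slice parameters for the first slice of the frame. */
	struct v4l2_ext_control ctrls[] = {
		{ .id = V4L2_CID_MPEG_VIDEO_H264_SPS,
		  .size = sizeof(*sps), .ptr = sps },
		{ .id = V4L2_CID_MPEG_VIDEO_H264_PPS,
		  .size = sizeof(*pps), .ptr = pps },
		{ .id = V4L2_CID_MPEG_VIDEO_H264_SLICE_PARAMS,
		  .size = sizeof(*slice), .ptr = slice },
	};
	struct v4l2_ext_controls ext = {
		.which = V4L2_CTRL_WHICH_REQUEST_VAL,
		.request_fd = request_fd,
		.count = sizeof(ctrls) / sizeof(ctrls[0]),
		.controls = ctrls,
	};

	/* The controls only take effect once the request is queued. */
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext) < 0)
		return -1;

	/* Queue the OUTPUT buffer holding the slice data to the same
	 * request (type, index, memory, etc. set up as usual), then
	 * kick the request off. */
	buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
	buf->request_fd = request_fd;
	if (ioctl(video_fd, VIDIOC_QBUF, buf) < 0)
		return -1;

	return ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE);
}

Requests carrying the remaining slices of the same frame would then
only need the slice parameters, with the per-frame controls coming from
the cached values.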
> > > > Agreed.
> > > >
> > > > One more argument not to allow such multiplexing is that despite
> > > > the API being called "stateless", there is actually some state
> > > > saved between frames, e.g. the Rockchip decoder writes some
> > > > intermediate data to local buffers which need to be given back to
> > > > the decoder to decode the next frame. Actually, on Rockchip there
> > > > is even a requirement to keep the reference list entries in the
> > > > same order between frames.
> > >
> > > Well, what I'm suggesting is to have one stream per m2m context,
> > > but it should certainly be possible to have multiple m2m contexts
> > > (multiple userspace open calls) that decode different streams
> > > concurrently.
> > >
> > > Is that really going to be a problem for Rockchip? If so, then the
> > > driver should probably enforce allowing a single userspace open and
> > > m2m context at a time.
> >
> > If you have hardware storing data necessary to the decoding process
> > in buffers local to the decoder, you'd have to have some sort of
> > context switch operation that backs up the data in those buffers
> > before you switch to a different context and restores those buffers
> > when you switch back. We have similar hardware on Tegra, though I'm
> > not exactly familiar with the details of what is saved and how
> > essential it is. My understanding is that those internal buffers can
> > be copied to external RAM or vice versa, but I suspect that this
> > isn't going to be very efficient. It may very well be that
> > restricting to a single userspace open is the most sensible option.
>
> That would be by far the worst for a browser use case, where an ad
> might have stolen the single instance you have available in HW. It's
> normal that context switching will have some impact on performance,
> but in general, most of the time, the other instances will be left
> idle by userspace. If there are no context switches, there should be
> no (or very little) overhead. Of course, it should not be a hard
> requirement to get a driver into the kernel, I'm not saying that.

Sounds like we're in agreement. I didn't mean to imply that all drivers
should be single-open. I was just trying to say that there may be cases
where it's not possible, or highly impractical, for a driver to do a
context switch (or several of them); a rough sketch of what enforcing
single-open could look like follows at the end of this mail.

> p.s. On the IMX8M/Hantro G1 they specifically say that the single-core
> decoder can handle up to 8 1080p60 streams at the same time. But there
> are some buffers being written back by the IP for every slice (at the
> end of the decoded reference frames).

I know that there is a similar mechanism on VDE for Tegra, where an
extra auxiliary buffer can be defined into which extra data is written,
though it's only used for some profiles (H.264 constrained baseline,
for example, does not seem to require it). I think this has to do with
reference picture marking. It may very well be that the other internal
buffers don't actually need to persist across multiple frame decode
operations, which would of course eliminate any concurrency issues.

Sorry for this being somewhat vague. I've only begun to familiarize
myself with the VDE and I keep getting side-tracked by a bunch of other
things. But I'm trying to pitch in while the discussion is ongoing in
the hope that it will help us come up with the best solution.
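For drivers that do need to restrict themselves, I'd imagine the open
path to look roughly like the sketch below (untested; "foo" and its
fields are made up, and only the open/release paths are shown):

#include <linux/atomic.h>
#include <linux/fs.h>
#include <media/v4l2-dev.h>
#include <media/v4l2-fh.h>

struct foo_dev {
	struct video_device vdev;
	atomic_t in_use;	/* 1 while a context holds the hardware */
};

static int foo_open(struct file *file)
{
	struct foo_dev *foo = video_drvdata(file);
	int ret;

	/* Refuse a second open: the hardware keeps decoding state in
	 * internal buffers that we cannot save and restore. */
	if (atomic_cmpxchg(&foo->in_use, 0, 1))
		return -EBUSY;

	ret = v4l2_fh_open(file);
	if (ret)
		atomic_set(&foo->in_use, 0);

	return ret;
}

static int foo_release(struct file *file)
{
	struct foo_dev *foo = video_drvdata(file);
	int ret = v4l2_fh_release(file);

	atomic_set(&foo->in_use, 0);
	return ret;
}

Thierry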