Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support

From: Thierry Reding <thierry.reding@gmail.com>
To: Nicolas Dufresne <nicolas@ndufresne.ca>
Cc: Paul Kocialkowski <paul.kocialkowski@bootlin.com>,
	Linux Media Mailing List <linux-media@vger.kernel.org>,
	Hans Verkuil <hverkuil-cisco@xs4all.nl>,
	Tomasz Figa <tfiga@chromium.org>,
	Alexandre Courbot <acourbot@chromium.org>,
	Boris Brezillon <boris.brezillon@collabora.com>,
	Maxime Ripard <maxime.ripard@bootlin.com>,
	Jernej Skrabec <jernej.skrabec@siol.net>,
	Ezequiel Garcia <ezequiel@collabora.com>,
	Jonas Karlman <jonas@kwiboo.se>
Subject: Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
Date: Tue, 4 Jun 2019 10:50:55 +0200
Message-ID: <20190604085055.GD9048@ulmo>
In-Reply-To: <a2f6bac6596da86d597d9ac4c12b1f72b772dbe5.camel@ndufresne.ca>

On Mon, Jun 03, 2019 at 02:52:44PM -0400, Nicolas Dufresne wrote:
> On Monday, 03 June 2019 at 13:24 +0200, Thierry Reding wrote:
> > On Wed, May 15, 2019 at 12:09:45PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > > 
> > > With the Rockchip stateless VPU driver in the works, we now have a
> > > better idea of what the situation is like on platforms other than
> > > Allwinner. This email shares my conclusions about the situation and how
> > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > 
> > > - Per-slice decoding
> > > 
> > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > to implement the required core bits. When we agree it looks good, we
> > > should lift the restriction that all slices must be concatenated and
> > > have them submitted as individual requests.
> > > 
> > > One question is what to do about other controls. I feel like it would
> > > make sense to always pass all the required controls for decoding the
> > > slice, including the ones that don't change across slices. But there
> > > may be no particular advantage to this and only downsides. Not doing it
> > > and relying on the "control cache" can work, but we need to specify
> > > that only a single stream can be decoded per opened instance of the
> > > v4l2 device. This is the assumption we're going with for handling
> > > multi-slice anyway, so it shouldn't be an issue.
> > > 
> > > - Annex-B formats
> > > 
> > > I don't think we have really reached a conclusion on the pixel formats
> > > we want to expose. The main issue is how to deal with codecs that need
> > > the full slice NALU with start code, where the slice_header is
> > > duplicated in raw bitstream, when others are fine with just the encoded
> > > slice data and the parsed slice header control.
> > > 
> > > My initial thinking was that we'd need 3 formats:
> > > - One that takes only the slice compressed data (without raw slice
> > > header and start code);
> > > - One that takes both the NALU data (including start code, raw header
> > > and compressed data) and slice header controls;
> > > - One that takes the NALU data but no slice header.
> > > 
> > > But I no longer think the latter really makes sense in the context of
> > > stateless video decoding.
> > > 
> > > A side-note: I think we should definitely have data offsets in every
> > > case, so that implementations can just push the whole NALU regardless
> > > of the format if they're lazy.
> > 
> > I started an NVIDIA internal discussion about this to get some thoughts
> > from our local experts and to fill in my gaps in understanding of NVIDIA
> > hardware that we might want to support.
> > 
> > As far as input format goes, there was pretty broad consensus that in
> > order for the ABI to be most broadly useful we need to settle on the
> > lowest common denominator, while drawing some inspiration from existing
> > APIs because they've already gone through a lot of these discussions and
> > came up with standard interfaces to deal with the differences between
> > decoders.
> 
> Note that we are making a statement with the stateless/stateful split.
> The userspace overhead is non-negligible if you start passing all this
> useless data to a stateful HW. As for other implementations, that's what
> we went through in order to reach the state we are at now.
> 
> It's interesting that you have this discussion with NVIDIA specialists.
> That being said, I think it would be better to provide the actual
> data (how different generations of HW work) before providing
> conclusions made by your team. Right now, we have deeply studied
> Cedrus, Hantro and Rockchip IP, and that's how we managed to reach this
> low overhead compromise. What we really want to see is whether there
> exists NVIDIA HW that does not fit either of the two interfaces, and why.

Sorry if I was being condescending, that was not my intention. I was
trying to share what I was able to learn in the short time while the
discussion was happening.

If I understand correctly, I think NVIDIA hardware falls in the category
covered by the second interface, that is: NALU data (start code, raw
header, compressed data) and slice header controls.

I'm trying to get some other things out of the way first, but then I
hope to have time to go back to porting the VDE driver to V4L2 so that I
have something more concrete to contribute.

> > In more concrete terms this means that we'll want to provide as much
> > data to the kernel as possible. On one hand that means that we need to
> > do all of the header parsing etc. in userspace and pass it to the kernel
> > to support hardware that can't parse this data by itself. At the same
> > time we want to provide the full bitstream to the kernel to make sure
> > that hardware that does some (or all) of the parsing itself has access
> > to this. We absolutely want to avoid having to reconstruct some of the
> > bitstream that userspace may not have passed in order to optimize for
> > some use cases.
> 
> Passing the entire bitstream without reconstruction is near impossible
> for a VDPAU or VAAPI driver. Even for FFMPEG, it makes everything much
> more complex. I think at some point we need to draw a line on what this
> new API should cover.

I think that's totally reasonable. I'm just trying to make sure that
this is something that will work for Tegra. It'd be very unfortunate
if we had to do something else entirely because V4L2 didn't cover what
we need.

> As an example, we have decided to support a new format, H264_SLICE,
> and this format has been defined as a "slice only" stream where PPS, SPS,
> etc. would be described in C structures. There is nothing that prevents
> adding other formats in the future. What we would like is that this
> remains as inclusive as possible to the "slice" accelerators we know,
> hence adding "per-frame" decoding, since we know the "per-slice"
> decoding is compatible. We also know that this does not add more work
> to existing userspace code that supports similar accelerators.
> 
> In fact, the first thing we kept in mind in our work is that it's very
> difficult to implement this in userspace, so keeping in mind compatibility
> with existing VAAPI/VDPAU userspace (like the accelerators in FFMPEG and
> GStreamer) felt essential to lead toward a fully Open Source
> solution.

Okay, thanks for clarifying that. Sounds like I was misinterpreting
where the discussion was headed.

We'll most likely need something other than the H264_SLICE format for
Tegra, so as long as that's something you guys will remain open to, that
sounds good to me.

Thierry
