Re: Stateless Encoding uAPI Discussion and Proposal - Hsia-Jun Li

From: Hsia-Jun Li <Randy.Li@synaptics.com>
To: Nicolas Dufresne <nicolas.dufresne@collabora.com>,
	Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Cc: linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
	"Hans Verkuil" <hverkuil@xs4all.nl>,
	"Sakari Ailus" <sakari.ailus@iki.fi>,
	"Andrzej Pietrasiewicz" <andrzej.p@collabora.com>,
	"Michael Tretter" <m.tretter@pengutronix.de>,
	"Jernej Škrabec" <jernej.skrabec@gmail.com>,
	"Chen-Yu Tsai" <wens@csie.org>,
	"Samuel Holland" <samuel@sholland.org>,
	"Thomas Petazzoni" <thomas.petazzoni@bootlin.com>
Subject: Re: Stateless Encoding uAPI Discussion and Proposal
Date: Wed, 23 Aug 2023 11:04:49 +0800
Message-ID: <52e9b710-5011-656b-aebf-8d57e6496ddd@synaptics.com>
In-Reply-To: <a0fa6559c3933a5a4c8b7502282adae3429e0b57.camel@collabora.com>

On 8/23/23 04:31, Nicolas Dufresne wrote:
> Hi,
>>
> 
> [...]
> 
>>> In cable streaming notably, the RC job is to monitor the amount of bits over a
>>> period of time (the window). This window is defined by the streaming hardware
>>> buffering capabilities. Best at this point is to start reading through HRD
>>> specifications, and open source rate control implementation (notably x264).
>>>
>>> I think overall, we can live with adding hints where needed, and if the gop
>>> information is an appropriate hint, then we can just reuse the existing control.
>>>
>> Why do we still care about GOP here? Hardware has no idea about GOP at
>> all. Although in codecs like HEVC the NALU headers of IDR and other
>> intra pictures differ, there is no difference in the hardware coding
>> configuration. The NALU header is usually generated by userspace.
>>
>> Whether future encoding treats the currently encoded picture as an IDR
>> is decided entirely by userspace.
> 
> The discussion was around having basic RC algorithm in the kernel driver,
What I wonder is who would actually use a basic RC algorithm in the kernel.
We would be designing a toy algorithm that every hardware could use, while
introducing a complex structure just so userspace can work with it.

Vendors would have to fit their models into an interface with limited
functions.
> possibly making use of hardware specific features without actually exposing it
> all to userspace. So assuming we do that:
> 
> Paul's concern is that for best result, an RC algorithm could use knowledge of
> keyframe placement to preserve bucket space (possibly using the last keyframe
> size as a hint). Exposing the GOP structure in some form allows "prediction", so
> the adaptation can look ahead at the future budget without introducing latency. There is
> an alternative, which is to require ahead of time queuing of encode requests.
That sounds like a fixed-bitrate RC. Would this RC algorithm then be in
charge of selecting the reference frames?
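
To make the bucket argument above concrete, here is a rough sketch of a
leaky-bucket budget that uses the GOP size as a hint and reserves room for
the next keyframe based on the size of the last one. This is only an
illustration under my own assumptions: struct rc_bucket, rc_frame_budget()
and rc_frame_done() are made-up names, not an existing V4L2 interface, and
the drain model is the simplest one I could think of.

```c
#include <stdint.h>

/* Hypothetical leaky-bucket state; none of these names exist in V4L2. */
struct rc_bucket {
	int64_t capacity_bits;   /* window size, e.g. bitrate * window time */
	int64_t fullness_bits;   /* bits currently sitting in the bucket */
	int64_t drain_per_frame; /* bitrate / framerate */
	unsigned int gop_size;   /* keyframe interval hint, 0 = no hint */
	unsigned int frame_idx;  /* position in the GOP, 0 = keyframe */
	int64_t last_key_bits;   /* size of the previous keyframe */
};

/* Bit budget for the next frame, keeping room for the coming keyframe. */
static int64_t rc_frame_budget(const struct rc_bucket *b)
{
	int64_t free_now = b->capacity_bits - b->fullness_bits;
	int64_t frames_left, spare, budget;

	/* Without a GOP hint, or on a keyframe, all free space is usable. */
	if (b->gop_size == 0 || b->frame_idx == 0)
		return free_now;

	/* Inter frames remaining before the next keyframe, this one included. */
	frames_left = (int64_t)b->gop_size - b->frame_idx;

	/* Space gained by draining, minus what the next keyframe should need. */
	spare = free_now + frames_left * b->drain_per_frame - b->last_key_bits;
	if (spare < 0)
		return 0;

	budget = spare / frames_left;
	return budget < free_now ? budget : free_now;
}

/* Account for an encoded frame and advance the GOP position. */
static void rc_frame_done(struct rc_bucket *b, int64_t coded_bits)
{
	if (b->frame_idx == 0)
		b->last_key_bits = coded_bits;

	b->fullness_bits += coded_bits - b->drain_per_frame;
	if (b->fullness_bits < 0)
		b->fullness_bits = 0;

	if (b->gop_size)
		b->frame_idx = (b->frame_idx + 1) % b->gop_size;
}
```

Note that nothing in this sketch picks reference frames or decides when a
keyframe happens; it only consumes the GOP size as a hint, which is the
part the question above is about.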

Take the Hantro H1, which people here are familiar with.
An intra frame usually costs the most hardware time to encode and
contributes a large share of the size of a GOP (at a fixed bitrate).

If we ignore the inter frames, the image quality will be bad.
One decision here is whether to use a previous intra frame as the
reference or just the last frame.
Userspace should be able to decide when to request an intra frame or
to re-encode the current inter frame as an intra frame.
> But this does introduce latency since the way it works in V4L2 today, we need
> the picture to be filled by the time we request an encode.
> 
> Though, if we drop the GOP structure and favour this approach, the latency could
> be regained later by introducing fence-based streaming. The technique would be for
> a video source (like a capture driver) to pass dmabuf that aren't filled yet,
> but have a companion fence. This would allow queuing requests ahead of time, and
> all we need is enough pre-allocation to accommodate the desired look ahead. Only
> issue is that perhaps this violates the fundamental of "short term" delivery of
> fences. But fences can also fail I think, in case the capture was stopped.
> 
I don't think it would help. A fence is a mechanism for DRM/GPU, which
has no queue.
Even with a fence, would the video sink tell us the motion delta here?
> We can certainly move forward with this as a future solution, or just don't
> implement future aware RC algorithm in term to avoid the huge task this involves
> (and possibly patents?)
> 
I think we should not restrict how userspace (the vendor) operates the
hardware.
> [...]
>>>
> 
>>> Of course, the subject is much more relevant when there are encoders with more
>>> than 1 reference. But you are correct, what the commands do is allow to change,
>>> add or remove any reference from the list (random modification), as long as they
>>> fit in the codec constraints (like the DPB size notably). This is the only way
>>> one can implement temporal SVC reference patterns, robust reference trees or RTP
>>> RPSI. Note that long term references also exist, and are less complex than these
>>> commands.
>>>
>>
>> If userspace could manage the lifetime of the reconstruction
>> buffers (assignment, reference), we wouldn't need a command here.
> 
> Sorry if I created confusion, the comment was something specific to H.264
> coding. It's a compressed form for the reference lists. This information is coded
> in the slice header and enabled through adaptive_ref_pic_marking_mode_flag
> 
> It was suggested so far to leave h264 slice headers writing to the driver. This
> is motivated by H264 slice header not being byte aligned in size, so the
H.264 and H.265 have byte_alignment() in the NALU. You don't need the
skip-bits feature that can be found in the H1.
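
For the HEVC case, the slice segment header ends with byte_alignment(),
which is a single one bit followed by zero bits up to the next byte
boundary. A minimal sketch of a userspace bit writer doing that is below.
struct bitwriter, bw_put_bit() and bw_byte_alignment() are names I made up
for illustration, not an existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-first bit writer for building a header in userspace. */
struct bitwriter {
	uint8_t *buf;
	size_t size;    /* capacity of buf in bytes */
	size_t bit_pos; /* number of bits written so far */
};

/* Append one bit; returns 0 on success, -1 if the buffer is full. */
static int bw_put_bit(struct bitwriter *bw, unsigned int bit)
{
	size_t byte = bw->bit_pos >> 3;

	if (byte >= bw->size)
		return -1;

	/* Clear each byte the first time it is touched. */
	if ((bw->bit_pos & 7) == 0)
		bw->buf[byte] = 0;

	if (bit)
		bw->buf[byte] |= (uint8_t)(0x80 >> (bw->bit_pos & 7));

	bw->bit_pos++;
	return 0;
}

/* HEVC byte_alignment(): a one bit, then zero bits until byte aligned. */
static int bw_byte_alignment(struct bitwriter *bw)
{
	if (bw_put_bit(bw, 1))
		return -1;

	while (bw->bit_pos & 7)
		if (bw_put_bit(bw, 0))
			return -1;

	return 0;
}
```

After bw_byte_alignment() the header length is a whole number of bytes, so
the slice data produced by the hardware can simply be appended after it.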

> slice_data() is hard to combine. Also, some hardware actually produces the
> slice_header. This needs actual hardware interface analysis, cause an H.264
> slice header is worth nothing if it cannot instruct the decoder how to maintain
> the desired reference state.
> 
I don't even think we should write the slice header into the CAPTURE
buffer, which would cause a cache problem. Usually the slice header
would be written only when the slice data is copied out.
It is much easier to let a userspace wrapper handle this.

> I think this aspect should probably not be generalized to all CODECs, since the
> packing semantic can largely differ. When the codec header is indeed byte
> aligned, it can easily be separated and combined by the application, improving
> application flexibility and reducing the kernel API complexity.
>>
>> It is just a problem of how to design another request API control
>> structure to select which buffers would be used for list0, list1.
>>> I think this raises a big question, and I never checked how this worked with let's say
>>> VA. Shall we let the driver resolve the changes into commands (VP8 has
>>> something similar, while VP9 and AV1 are refresh flags, which are just trivial
>>> to compute). I believe I'll have to investigate this further.
>>>
>>>>>
>>> [...]
> 
> regards,
> Nicolas

-- 
Hsia-Jun(Randy) Li