linux-media.vger.kernel.org archive mirror
* Stateless Encoding uAPI Discussion and Proposal
@ 2023-07-11 17:12 Paul Kocialkowski
  2023-07-11 18:18 ` Nicolas Dufresne
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Paul Kocialkowski @ 2023-07-11 17:12 UTC (permalink / raw)
  To: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Nicolas Dufresne, Andrzej Pietrasiewicz, Michael Tretter
  Cc: Jernej Škrabec, Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni


Hi everyone!

After various discussions following Andrzej's talk at EOSS, feedback from the
Media Summit (which I could not attend unfortunately) and various direct
discussions, I have compiled some thoughts and ideas about stateless encoders
support with various proposals. This is the result of a few years of interest
in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
which turned out to have numerous design issues.

I am now working on a H.264 encoder driver for Allwinner platforms (currently
focusing on the V3/V3s), which already provides some usable bitstream and will
be published soon.

This is a very long email where I've tried to split things into distinct topics
and explain a few concepts to make sure everyone is on the same page.

# Bitstream Headers

Stateless encoders typically do not generate all the bitstream headers, and
sometimes generate no headers at all (e.g. the Allwinner encoder does not even
produce slice headers). There's often some hardware block that makes bit-level
writing to the destination buffer easier (dealing with alignment, etc).

The values of the bitstream headers must be in line with how the compressed
data bitstream is generated and generally follow the codec specification.
Some encoders might allow configuring all the fields found in the headers,
others may only allow configuring a few or have specific constraints regarding
which values are allowed.

As a result, we cannot expect that any given encoder is able to produce frames
for any set of headers. Reporting related constraints and limitations (beyond
profile/level) seems quite difficult and error-prone.

So it seems that keeping header generation in-kernel only (close to where the
hardware is actually configured) is the safest approach.
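As an illustration, the kind of bit-level writing helper mentioned above can be modeled in software. This is a minimal sketch (all names hypothetical, not an existing kernel API): a writer that accumulates bits MSB-first into the destination buffer and handles byte alignment, as header generation code would need.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical bit-level writer, similar in spirit to the hardware
 * blocks mentioned above: writes bits MSB-first and deals with byte
 * alignment in the destination buffer. */
struct bitwriter {
	uint8_t *buf;
	size_t byte_pos;
	unsigned int bit_pos;	/* bits already used in the current byte */
};

static void bitwriter_put(struct bitwriter *bw, uint32_t value,
			  unsigned int num_bits)
{
	while (num_bits--) {
		uint8_t bit = (value >> num_bits) & 1;

		if (bw->bit_pos == 0)
			bw->buf[bw->byte_pos] = 0;
		bw->buf[bw->byte_pos] |= bit << (7 - bw->bit_pos);

		if (++bw->bit_pos == 8) {
			bw->bit_pos = 0;
			bw->byte_pos++;
		}
	}
}

/* Pad with zero bits up to the next byte boundary. */
static void bitwriter_align(struct bitwriter *bw)
{
	if (bw->bit_pos)
		bitwriter_put(bw, 0, 8 - bw->bit_pos);
}
```

A real implementation would also need emulation-prevention byte insertion for codecs like H.264, which is omitted here.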

# Codec Features

Codecs have many variable features that can be enabled or not and specific
configuration fields that can take various values. There is usually some
top-level indication of profile/level that restricts what can be used.

This is a very similar situation to stateful encoding, where codec-specific
controls are used to report and set profile/level and configure these aspects.
A particularly nice thing about it is that we can reuse these existing controls
and add new ones in the future for features that are not yet covered.

This approach feels more flexible than designing new structures with a selected
set of parameters (that could match the existing controls) for each codec.

# Reference and Reconstruction Management

With stateless encoding, we need to tell the hardware which frames need to be
used as references for encoding the current frame and make sure that these
references are available as decoded frames in memory.

Regardless of references, stateless encoders typically need some memory space to
write the decoded (known as reconstructed) frame while it's being encoded.

One question here is how many slots for decoded pictures should be allocated
by the driver when starting to stream. There is usually a maximum number of
reference frames that can be used at a time, although perhaps there is a use
case for keeping more around and alternating between them for future references.

Another question is how the driver should keep track of which frame will be used
as a reference in the future and which one can be evicted from the pool of
decoded pictures if it's not going to be used anymore.

A restrictive approach would be to let the driver alone manage that, similarly
to how stateful encoders behave. However it might provide extra flexibility
(and memory gain) to allow userspace to configure the maximum number of possible
reference frames. In that case it becomes necessary to indicate if a given
frame will be used as a reference in the future (maybe using a buffer flag)
and to indicate which previous reference frames (probably to be identified with
the matching output buffer's timestamp) should be used for the current encode.
This could be done with a new dedicated control (as a variable-sized array of
timestamps). Note that userspace would have to update it for every frame or the
reference frames will remain the same for future encodes.

The driver will then make sure to keep the reconstructed buffer around, in one
of the slots. When there's no slot left, the driver will drop the oldest
reference it has (maybe with a bounce buffer to still allow it to be used as a
reference for the current encode).

With this behavior defined in the uAPI spec, userspace will also be able to
keep track of which previous frame is no longer allowed as a reference.
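To make the proposed eviction rule concrete, here is a small model of the driver-side bookkeeping (all names and the slot count are hypothetical; timestamps stand in for the output buffer timestamps used to identify references, with 0 used as a "nothing evicted" sentinel since V4L2 buffer timestamps are non-zero in practice):

```c
#include <stdint.h>
#include <assert.h>

#define NUM_SLOTS 4	/* hypothetical number of reconstruction slots */

/* Hypothetical model of driver-side slot management: each slot holds a
 * reconstructed frame identified by its output buffer timestamp. When
 * no slot is left, the oldest reference is dropped, matching the
 * behavior proposed above. */
struct slot_pool {
	uint64_t timestamps[NUM_SLOTS];
	unsigned int count;
};

/* Returns the timestamp of the evicted reference, or 0 if none. */
static uint64_t slot_pool_add(struct slot_pool *pool, uint64_t timestamp)
{
	uint64_t evicted = 0;

	if (pool->count == NUM_SLOTS) {
		unsigned int oldest = 0, i;

		for (i = 1; i < NUM_SLOTS; i++)
			if (pool->timestamps[i] < pool->timestamps[oldest])
				oldest = i;

		evicted = pool->timestamps[oldest];
		pool->timestamps[oldest] = timestamp;
	} else {
		pool->timestamps[pool->count++] = timestamp;
	}

	return evicted;
}
```

With this behavior in the uAPI spec, userspace can run the same logic to know which timestamp just became invalid as a reference.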

# Frame Types

Stateless encoder drivers will typically instruct the hardware to encode either
an intra-coded or an inter-coded frame. While a stream composed only of a single
intra-coded frame followed by only inter-coded frames is possible, it's
generally not desirable as it is not very robust against data loss and makes
seeking difficult.

As a result, the frame type is usually decided based on a given GOP size
(the frequency at which a new intra-coded frame is produced) while intra-coded
frames can also be explicitly requested on demand. Stateful encoders implement
these through dedicated controls:
- V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
- V4L2_CID_MPEG_VIDEO_GOP_SIZE
- V4L2_CID_MPEG_VIDEO_H264_I_PERIOD

It seems that reusing them would be possible, which would let the driver decide
on the particular frame type.
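A driver-side frame type decision based on these controls could look like the following sketch (hypothetical model, not existing driver code), where a forced key frame simply restarts the GOP from that point:

```c
#include <stdbool.h>
#include <assert.h>

enum frame_type { FRAME_TYPE_INTRA, FRAME_TYPE_INTER };

/* Hypothetical model of driver-side frame type selection from the
 * stateful GOP controls: a new intra-coded frame every gop_size frames,
 * plus forced key frames (as with V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME)
 * which restart the GOP from that point. */
struct gop_state {
	unsigned int gop_size;
	unsigned int index;	/* position within the current GOP */
};

static enum frame_type gop_next_frame(struct gop_state *gop, bool force_key)
{
	if (force_key || gop->index == 0 || gop->index >= gop->gop_size) {
		gop->index = 1;
		return FRAME_TYPE_INTRA;
	}

	gop->index++;
	return FRAME_TYPE_INTER;
}
```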

However it makes the reference frame management a bit trickier since reference
frames might be requested from userspace for a frame that ends up being
intra-coded. We can either allow this and silently ignore the info or expect
that userspace keeps track of the GOP index and not send references on the first
frame.

In some codecs, there's also a notion of barrier key-frames (IDR frames in
H.264) that strictly forbid using any past reference beyond the frame.
There seems to be an assumption that the GOP start uses this kind of frame
(and not any intra-coded frame), while the force key frame control does not
particularly specify it.

In that case we should flush the list of references and userspace should no
longer provide references to them for future frames. This puts a requirement on
userspace to keep track of GOP start in order to know when to flush its
reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
could also indicate a general intra-coded frame that is not a barrier.

So another possibility would be for userspace to explicitly indicate which
frame type to use (in a codec-specific way) and act accordingly, leaving any
notion of GOP up to userspace. I feel like this might be the easiest approach
while giving an extra degree of control to userspace.

# Rate Control

Another important feature of encoders is the ability to control the amount of
data produced following different rate control strategies. Stateful encoders
typically do this in-firmware and expose controls for selecting the strategy
and associated targets.

It seems desirable to expose both automatic and manual rate-control to
userspace.

Automatic control would be implemented kernel-side (with algos possibly shared
across drivers) and reuse existing stateful controls. The advantage is
simplicity (userspace does not need to carry its own rate-control
implementation) and to ensure that there is a built-in mechanism for common
strategies available for every driver (no mandatory dependency on a proprietary
userspace stack). There may also be extra statistics or controls available to
the driver that allow finer-grain control.

Manual control allows userspace to get creative and requires the ability to set
the quantization parameter (QP) directly for each frame (controls for this
already exist, as many stateful encoders also support it).

# Regions of Interest

Regions of interest (ROIs) allow specifying sub-regions of the frame that should
be prioritized for quality. Stateless encoders typically support a limited
number and allow setting specific QP values for these regions.

While the QP value should be used directly in manual rate-control, we probably
want to have some "level of importance" setting for kernel-side rate-control,
along with the dimensions/position of each ROI. This could be expressed with
a new structure containing all these elements and presented as a variable-sized
array control with as many elements as the hardware can support.
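A possible layout for such a control element, combining both proposals, might be (entirely hypothetical, nothing like this exists in the uAPI yet):

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical element of a variable-sized ROI array control: the
 * region rectangle, a QP delta for manual rate-control and a priority
 * ("level of importance") hint for kernel-side rate-control. */
struct enc_roi {
	uint32_t left;
	uint32_t top;
	uint32_t width;
	uint32_t height;
	int32_t qp_delta;	/* used with manual rate-control */
	uint32_t priority;	/* hint for kernel-side rate-control */
};

/* Basic validation a driver would do: non-empty region that fits
 * within the frame (64-bit sums avoid overflow on large values). */
static int enc_roi_valid(const struct enc_roi *roi,
			 uint32_t frame_width, uint32_t frame_height)
{
	return roi->width && roi->height &&
	       (uint64_t)roi->left + roi->width <= frame_width &&
	       (uint64_t)roi->top + roi->height <= frame_height;
}
```

The maximum number of array elements reported for the control would then naturally express how many ROIs the hardware supports.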

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com



* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-11 17:12 Stateless Encoding uAPI Discussion and Proposal Paul Kocialkowski
@ 2023-07-11 18:18 ` Nicolas Dufresne
  2023-07-12 14:07   ` Paul Kocialkowski
  2023-07-26  8:18   ` Hans Verkuil
  2023-07-21 18:19 ` Michael Grzeschik
  2023-08-10 13:44 ` Paul Kocialkowski
  2 siblings, 2 replies; 29+ messages in thread
From: Nicolas Dufresne @ 2023-07-11 18:18 UTC (permalink / raw)
  To: Paul Kocialkowski, linux-kernel, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter
  Cc: Jernej Škrabec, Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
> Hi everyone!
> 
> After various discussions following Andrzej's talk at EOSS, feedback from the
> Media Summit (which I could not attend unfortunately) and various direct
> discussions, I have compiled some thoughts and ideas about stateless encoders
> support with various proposals. This is the result of a few years of interest
> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
> which turned out to have numerous design issues.
> 
> I am now working on a H.264 encoder driver for Allwinner platforms (currently
> focusing on the V3/V3s), which already provides some usable bitstream and will
> be published soon.
> 
> This is a very long email where I've tried to split things into distinct topics
> and explain a few concepts to make sure everyone is on the same page.
> 
> # Bitstream Headers
> 
> Stateless encoders typically do not generate all the bitstream headers, and
> sometimes generate no headers at all (e.g. the Allwinner encoder does not even
> produce slice headers). There's often some hardware block that makes bit-level
> writing to the destination buffer easier (dealing with alignment, etc).
> 
> The values of the bitstream headers must be in line with how the compressed
> data bitstream is generated and generally follow the codec specification.
> Some encoders might allow configuring all the fields found in the headers,
> others may only allow configuring a few or have specific constraints regarding
> which values are allowed.
> 
> As a result, we cannot expect that any given encoder is able to produce frames
> for any set of headers. Reporting related constraints and limitations (beyond
> profile/level) seems quite difficult and error-prone.
> 
> So it seems that keeping header generation in-kernel only (close to where the
> hardware is actually configured) is the safest approach.

This seems to match what happened with the Hantro VP8 proof of concept. The
encoder does not produce the frame header, and it also produces 2 encoded
buffers which cannot be made contiguous at the hardware level. This notion of
planes in coded data wasn't something that blended well with the rest of the
API, and we didn't want to copy in the kernel while userspace would also be
forced to copy to align the headers. Our conclusion was that it was best to
generate the headers and copy both segments before delivering to userspace. I
suspect this type of situation will be quite common.

> 
> # Codec Features
> 
> Codecs have many variable features that can be enabled or not and specific
> configuration fields that can take various values. There is usually some
> top-level indication of profile/level that restricts what can be used.
> 
> This is a very similar situation to stateful encoding, where codec-specific
> controls are used to report and set profile/level and configure these aspects.
> A particularly nice thing about it is that we can reuse these existing controls
> and add new ones in the future for features that are not yet covered.
> 
> This approach feels more flexible than designing new structures with a selected
> set of parameters (that could match the existing controls) for each codec.

Though, reading further into this email, we still have a fair amount of controls
to design and add, probably some compound controls too?

> 
> # Reference and Reconstruction Management
> 
> With stateless encoding, we need to tell the hardware which frames need to be
> used as references for encoding the current frame and make sure that these
> references are available as decoded frames in memory.
> 
> Regardless of references, stateless encoders typically need some memory space to
> write the decoded (known as reconstructed) frame while it's being encoded.
> 
> One question here is how many slots for decoded pictures should be allocated
> by the driver when starting to stream. There is usually a maximum number of
> reference frames that can be used at a time, although perhaps there is a use
> case for keeping more around and alternating between them for future references.
> 
> Another question is how the driver should keep track of which frame will be used
> as a reference in the future and which one can be evicted from the pool of
> decoded pictures if it's not going to be used anymore.
> 
> A restrictive approach would be to let the driver alone manage that, similarly
> to how stateful encoders behave. However it might provide extra flexibility
> (and memory gain) to allow userspace to configure the maximum number of possible
> reference frames. In that case it becomes necessary to indicate if a given
> frame will be used as a reference in the future (maybe using a buffer flag)
> and to indicate which previous reference frames (probably to be identified with
> the matching output buffer's timestamp) should be used for the current encode.
> This could be done with a new dedicated control (as a variable-sized array of
> timestamps). Note that userspace would have to update it for every frame or the
> reference frames will remain the same for future encodes.
> 
> The driver will then make sure to keep the reconstructed buffer around, in one
> of the slots. When there's no slot left, the driver will drop the oldest
> reference it has (maybe with a bounce buffer to still allow it to be used as a
> reference for the current encode).
> 
> With this behavior defined in the uAPI spec, userspace will also be able to
> keep track of which previous frame is no longer allowed as a reference.

If we want, we could mirror the stateless decoders here. During decoding, we
pass a "dpb" or a reference list, which represents all the active references.
These do not have to be used by the current frame, but the driver is allowed to
use this list to clean up and free unused memory (or reuse it in case it has a
fixed slot model, like mtk vcodec).

On top of this, we add a list of references to be used for producing the current
frame. Usually, the picture references are indices into the dpb/reference list
of timestamps. This makes validation easier. We'll have to define how many
references can be used, I think, since unlike decoders, encoders don't have to
fully implement levels and profiles.
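A sketch of this two-level scheme (all names and sizes hypothetical): a global reference list of timestamps, with the current frame's references expressed as indices into it, so that validation reduces to an index check.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define MAX_REFERENCES 16	/* hypothetical list capacity */

/* Hypothetical global reference list, mirroring the decoder DPB:
 * entries are output buffer timestamps of kept reconstructed frames. */
struct ref_list {
	uint64_t timestamps[MAX_REFERENCES];
	size_t count;
};

/* Per-frame references are indices into the list; validation only has
 * to check that each index points at an existing entry. */
static int frame_refs_valid(const struct ref_list *list,
			    const size_t *indices, size_t num_indices)
{
	size_t i;

	for (i = 0; i < num_indices; i++)
		if (indices[i] >= list->count)
			return 0;

	return 1;
}
```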

> 
> # Frame Types
> 
> Stateless encoder drivers will typically instruct the hardware to encode either
> an intra-coded or an inter-coded frame. While a stream composed only of a single
> intra-coded frame followed by only inter-coded frames is possible, it's
> generally not desirable as it is not very robust against data loss and makes
> seeking difficult.

Let's avoid this generalization in our document and design. In RTP streaming,
like WebRTC or SIP, it is desirable to use an open GOP (with nothing but P
frames all the time, except the very first one). The FORCE_KEY_FRAME is meant to
allow handling RTP PLI (and other similar feedback). It's quite rare that an
application would mix a closed GOP and FORCE_KEY_FRAME, but it's allowed.
What I've seen the most is that FORCE_KEY_FRAME just starts a new GOP,
following the size and period from this new point.

> 
> As a result, the frame type is usually decided based on a given GOP size
> (the frequency at which a new intra-coded frame is produced) while intra-coded
> frames can also be explicitly requested on demand. Stateful encoders implement
> these through dedicated controls:
> - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> 
> It seems that reusing them would be possible, which would let the driver decide
> on the particular frame type.
> 
> However it makes the reference frame management a bit trickier since reference
> frames might be requested from userspace for a frame that ends up being
> intra-coded. We can either allow this and silently ignore the info or expect
> that userspace keeps track of the GOP index and not send references on the first
> frame.
> 
> In some codecs, there's also a notion of barrier key-frames (IDR frames in
> H.264) that strictly forbid using any past reference beyond the frame.
> There seems to be an assumption that the GOP start uses this kind of frame
> (and not any intra-coded frame), while the force key frame control does not
> particularly specify it.
> 
> In that case we should flush the list of references and userspace should no
> longer provide references to them for future frames. This puts a requirement on
> userspace to keep track of GOP start in order to know when to flush its
> reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> could also indicate a general intra-coded frame that is not a barrier.
> 
> So another possibility would be for userspace to explicitly indicate which
> frame type to use (in a codec-specific way) and act accordingly, leaving any
> notion of GOP up to userspace. I feel like this might be the easiest approach
> while giving an extra degree of control to userspace.

I also lean toward this approach ...

> 
> # Rate Control
> 
> Another important feature of encoders is the ability to control the amount of
> data produced following different rate control strategies. Stateful encoders
> typically do this in-firmware and expose controls for selecting the strategy
> and associated targets.
> 
> It seems desirable to expose both automatic and manual rate-control to
> userspace.
> 
> Automatic control would be implemented kernel-side (with algos possibly shared
> across drivers) and reuse existing stateful controls. The advantage is
> simplicity (userspace does not need to carry its own rate-control
> implementation) and to ensure that there is a built-in mechanism for common
> strategies available for every driver (no mandatory dependency on a proprietary
> userspace stack). There may also be extra statistics or controls available to
> the driver that allow finer-grain control.

Though not controlling the GOP (or having no GOP) might require a bit more work
on the driver side. Today, we do have queues of requests, queues of buffers,
etc. But it is still quite difficult to look ahead in these queues. That is
only useful if the rate control algorithm can use future frame types (like
keyframes) to make decisions. That could be me pushing too far here though.

> 
> Manual control allows userspace to get creative and requires the ability to set
> the quantization parameter (QP) directly for each frame (controls for this
> already exist, as many stateful encoders also support it).
> 
> # Regions of Interest
> 
> Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> be prioritized for quality. Stateless encoders typically support a limited
> number and allow setting specific QP values for these regions.
> 
> While the QP value should be used directly in manual rate-control, we probably
> want to have some "level of importance" setting for kernel-side rate-control,
> along with the dimensions/position of each ROI. This could be expressed with
> a new structure containing all these elements and presented as a variable-sized
> array control with as many elements as the hardware can support.

Do you see any difference in ROI for stateful and stateless? This looks like a
feature we could combine. Also, ROIs exist for cameras too; I'd probably try to
keep them separate though.

This is a very good overview of the hard work ahead of us. Looking forward to
this journey and your Allwinner driver.

regards,
Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-11 18:18 ` Nicolas Dufresne
@ 2023-07-12 14:07   ` Paul Kocialkowski
  2023-07-25  3:33     ` Hsia-Jun Li
  2023-07-26  8:18   ` Hans Verkuil
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-07-12 14:07 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni


Hi Nicolas,

Thanks for the quick reply!

On Tue 11 Jul 23, 14:18, Nicolas Dufresne wrote:
> Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
> > Hi everyone!
> > 
> > After various discussions following Andrzej's talk at EOSS, feedback from the
> > Media Summit (which I could not attend unfortunately) and various direct
> > discussions, I have compiled some thoughts and ideas about stateless encoders
> > support with various proposals. This is the result of a few years of interest
> > in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
> > which turned out to have numerous design issues.
> > 
> > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > focusing on the V3/V3s), which already provides some usable bitstream and will
> > be published soon.
> > 
> > This is a very long email where I've tried to split things into distinct topics
> > and explain a few concepts to make sure everyone is on the same page.
> > 
> > # Bitstream Headers
> > 
> > Stateless encoders typically do not generate all the bitstream headers, and
> > sometimes generate no headers at all (e.g. the Allwinner encoder does not even
> > produce slice headers). There's often some hardware block that makes bit-level
> > writing to the destination buffer easier (dealing with alignment, etc).
> > 
> > The values of the bitstream headers must be in line with how the compressed
> > data bitstream is generated and generally follow the codec specification.
> > Some encoders might allow configuring all the fields found in the headers,
> > others may only allow configuring a few or have specific constraints regarding
> > which values are allowed.
> > 
> > As a result, we cannot expect that any given encoder is able to produce frames
> > for any set of headers. Reporting related constraints and limitations (beyond
> > profile/level) seems quite difficult and error-prone.
> > 
> > So it seems that keeping header generation in-kernel only (close to where the
> > hardware is actually configured) is the safest approach.
> 
> This seems to match what happened with the Hantro VP8 proof of concept. The
> encoder does not produce the frame header, and it also produces 2 encoded
> buffers which cannot be made contiguous at the hardware level. This notion of
> planes in coded data wasn't something that blended well with the rest of the
> API, and we didn't want to copy in the kernel while userspace would also be
> forced to copy to align the headers. Our conclusion was that it was best to
> generate the headers and copy both segments before delivering to userspace. I
> suspect this type of situation will be quite common.

Makes sense! I guess the same will need to be done for Hantro H1 H.264 encoding
(in my PoC the software-generated headers were crafted in userspace and didn't
have to be part of the same buffer as the coded data).

> > 
> > # Codec Features
> > 
> > Codecs have many variable features that can be enabled or not and specific
> > configuration fields that can take various values. There is usually some
> > top-level indication of profile/level that restricts what can be used.
> > 
> > This is a very similar situation to stateful encoding, where codec-specific
> > controls are used to report and set profile/level and configure these aspects.
> > A particularly nice thing about it is that we can reuse these existing controls
> > and add new ones in the future for features that are not yet covered.
> > 
> > This approach feels more flexible than designing new structures with a selected
> > set of parameters (that could match the existing controls) for each codec.
> 
> Though, reading further into this email, we still have a fair amount of controls
> to design and add, probably some compound controls too?

Yeah definitely. My point here is merely that we should reuse existing controls
for general codec features, but I don't think we'll get around introducing new
ones for stateless-specific parts.

> > 
> > # Reference and Reconstruction Management
> > 
> > With stateless encoding, we need to tell the hardware which frames need to be
> > used as references for encoding the current frame and make sure that these
> > references are available as decoded frames in memory.
> > 
> > Regardless of references, stateless encoders typically need some memory space to
> > write the decoded (known as reconstructed) frame while it's being encoded.
> > 
> > One question here is how many slots for decoded pictures should be allocated
> > by the driver when starting to stream. There is usually a maximum number of
> > reference frames that can be used at a time, although perhaps there is a use
> > case for keeping more around and alternating between them for future references.
> > 
> > Another question is how the driver should keep track of which frame will be used
> > as a reference in the future and which one can be evicted from the pool of
> > decoded pictures if it's not going to be used anymore.
> > 
> > A restrictive approach would be to let the driver alone manage that, similarly
> > to how stateful encoders behave. However it might provide extra flexibility
> > (and memory gain) to allow userspace to configure the maximum number of possible
> > reference frames. In that case it becomes necessary to indicate if a given
> > frame will be used as a reference in the future (maybe using a buffer flag)
> > and to indicate which previous reference frames (probably to be identified with
> > the matching output buffer's timestamp) should be used for the current encode.
> > This could be done with a new dedicated control (as a variable-sized array of
> > timestamps). Note that userspace would have to update it for every frame or the
> > reference frames will remain the same for future encodes.
> > 
> > The driver will then make sure to keep the reconstructed buffer around, in one
> > of the slots. When there's no slot left, the driver will drop the oldest
> > reference it has (maybe with a bounce buffer to still allow it to be used as a
> > reference for the current encode).
> > 
> > With this behavior defined in the uAPI spec, userspace will also be able to
> > keep track of which previous frame is no longer allowed as a reference.
> 
> If we want, we could mirror the stateless decoders here. During decoding, we
> pass a "dpb" or a reference list, which represents all the active references.
> These do not have to be used by the current frame, but the driver is allowed to
> use this list to clean up and free unused memory (or reuse it in case it has a
> fixed slot model, like mtk vcodec).
> 
> On top of this, we add a list of references to be used for producing the current
> frame. Usually, the picture references are indices into the dpb/reference list
> of timestamps. This makes validation easier. We'll have to define how many
> references can be used, I think, since unlike decoders, encoders don't have to
> fully implement levels and profiles.

So that would be a very explicit description instead of expecting drivers to
do the maintenance and userspace to figure out which frame was evicted from
the list. So yeah, this feels more robust!

Regarding the number of reference frames, I think we need to specify both
how many references can be used at a time (number of hardware slots) and how
many total references can be in the reference list (number of rec buffers to
keep around).

We could also decide that making the current frame part of the global reference
list is a way to indicate that its reconstruction buffer must be kept around,
or we could have a separate way to indicate that. I lean towards the former
since it would put all reference-related things in one place and avoid coming
up with a new buffer flag or such.

Also we would probably still need to do some validation driver-side to make
sure that userspace doesn't put references in the list that were not marked
as such when encoded (and for which the reconstruction buffer may have been
recycled already).

> > 
> > # Frame Types
> > 
> > Stateless encoder drivers will typically instruct the hardware to encode either
> > an intra-coded or an inter-coded frame. While a stream composed only of a single
> > intra-coded frame followed by only inter-coded frames is possible, it's
> > generally not desirable as it is not very robust against data loss and makes
> > seeking difficult.
> 
> Let's avoid this generalization in our document and design. In RTP streaming,
> like WebRTC or SIP, it is desirable to use an open GOP (with nothing but P
> frames all the time, except the very first one). The FORCE_KEY_FRAME is meant to
> allow handling RTP PLI (and other similar feedback). It's quite rare that an
> application would mix a closed GOP and FORCE_KEY_FRAME, but it's allowed.
> What I've seen the most is that FORCE_KEY_FRAME just starts a new GOP,
> following the size and period from this new point.

Okay fair enough, thanks for the details!

> > 
> > As a result, the frame type is usually decided based on a given GOP size
> > (the frequency at which a new intra-coded frame is produced) while intra-coded
> > frames can also be explicitly requested on demand. Stateful encoders implement
> > these through dedicated controls:
> > - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> > - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> > - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> > 
> > It seems that reusing them would be possible, which would let the driver decide
> > on the particular frame type.
> > 
> > However it makes the reference frame management a bit trickier since reference
> > frames might be requested from userspace for a frame that ends up being
> > intra-coded. We can either allow this and silently ignore the info or expect
> > that userspace keeps track of the GOP index and not send references on the first
> > frame.
> > 
> > In some codecs, there's also a notion of barrier key-frames (IDR frames in
> > H.264) that strictly forbid using any past reference beyond the frame.
> > There seems to be an assumption that the GOP start uses this kind of frame
> > (and not any intra-coded frame), while the force key frame control does not
> > particularly specify it.
> > 
> > In that case we should flush the list of references and userspace should no
> > longer provide references to them for future frames. This puts a requirement on
> > userspace to keep track of GOP start in order to know when to flush its
> > reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> > could also indicate a general intra-coded frame that is not a barrier.
> > 
> > So another possibility would be for userspace to explicitly indicate which
> > frame type to use (in a codec-specific way) and act accordingly, leaving any
> > notion of GOP up to userspace. I feel like this might be the easiest approach
> > while giving an extra degree of control to userspace.
> 
> I also lean toward this approach ...
> 
> > 
> > # Rate Control
> > 
> > Another important feature of encoders is the ability to control the amount of
> > data produced following different rate control strategies. Stateful encoders
> > typically do this in-firmware and expose controls for selecting the strategy
> > and associated targets.
> > 
> > It seems desirable to support both automatic and manual rate-control to
> > userspace.
> > 
> > Automatic control would be implemented kernel-side (with algos possibly shared
> > across drivers) and reuse existing stateful controls. The advantage is
> > simplicity (userspace does not need to carry its own rate-control
> > implementation) and to ensure that there is a built-in mechanism for common
> > strategies available for every driver (no mandatory dependency on a proprietary
> > userspace stack). There may also be extra statistics or controls available to
> > the driver that allow finer-grain control.
> 
> Though not controlling the GOP (or having no GOP) might require a bit more work
> on the driver side. Today, we do have queues of requests, queues of buffers etc.
> But it is still quite difficult to look ahead in these queues. That is only
> useful if the rate control algorithm can use future frame types (like keyframes)
> to make decisions. That could be me pushing too far here though.

Yes, I agree the interaction between userspace GOP control and kernel-side
rate-control might be quite tricky without any indication of what the next frame
types will be.

Maybe we could only allow explicit frame type configuration when using manual
rate-control and have kernel-side GOP management when in-kernel rc is used
(and we can allow it with manual rate-control too). I like having this option
because it allows for simple userspace implementations.

Note that this could perhaps also be added as an optional feature
for stateful encoders since some of them seem to be able to instruct the
firmware what frame type to use (in addition to directly controlling QP).
There's also a good chance that this feature is not available when using
a firmware-backed rc algorithm.
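To make the in-kernel GOP management idea more concrete, the driver-side
decision could boil down to a small helper like this (just a sketch: the enum,
struct and function names are made up, not existing code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical frame types an encoder driver would pick from. */
enum frame_type {
	FRAME_TYPE_IDR,
	FRAME_TYPE_P,
};

struct gop_state {
	unsigned int gop_size;	/* from V4L2_CID_MPEG_VIDEO_GOP_SIZE */
	unsigned int gop_index;	/* position within the current GOP */
};

/*
 * Decide the type of the next frame: start a new GOP on the very first
 * frame, when the configured GOP size is reached, or when userspace set
 * V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME. A forced key frame simply starts
 * a new GOP from that point.
 */
static enum frame_type gop_next_frame_type(struct gop_state *gop,
					   bool force_key_frame)
{
	if (force_key_frame || gop->gop_index == 0 ||
	    (gop->gop_size && gop->gop_index >= gop->gop_size)) {
		gop->gop_index = 1;
		return FRAME_TYPE_IDR;
	}

	gop->gop_index++;
	return FRAME_TYPE_P;
}
```

With gop_size set to 3 this produces IDR, P, P, IDR, P, P, ... and a
FORCE_KEY_FRAME request restarts the GOP at that position, matching the
behavior described above.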

> > 
> > Manual control allows userspace to get creative and requires the ability to set
> > the quantization parameter (QP) directly for each frame (controls already exist
> > since many stateful encoders also support this).
> > 
> > # Regions of Interest
> > 
> > Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> > be prioritized for quality. Stateless encoders typically support a limited
> > number and allow setting specific QP values for these regions.
> > 
> > While the QP value should be used directly in manual rate-control, we probably
> > want to have some "level of importance" setting for kernel-side rate-control,
> > along with the dimensions/position of each ROI. This could be expressed with
> > a new structure containing all these elements and presented as a variable-sized
> > array control with as many elements as the hardware can support.
> 
> Do you see any difference in ROI for stateful and stateless? This looks like a
> feature we could combine. Also, ROIs exist for cameras too; I'd probably try and
> keep them separate though.

I feel like the stateful/stateless behavior should be the same, so that could be
a shared control too. Also we could use a QP delta which would apply to both
manual and in-kernel rate-control, but maybe that's too low-level in the latter
case (it's not very obvious what a relevant delta would be when userspace has no
idea of the current frame-wide QP value).
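For the sake of discussion, such a ROI array control element could look like
this (purely a sketch: neither the structure nor the helper exists in the uAPI,
and all the names are made up):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical element of a variable-sized ROI array control.
 * With manual rate-control, qp_delta would apply directly; with
 * in-kernel rate-control, the priority field would weigh the region
 * instead, since userspace has no view of the frame-wide QP.
 */
struct v4l2_ctrl_roi_rect {
	uint32_t left;
	uint32_t top;
	uint32_t width;
	uint32_t height;
	int32_t qp_delta;	/* signed offset from the frame-wide QP */
	uint32_t priority;	/* "level of importance" for in-kernel RC */
};

/* Clamp the effective per-region QP to the codec's valid range. */
static int32_t roi_effective_qp(int32_t frame_qp,
				const struct v4l2_ctrl_roi_rect *roi,
				int32_t qp_min, int32_t qp_max)
{
	int32_t qp = frame_qp + roi->qp_delta;

	if (qp < qp_min)
		qp = qp_min;
	if (qp > qp_max)
		qp = qp_max;

	return qp;
}
```

The clamping also illustrates the concern above: with in-kernel rate-control
the driver would silently saturate deltas that userspace cannot reason about.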

> This is a very good overview of the hard work ahead of us. Looking forward to
> this journey and your Allwinner driver.

Thanks a lot for your input!

Honestly I was expecting that it would be more difficult than decoding, but it
turns out it might not be the case.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-11 17:12 Stateless Encoding uAPI Discussion and Proposal Paul Kocialkowski
  2023-07-11 18:18 ` Nicolas Dufresne
@ 2023-07-21 18:19 ` Michael Grzeschik
  2023-07-24 14:03   ` Nicolas Dufresne
  2023-08-10 13:44 ` Paul Kocialkowski
  2 siblings, 1 reply; 29+ messages in thread
From: Michael Grzeschik @ 2023-07-21 18:19 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Nicolas Dufresne, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 4293 bytes --]

Hi everyone!

Just to let you know: I have just pushed a branch that includes some first
steps to make the stateless H.264 encoder work in GStreamer. The work is
based on the VP8 stateless encoder patches Benjamin Gaignard created.

https://gitlab.freedesktop.org/mgrzeschik/gstreamer/-/commits/1.22/topic/h264-stateless-encoder

The codec this is used with is the rkvenc that can be found on the Rockchip
RK3568. I will send an RFC driver for it in the coming weeks, after my vacation.

On Tue, Jul 11, 2023 at 07:12:41PM +0200, Paul Kocialkowski wrote:
>After various discussions following Andrzej's talk at EOSS, feedback from the
>Media Summit (which I could not attend unfortunately) and various direct
>discussions, I have compiled some thoughts and ideas about stateless encoders
>support with various proposals. This is the result of a few years of interest
>in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
>which turned out to have numerous design issues.
>
>I am now working on a H.264 encoder driver for Allwinner platforms (currently
>focusing on the V3/V3s), which already provides some usable bitstream and will
>be published soon.
>
>This is a very long email where I've tried to split things into distinct topics
>and explain a few concepts to make sure everyone is on the same page.
>
># Bitstream Headers
>
>Stateless encoders typically do not generate all the bitstream headers and
>sometimes no header at all (e.g. Allwinner encoder does not even produce slice
>headers). There's often some hardware block that makes bit-level writing to the
>destination buffer easier (deals with alignment, etc).
>
>The values of the bitstream headers must be in line with how the compressed
>data bitstream is generated and generally follow the codec specification.
>Some encoders might allow configuring all the fields found in the headers,
>others may only allow configuring a few or have specific constraints regarding
>which values are allowed.
>
>As a result, we cannot expect that any given encoder is able to produce frames
>for any set of headers. Reporting related constraints and limitations (beyond
>profile/level) seems quite difficult and error-prone.
>
>So it seems that keeping header generation in-kernel only (close to where the
>hardware is actually configured) is the safest approach.

For the case with the rkvenc, the headers are also not created by the
kernel driver. Instead we use the gst_h264_bit_writer_sps/pps functions
that are part of the codecparsers module.

># Codec Features
>
>Codecs have many variable features that can be enabled or not and specific
>configuration fields that can take various values. There is usually some
>top-level indication of profile/level that restricts what can be used.
>
>This is a very similar situation to stateful encoding, where codec-specific
>controls are used to report and set profile/level and configure these aspects.
>A particularly nice thing about it is that we can reuse these existing controls
>and add new ones in the future for features that are not yet covered.
>
>This approach feels more flexible than designing new structures with a selected
>set of parameters (that could match the existing controls) for each codec.

I back the idea of generic profiles instead of explicit configuration
from the userspace point of view.

The parameterization works like this:

- Read the sane default parameter set from the driver.
- Modify the parameters based on the userspace decisions
  (currently hardcoded and not based on any user input).
- Write the updated parameters back to the driver.

># Reference and Reconstruction Management
<snip>

># Frame Types
<snip>

># Rate Control
<snip>

># Regions of Interest
<snip>

Since the first iteration of the rkvenc support is I-frame-only code, these other
topics are currently undefined and unimplemented in the GStreamer stack.


Regards,
Michael

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-21 18:19 ` Michael Grzeschik
@ 2023-07-24 14:03   ` Nicolas Dufresne
  2023-07-25  9:09     ` Paul Kocialkowski
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-07-24 14:03 UTC (permalink / raw)
  To: Michael Grzeschik, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Le vendredi 21 juillet 2023 à 20:19 +0200, Michael Grzeschik a écrit :
> > As a result, we cannot expect that any given encoder is able to produce frames
> > for any set of headers. Reporting related constraints and limitations (beyond
> > profile/level) seems quite difficult and error-prone.
> > 
> > So it seems that keeping header generation in-kernel only (close to where the
> > hardware is actually configured) is the safest approach.
> 
> For the case with the rkvenc, the headers are also not created by the
> kernel driver. Instead we use the gst_h264_bit_writer_sps/pps functions
> that are part of the codecparsers module.

One level of granularity we can add is split headers (like SPS/PPS) and
slice/frame headers. It remains that in some cases, like HEVC, when the slice
header is byte aligned, it can be nice to be able to handle it at application
side in order to avoid limiting SVC support (and other creative features) by our
API/abstraction limitations. I think a certain level of "per codec" reasoning is
also needed. For instance, I would not want to have to ask the kernel to generate
user data SEI and other in-band data.

regards,
Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-12 14:07   ` Paul Kocialkowski
@ 2023-07-25  3:33     ` Hsia-Jun Li
  2023-07-25 12:15       ` Paul Kocialkowski
  0 siblings, 1 reply; 29+ messages in thread
From: Hsia-Jun Li @ 2023-07-25  3:33 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: linux-kernel, Nicolas Dufresne, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni



On 7/12/23 22:07, Paul Kocialkowski wrote:
> Hi Nicolas,
> 
> Thanks for the quick reply!
> 
> On Tue 11 Jul 23, 14:18, Nicolas Dufresne wrote:
>> Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
>>> Hi everyone!
>>>
>>> After various discussions following Andrzej's talk at EOSS, feedback from the
>>> Media Summit (which I could not attend unfortunately) and various direct
>>> discussions, I have compiled some thoughts and ideas about stateless encoders
>>> support with various proposals. This is the result of a few years of interest
>>> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
>>> which turned out to have numerous design issues.
>>>
>>> I am now working on a H.264 encoder driver for Allwinner platforms (currently
>>> focusing on the V3/V3s), which already provides some usable bitstream and will
>>> be published soon.
>>>
>>> This is a very long email where I've tried to split things into distinct topics
>>> and explain a few concepts to make sure everyone is on the same page.
>>>
>>> # Bitstream Headers
>>>
>>> Stateless encoders typically do not generate all the bitstream headers and
>>> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
>>> headers). There's often some hardware block that makes bit-level writing to the
>>> destination buffer easier (deals with alignment, etc).
>>>
>>> The values of the bitstream headers must be in line with how the compressed
>>> data bitstream is generated and generally follow the codec specification.
>>> Some encoders might allow configuring all the fields found in the headers,
>>> others may only allow configuring a few or have specific constraints regarding
>>> which values are allowed.
>>>
>>> As a result, we cannot expect that any given encoder is able to produce frames
>>> for any set of headers. Reporting related constraints and limitations (beyond
>>> profile/level) seems quite difficult and error-prone.
>>>
>>> So it seems that keeping header generation in-kernel only (close to where the
>>> hardware is actually configured) is the safest approach.
>> This seems to match with what happened with the Hantro VP8 proof of concept. The
>> encoder does not produce the frame header, but also, it produces 2 encoded
>> buffers which cannot be made contiguous at the hardware level. This notion of
>> plane in coded data wasn't something that blended well with the rest of the API
>> and we didn't want to copy in the kernel while the userspace would also be
>> forced to copy to align the headers. Our conclusion was that it was best to
>> generate the headers and copy both segments before delivering to userspace. I
>> suspect this type of situation will be quite common.
> Makes sense! I guess the same will need to be done for Hantro H1 H.264 encoding
> (in my PoC the software-generated headers were crafted in userspace and didn't
> have to be part of the same buffer as the coded data).
We just need a method to indicate where the hardware could write its
slice data or compressed frame.
While we may decide which frames the current frame should reference,
some hardware may discard our decision and pick whichever reference
picture set uses fewer bits. Unless the codec supports a fill-up method,
this could lead to a gap between the header and the frame data.
> 
>>> # Codec Features
>>>
>>> Codecs have many variable features that can be enabled or not and specific
>>> configuration fields that can take various values. There is usually some
>>> top-level indication of profile/level that restricts what can be used.
>>>
>>> This is a very similar situation to stateful encoding, where codec-specific
>>> controls are used to report and set profile/level and configure these aspects.
>>> A particularly nice thing about it is that we can reuse these existing controls
>>> and add new ones in the future for features that are not yet covered.
>>>
>>> This approach feels more flexible than designing new structures with a selected
>>> set of parameters (that could match the existing controls) for each codec.
>> Though, reading more into this emails, we still have a fair amount of controls
>> to design and add, probably some compound controls too ?
> Yeah definitely. My point here is merely that we should reuse existing control
> for general codec features, but I don't think we'll get around introducing new
> ones for stateless-specific parts.
> 
Things like profile, level or tier could be reused. It makes no sense
to expose those vendor-specific features.
Besides, profile, level or tier are usually stored in the sequence
header or uncompressed header; the hardware doesn't care about that.

I think we should go with the vendor registers buffer approach that I have
always suggested. There are many encoding tools that a codec offers, and
hardware variants may not support or use them all. The context switching
between userspace and kernel would drive you mad with so many controls.
>>> # Reference and Reconstruction Management
>>>
>>> With stateless encoding, we need to tell the hardware which frames need to be
>>> used as references for encoding the current frame and make sure we have
>>> these references available as decoded frames in memory.
>>>
>>> Regardless of references, stateless encoders typically need some memory space to
>>> write the decoded (known as reconstructed) frame while it's being encoded.
>>>
>>> One question here is how many slots for decoded pictures should be allocated
>>> by the driver when starting to stream. There is usually a maximum number of
>>> reference frames that can be used at a time, although perhaps there is a use
>>> case for keeping more around and alternating between them for future references.
>>>
>>> Another question is how the driver should keep track of which frame will be used
>>> as a reference in the future and which one can be evicted from the pool of
>>> decoded pictures if it's not going to be used anymore.
>>>
>>> A restrictive approach would be to let the driver alone manage that, similarly
>>> to how stateful encoders behave. However it might provide extra flexibility
>>> (and memory gain) to allow userspace to configure the maximum number of possible
>>> reference frames. In that case it becomes necessary to indicate if a given
>>> frame will be used as a reference in the future (maybe using a buffer flag)
>>> and to indicate which previous reference frames (probably to be identified with
>>> the matching output buffer's timestamp) should be used for the current encode.
>>> This could be done with a new dedicated control (as a variable-sized array of
>>> timestamps). Note that userspace would have to update it for every frame or the
>>> reference frames will remain the same for future encodes.
>>>
>>> The driver will then make sure to keep the reconstructed buffer around, in one
>>> of the slots. When there's no slot left, the driver will drop the oldest
>>> reference it has (maybe with a bounce buffer to still allow it to be used as a
>>> reference for the current encode).
>>>
>>> With this behavior defined in the uAPI spec, userspace will also be able to
>>> keep track of which previous frame is no longer allowed as a reference.
>> If we want, we could mirror the stateless decoders here. During the decoding, we
>> pass a "dpb" or a reference list, which represent all the active references.
>> These do not have to be used by the current frame, but the driver is allowed to
>> use this list to cleanup and free unused memory (or reuse in case it has a fixed
>> slot model, like mtk vcodec).
>>
>> On top of this, we add a list of references to be used for producing the current
>> frame. Usually, the picture references are indices into the dpb/reference list
>> of timestamps. This makes validation easier. We'll have to define how many
>> references can be used, I think, since unlike decoders, encoders don't have to
>> fully implement levels and profiles.
> So that would be a very explicit description instead of expecting drivers to
> do the maintenance and userspace to figure out which frame was evicted from
> the list. So yeah this feels more robust!
> 
> Regarding the number of reference frames, I think we need to specify both
> how many references can be used at a time (number of hardware slots) and how
> many total references can be in the reference list (number of rec buffers to
> keep around).
> 
> We could also decide that making the current frame part of the global reference
> list is a way to indicate that its reconstruction buffer must be kept around,
> or we could have a separate way to indicate that. I lean towards the former
> since it would put all reference-related things in one place and avoid coming
> up with a new buffer flag or such.
> 
> Also we would probably still need to do some validation driver-side to make
> sure that userspace doesn't put references in the list that were not marked
> as such when encoded (and for which the reconstruction buffer may have been
> recycled already).
> 
The DPB is the only thing we need to decide any API for here under the
vendor registers buffer approach. We need the driver to translate the
buffer reference to the address the hardware can use, in the right registers.

The major problem is how to export the reconstruction buffer, which has
been hidden for many years.
This could be discussed in the other thread, like the V4L2 ext buffer API.
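Whatever the final control layout, the driver-side translation in question is
essentially a timestamp lookup over the reconstruction slots, along these
lines (hypothetical sketch, all names made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot holding a reconstructed (decoded) frame. */
struct rec_slot {
	uint64_t timestamp;	/* matching output buffer timestamp */
	uint32_t dma_addr;	/* address to program into the hardware */
	int used;
};

/*
 * Translate a reference frame timestamp (as provided by userspace in
 * the reference list) into the hardware address of the matching
 * reconstruction buffer. Returns 0 when the reference was already
 * evicted, which the driver would have to reject as invalid input.
 */
static uint32_t rec_slot_lookup(const struct rec_slot *slots, size_t count,
				uint64_t timestamp)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (slots[i].used && slots[i].timestamp == timestamp)
			return slots[i].dma_addr;

	return 0;
}
```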
>>> # Frame Types
>>>
>>> Stateless encoder drivers will typically instruct the hardware to encode either
>>> an intra-coded or an inter-coded frame. While a stream composed only of a single
>>> intra-coded frame followed by only inter-coded frames is possible, it's
>>> generally not desirable as it is not very robust against data loss and makes
>>> seeking difficult.
>> Let's avoid this generalization in our document and design. In RTP streaming,
>> like WebRTC or SIP, it is desirable to use an open GOP (with nothing other than
>> P frames all the time, except the very first one). The FORCE_KEY_FRAME is meant
>> to allow handling RTP PLI (and other similar feedback). It's quite rare that an
>> application would mix a closed GOP and FORCE_KEY_FRAME, but it's allowed.
>> What I've seen the most is that FORCE_KEY_FRAME would just start a new GOP,
>> following size and period from this new point.
> Okay fair enough, thanks for the details!
> 
>>> As a result, the frame type is usually decided based on a given GOP size
>>> (the frequency at which a new intra-coded frame is produced) while intra-coded
>>> frames can be explicitly requested on demand. Stateful encoders implement
>>> these through dedicated controls:
>>> - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
>>> - V4L2_CID_MPEG_VIDEO_GOP_SIZE
>>> - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
>>>
>>> It seems that reusing them would be possible, which would let the driver decide
>>> on the particular frame type.
>>>
>>> However it makes the reference frame management a bit trickier since reference
>>> frames might be requested from userspace for a frame that ends up being
>>> intra-coded. We can either allow this and silently ignore the info or expect
>>> that userspace keeps track of the GOP index and not send references on the first
>>> frame.
>>>
>>> In some codecs, there's also a notion of barrier key-frames (IDR frames in
>>> H.264) that strictly forbid using any past reference beyond the frame.
>>> There seems to be an assumption that the GOP start uses this kind of frame
>>> (and not any intra-coded frame), while the force key frame control does not
>>> particularly specify it.
>>>
>>> In that case we should flush the list of references and userspace should no
>>> longer provide references to them for future frames. This puts a requirement on
>>> userspace to keep track of GOP start in order to know when to flush its
>>> reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
>>> could also indicate a general intra-coded frame that is not a barrier.
>>>
>>> So another possibility would be for userspace to explicitly indicate which
>>> frame type to use (in a codec-specific way) and act accordingly, leaving any
>>> notion of GOP up to userspace. I feel like this might be the easiest approach
>>> while giving an extra degree of control to userspace.
>> I also lean toward this approach ...
>>
>>> # Rate Control
>>>
>>> Another important feature of encoders is the ability to control the amount of
>>> data produced following different rate control strategies. Stateful encoders
>>> typically do this in-firmware and expose controls for selecting the strategy
>>> and associated targets.
>>>
>>> It seems desirable to support both automatic and manual rate-control to
>>> userspace.
>>>
>>> Automatic control would be implemented kernel-side (with algos possibly shared
>>> across drivers) and reuse existing stateful controls. The advantage is
>>> simplicity (userspace does not need to carry its own rate-control
>>> implementation) and to ensure that there is a built-in mechanism for common
>>> strategies available for every driver (no mandatory dependency on a proprietary
>>> userspace stack). There may also be extra statistics or controls available to
>>> the driver that allow finer-grain control.
>> Though not controlling the GOP (or having no GOP) might require a bit more work
>> on the driver side. Today, we do have queues of requests, queues of buffers etc.
>> But it is still quite difficult to look ahead in these queues. That is only
>> useful if the rate control algorithm can use future frame types (like keyframes)
>> to make decisions. That could be me pushing too far here though.
> Yes, I agree the interaction between userspace GOP control and kernel-side
> rate-control might be quite tricky without any indication of what the next frame
> types will be.
> 
> Maybe we could only allow explicit frame type configuration when using manual
> rate-control and have kernel-side GOP management when in-kernel rc is used
> (and we can allow it with manual rate-control too). I like having this option
> because it allows for simple userspace implementations.
> 
> Note that this could perhaps also be added as an optional feature
> for stateful encoders since some of them seem to be able to instruct the
> firmware what frame type to use (in addition to directly controlling QP).
> There's also a good chance that this feature is not available when using
> a firmware-backed rc algorithm.
> 
>>> Manual control allows userspace to get creative and requires the ability to set
>>> the quantization parameter (QP) directly for each frame (controls already exist
>>> since many stateful encoders also support this).
>>>
>>> # Regions of Interest
>>>
>>> Regions of interest (ROIs) allow specifying sub-regions of the frame that should
>>> be prioritized for quality. Stateless encoders typically support a limited
>>> number and allow setting specific QP values for these regions.
>>>
>>> While the QP value should be used directly in manual rate-control, we probably
>>> want to have some "level of importance" setting for kernel-side rate-control,
>>> along with the dimensions/position of each ROI. This could be expressed with
>>> a new structure containing all these elements and presented as a variable-sized
>>> array control with as many elements as the hardware can support.
>> Do you see any difference in ROI for stateful and stateless? This looks like a
>> feature we could combine. Also, ROIs exist for cameras too; I'd probably try and
>> keep them separate though.
> I feel like the stateful/stateless behavior should be the same, so that could be
> a shared control too. Also we could use a QP delta which would apply to both
> manual and in-kernel rate-control, but maybe that's too low-level in the latter
> case (it's not very obvious what a relevant delta would be when userspace has no
> idea of the current frame-wide QP value).
> 
>> This is a very good overview of the hard work ahead of us. Looking forward to
>> this journey and your Allwinner driver.
> Thanks a lot for your input!
> 
> Honestly I was expecting that it would be more difficult than decoding, but it
> turns out it might not be the case.
> 
Such rate control or quality reporting would be completely vendor-specific.
We just need a method that lets the driver report those encoding statistics
to userspace.
> Cheers,
> 
> Paul
> 
> -- 
> Paul Kocialkowski, Bootlin
> Embedded Linux and kernel engineering
> https://bootlin.com

-- 
Hsia-Jun(Randy) Li


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-24 14:03   ` Nicolas Dufresne
@ 2023-07-25  9:09     ` Paul Kocialkowski
  2023-07-26 20:02       ` Nicolas Dufresne
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-07-25  9:09 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Michael Grzeschik, linux-kernel, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 2633 bytes --]

Hi Nicolas,

On Mon 24 Jul 23, 10:03, Nicolas Dufresne wrote:
> Le vendredi 21 juillet 2023 à 20:19 +0200, Michael Grzeschik a écrit :
> > > As a result, we cannot expect that any given encoder is able to produce frames
> > > for any set of headers. Reporting related constraints and limitations (beyond
> > > profile/level) seems quite difficult and error-prone.
> > > 
> > > So it seems that keeping header generation in-kernel only (close to where the
> > > hardware is actually configured) is the safest approach.
> > 
> > For the case with the rkvenc, the headers are also not created by the
> > kernel driver. Instead we use the gst_h264_bit_writer_sps/pps functions
> > that are part of the codecparsers module.
> 
> One level of granularity we can add is split headers (like SPS/PPS) and
> slice/frame headers.

Do you mean asking the driver to return a buffer with only SPS/PPS and then
return another buffer with the slice/frame header?

Looks like there's already a control for it: V4L2_CID_MPEG_VIDEO_HEADER_MODE
which takes either
- V4L2_MPEG_VIDEO_HEADER_MODE_SEPARATE: looks like what you're suggesting
- V4L2_MPEG_VIDEO_HEADER_MODE_JOINED_WITH_1ST_FRAME: usual case

So that could certainly be supported to easily allow userspace to stuff extra
NALUs in-between.

> It remains that in some cases, like HEVC, when the slice
> header is byte aligned, it can be nice to be able to handle it at application
> side in order to avoid limiting SVC support (and other creative features) by our
> API/abstraction limitations.

Do you see something in the headers that we expect the kernel to generate that
would need specific changes to support features like SVC?

From what I can see there's an svc_extension_flag that's only set for specific
NALUs (prefix_nal_unit/slice_layer_extension) so these could be inserted by
userspace.

Also I'm not very knowledgeable about SVC so it's not very clear to me if it's
possible to take an encoder that doesn't support SVC and turn the resulting
stream into something SVC-ready by adding extra NAL units or if the encoder
should be a lot more involved.

Also do you know if we have stateful codecs supporting SVC?

> I think a certain level of "per codec" reasoning is
> also needed. For instance, I would not want to have to ask the kernel to generate
> user data SEI and other in-band data.

Yeah it looks like there is definitely a need for adding extra NALUs from
userspace without passing that data to the kernel.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-25  3:33     ` Hsia-Jun Li
@ 2023-07-25 12:15       ` Paul Kocialkowski
  2023-07-26  2:49         ` Hsia-Jun Li
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-07-25 12:15 UTC (permalink / raw)
  To: Hsia-Jun Li
  Cc: linux-kernel, Nicolas Dufresne, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 19145 bytes --]

Hey,

Long time, good to see you are still around and interested in these topics :)

On Tue 25 Jul 23, 11:33, Hsia-Jun Li wrote:
> On 7/12/23 22:07, Paul Kocialkowski wrote:
> > Hi Nicolas,
> > 
> > Thanks for the quick reply!
> > 
> > On Tue 11 Jul 23, 14:18, Nicolas Dufresne wrote:
> > > Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
> > > > Hi everyone!
> > > > 
> > > > After various discussions following Andrzej's talk at EOSS, feedback from the
> > > > Media Summit (which I could not attend unfortunately) and various direct
> > > > discussions, I have compiled some thoughts and ideas about stateless encoders
> > > > support with various proposals. This is the result of a few years of interest
> > > > in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
> > > > which turned out to have numerous design issues.
> > > > 
> > > > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > > > focusing on the V3/V3s), which already provides some usable bitstream and will
> > > > be published soon.
> > > > 
> > > > This is a very long email where I've tried to split things into distinct topics
> > > > and explain a few concepts to make sure everyone is on the same page.
> > > > 
> > > > # Bitstream Headers
> > > > 
> > > > Stateless encoders typically do not generate all the bitstream headers and
> > > > sometimes no header at all (e.g. Allwinner encoder does not even produce slice
> > > > headers). There's often some hardware block that makes bit-level writing to the
> > > > destination buffer easier (deals with alignment, etc).
> > > > 
> > > > The values of the bitstream headers must be in line with how the compressed
> > > > data bitstream is generated and generally follow the codec specification.
> > > > Some encoders might allow configuring all the fields found in the headers,
> > > > others may only allow configuring a few or have specific constraints regarding
> > > > which values are allowed.
> > > > 
> > > > As a result, we cannot expect that any given encoder is able to produce frames
> > > > for any set of headers. Reporting related constraints and limitations (beyond
> > > > profile/level) seems quite difficult and error-prone.
> > > > 
> > > > So it seems that keeping header generation in-kernel only (close to where the
> > > > hardware is actually configured) is the safest approach.
> > > This seems to match with what happened with the Hantro VP8 proof of concept. The
> > > encoder does not produce the frame header, but also, it produces 2 encoded
> > > buffers which cannot be made contiguous at the hardware level. This notion of
> > > plane in coded data wasn't something that blended well with the rest of the API
> > > and we didn't want to copy in the kernel while the userspace would also be
> > > forced to copy to align the headers. Our conclusion was that it was best to
> > > generate the headers and copy both segments before delivering to userspace. I
> > > suspect this type of situation will be quite common.
> > Makes sense! I guess the same will need to be done for Hantro H1 H.264 encoding
> > (in my PoC the software-generated headers were crafted in userspace and didn't
> > have to be part of the same buffer as the coded data).
> We just need a method to indicate where the hardware could write its slice
> data or compressed frame.
> While we may decide which frames the current frame should reference,
> (some) hardware may discard our decision and pick whichever reference
> picture set uses fewer bits. Unless the codec supports a fill-up method,
> this could leave a gap between the header and the frame data.

I think I would need a bit more context to understand this case, especially
what the hardware could decide to discard.

My understanding is that the VP8 encoder needs to write part of the header
separately from the coded data and uses distinct address registers for the two.
So the approach is to move the hw-generated headers and coded data before
returning to userspace.

> > 
> > > > # Codec Features
> > > > 
> > > > Codecs have many variable features that can be enabled or not and specific
> > > > configuration fields that can take various values. There is usually some
> > > > top-level indication of profile/level that restricts what can be used.
> > > > 
> > > > This is a very similar situation to stateful encoding, where codec-specific
> > > > controls are used to report and set profile/level and configure these aspects.
> > > > A particularly nice thing about it is that we can reuse these existing controls
> > > > and add new ones in the future for features that are not yet covered.
> > > > 
> > > > This approach feels more flexible than designing new structures with a selected
> > > > set of parameters (that could match the existing controls) for each codec.
> > > Though, reading more into this email, we still have a fair amount of controls
> > > to design and add, probably some compound controls too?
> > Yeah definitely. My point here is merely that we should reuse existing control
> > for general codec features, but I don't think we'll get around introducing new
> > ones for stateless-specific parts.
> > 
> Things like profile, level or tier could be reused. It makes no sense to
> expose those vendor-specific features.
> Besides, profile, level or tier are usually stored in the sequence header
> or uncompressed header; the hardware doesn't care about that.
> 
> I think we should go with the vendor registers buffer way that I always
> suggested. There are many encoding tools that a codec offers, and hardware
> variants may not support or use them all. The context switching between
> userspace and kernel would drive you mad with so many controls.

I am strongly against this approach, instead I think we need to keep all
vendor-specific parts in the kernel driver and provide a clean unified userspace
API.

Also I think V4L2 has way to set multiple controls at once, so the
userspace/kernel context switching is rather minimal and within reasonable
expectations. Of course it will never be as efficient as userspace mapping the
hardware registers in virtual memory but there are so many problems with this
approach that it's really not worth it.

> > > > # Reference and Reconstruction Management
> > > > 
> > > > With stateless encoding, we need to tell the hardware which frames need to be
> > > > used as references for encoding the current frame and make sure we have
> > > > these references available as decoded frames in memory.
> > > > 
> > > > Regardless of references, stateless encoders typically need some memory space to
> > > > write the decoded (known as reconstructed) frame while it's being encoded.
> > > > 
> > > > One question here is how many slots for decoded pictures should be allocated
> > > > by the driver when starting to stream. There is usually a maximum number of
> > > > reference frames that can be used at a time, although perhaps there is a use
> > > > case for keeping more around and alternating between them for future references.
> > > > 
> > > > Another question is how the driver should keep track of which frame will be used
> > > > as a reference in the future and which one can be evicted from the pool of
> > > > decoded pictures if it's not going to be used anymore.
> > > > 
> > > > A restrictive approach would be to let the driver alone manage that, similarly
> > > > to how stateful encoders behave. However it might provide extra flexibility
> > > > (and memory gain) to allow userspace to configure the maximum number of possible
> > > > reference frames. In that case it becomes necessary to indicate if a given
> > > > frame will be used as a reference in the future (maybe using a buffer flag)
> > > > and to indicate which previous reference frames (probably to be identified with
> > > > the matching output buffer's timestamp) should be used for the current encode.
> > > > This could be done with a new dedicated control (as a variable-sized array of
> > > > timestamps). Note that userspace would have to update it for every frame or the
> > > > reference frames will remain the same for future encodes.
> > > > 
> > > > The driver will then make sure to keep the reconstructed buffer around, in one
> > > > of the slots. When there's no slot left, the driver will drop the oldest
> > > > reference it has (maybe with a bounce buffer to still allow it to be used as a
> > > > reference for the current encode).
> > > > 
> > > > With this behavior defined in the uAPI spec, userspace will also be able to
> > > > keep track of which previous frame is no longer allowed as a reference.
> > > If we want, we could mirror the stateless decoders here. During the decoding, we
> > > pass a "dpb" or a reference list, which represent all the active references.
> > > These do not have to be used by the current frame, but the driver is allowed to
> > > use this list to cleanup and free unused memory (or reuse in case it has a fixed
> > > slot model, like mtk vcodec).
> > > 
> > > On top of this, we add a list of references to be used for producing the current
> > > frame. Usually, the picture references are indices into the dpb/reference list
> > > of timestamps. This makes validation easier. We'll have to define how many
> > > references can be used, I think, since unlike decoders, encoders don't have to
> > > fully implement levels and profiles.
> > So that would be a very explicit description instead of expecting drivers to
> > do the maintenance and userspace to figure out which frame was evicted from
> > the list. So yeah this feels more robust!
> > 
> > Regarding the number of reference frames, I think we need to specify both
> > how many references can be used at a time (number of hardware slots) and how
> > many total references can be in the reference list (number of rec buffers to
> > keep around).
> > 
> > We could also decide that making the current frame part of the global reference
> > list is a way to indicate that its reconstruction buffer must be kept around,
> > or we could have a separate way to indicate that. I lean towards the former
> > since it would put all reference-related things in one place and avoid coming
> > up with a new buffer flag or such.
> > 
> > Also we would probably still need to do some validation driver-side to make
> > sure that userspace doesn't put references in the list that were not marked
> > as such when encoded (and for which the reconstruction buffer may have been
> > recycled already).
> > 
> DPB is the only thing we need to decide any API for here under the vendor
> registers buffer way. We need the driver to translate the buffer references
> to addresses that the hardware could use, in the right registers.
> 
> The major problem is how to export the reconstruction buffer, which has
> been hidden for many years.
> This could be discussed in the other thread, like the V4L2 ext buffer API.

Following my previous point, I am also strongly against exposing the
reconstruction buffer to userspace.
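To make the explicit reference management discussed in this section concrete, here is a purely hypothetical sketch of what such a per-frame reference-list control payload could look like. None of these names or bounds exist in the uAPI today; they are invented for illustration only:

```c
#include <linux/types.h>

#define V4L2_ENC_MAX_REFS	16	/* invented bound, for illustration */

/* Hypothetical control payload: timestamps identify previously encoded
 * frames (matching their output buffer timestamp) whose reconstructed
 * buffers must be kept around, and indices into that list select the
 * active references for the current frame. */
struct v4l2_ctrl_enc_ref_list {
	__u32	num_refs;			/* entries kept around */
	__u64	timestamps[V4L2_ENC_MAX_REFS];	/* reconstructed frames */
	__u32	num_active;			/* references for this frame */
	__u8	active_idx[V4L2_ENC_MAX_REFS];	/* indices into timestamps[] */
};
```

The driver would validate that every timestamp still maps to a live reconstruction buffer and reject the request otherwise, matching the validation point raised above.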

> > > > # Frame Types
> > > > 
> > > > Stateless encoder drivers will typically instruct the hardware to encode either
> > > > an intra-coded or an inter-coded frame. While a stream composed only of a single
> > > > intra-coded frame followed by only inter-coded frames is possible, it's
> > > > generally not desirable as it is not very robust against data loss and makes
> > > > seeking difficult.
> > > Let's avoid this generalization in our document and design. In RTP streaming,
> > > like WebRTC or SIP, it is desirable to use open GOP (with nothing else than P
> > > frames all the time, except the very first one). The FORCE_KEY_FRAME is meant to
> > > allow handling RTP PLI (and other similar feedback). It's quite rare that an
> > > application would mix closed GOP and FORCE_KEY_FRAME, but it's allowed though.
> > > What I've seen the most, is the FORCE_KEY_FRAME would just start a new GOP,
> > > following size and period from this new point.
> > Okay fair enough, thanks for the details!
> > 
> > > > As a result, the frame type is usually decided based on a given GOP size
> > > > (the frequency at which a new intra-coded frame is produced) while intra-coded
> > > > frames can also be explicitly requested. Stateful encoders implement
> > > > these through dedicated controls:
> > > > - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> > > > - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> > > > - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> > > > 
> > > > It seems that reusing them would be possible, which would let the driver decide
> > > > of the particular frame type.
> > > > 
> > > > However it makes the reference frame management a bit trickier since reference
> > > > frames might be requested from userspace for a frame that ends up being
> > > > intra-coded. We can either allow this and silently ignore the info or expect
> > > > that userspace keeps track of the GOP index and not send references on the first
> > > > frame.
> > > > 
> > > > In some codecs, there's also a notion of barrier key-frames (IDR frames in
> > > > H.264) that strictly forbid using any past reference beyond the frame.
> > > > There seems to be an assumption that the GOP start uses this kind of frame
> > > > (and not any intra-coded frame), while the force key frame control does not
> > > > particularly specify it.
> > > > 
> > > > In that case we should flush the list of references and userspace should no
> > > > longer provide references to them for future frames. This puts a requirement on
> > > > userspace to keep track of GOP start in order to know when to flush its
> > > > reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> > > > could also indicate a general intra-coded frame that is not a barrier.
> > > > 
> > > > So another possibility would be for userspace to explicitly indicate which
> > > > frame type to use (in a codec-specific way) and act accordingly, leaving any
> > > > notion of GOP up to userspace. I feel like this might be the easiest approach
> > > > while giving an extra degree of control to userspace.
> > > I also lean toward this approach ...
> > > 
> > > > # Rate Control
> > > > 
> > > > Another important feature of encoders is the ability to control the amount of
> > > > data produced following different rate control strategies. Stateful encoders
> > > > typically do this in-firmware and expose controls for selecting the strategy
> > > > and associated targets.
> > > > 
> > > > It seems desirable to support both automatic and manual rate-control to
> > > > userspace.
> > > > 
> > > > Automatic control would be implemented kernel-side (with algos possibly shared
> > > > across drivers) and reuse existing stateful controls. The advantage is
> > > > simplicity (userspace does not need to carry its own rate-control
> > > > implementation) and to ensure that there is a built-in mechanism for common
> > > > strategies available for every driver (no mandatory dependency on a proprietary
> > > > userspace stack). There may also be extra statistics or controls available to
> > > > the driver that allow finer-grain control.
> > > Though not controlling the GOP (or having no GOP) might require a bit more work
> > > on the driver side. Today, we do have queues of requests, queues of buffers, etc.
> > > But it is still quite difficult to do lookahead in these queues. That is only useful if
> > > rate control algorithm can use future frame type (like keyframe) to make
> > > decisions. That could be me pushing to far here though.
> > Yes I agree the interaction between userspace GOP control and kernel-side
> > rate-control might be quite tricky without any indication of what the next
> > frame types will be.
> > 
> > Maybe we could only allow explicit frame type configuration when using manual
> > rate-control and have kernel-side GOP management when in-kernel rc is used
> > (and we can allow it with manual rate-control too). I like having this option
> > because it allows for simple userspace implementations.
> > 
> > Note that this could perhaps also be added as an optional feature
> > for stateful encoders since some of them seem to be able to instruct the
> > firmware what frame type to use (in addition to directly controlling QP).
> > There's also a good chance that this feature is not available when using
> > a firmware-backed rc algorithm.
> > 
> > > > Manual control allows userspace to get creative and requires the ability to set
> > > > the quantization parameter (QP) directly for each frame (controls for this
> > > > already exist, as many stateful encoders also support it).
> > > > 
> > > > # Regions of Interest
> > > > 
> > > > Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> > > > be prioritized for quality. Stateless encoders typically support a limited
> > > > number and allow setting specific QP values for these regions.
> > > > 
> > > > While the QP value should be used directly in manual rate-control, we probably
> > > > want to have some "level of importance" setting for kernel-side rate-control,
> > > > along with the dimensions/position of each ROI. This could be expressed with
> > > > a new structure containing all these elements and presented as a variable-sized
> > > > array control with as many elements as the hardware can support.
> > > Do you see any difference in ROI for stateful and stateless ? This looks like a
> > > feature we could combined. Also, ROI exist for cameras too, I'd probably try and
> > > keep them separate though.
> > I feel like the stateful/stateless behavior should be the same, so that could be
> > a shared control too. Also we could use a QP delta which would apply to both
> > manual and in-kernel rate-control, but maybe that's too low-level in the latter
> > case (not very obvious what a relevant delta could be when userspace has no idea
> > of the current frame-wide QP value).
> > 
> > > This is a very good overview of the hard work ahead of us. Looking forward on
> > > this journey and your Allwinner driver.
> > Thanks a lot for your input!
> > 
> > Honestly I was expecting that it would be more difficult than decoding, but it
> > turns out it might not be the case.
> > 
> Such rate control or quality reporting would be completely vendor-specific.
>
> We just need a method that lets the driver report those encoding statistics
> to userspace.

Returning the encoded bitstream size is perfectly generic and available to
every encoder. Maybe we could also return some average QP value since that
seems quite common. Other than that the rest should be kept in-kernel so we
can have a generic API.

Also it seems that the Hantro H1 specific mechanism (checkpoint-based) is not
necessarily a lot better than regular QP-wide settings.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-25 12:15       ` Paul Kocialkowski
@ 2023-07-26  2:49         ` Hsia-Jun Li
  2023-07-26 19:53           ` Nicolas Dufresne
  0 siblings, 1 reply; 29+ messages in thread
From: Hsia-Jun Li @ 2023-07-26  2:49 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: linux-kernel, Nicolas Dufresne, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni



On 7/25/23 20:15, Paul Kocialkowski wrote:
> 
> Hey,
> 
> Long time, good to see you are still around and interested in these topics 😄
> 
> On Tue 25 Jul 23, 11:33, Hsia-Jun Li wrote:
>> On 7/12/23 22:07, Paul Kocialkowski wrote:
>>> Hi Nicolas,
>>>
>>> Thanks for the quick reply!
>>>
>>> On Tue 11 Jul 23, 14:18, Nicolas Dufresne wrote:
>>>> Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
>>>>> Hi everyone!
>>>>>
>>>>> After various discussions following Andrzej's talk at EOSS, feedback from the
>>>>> Media Summit (which I could not attend unfortunately) and various direct
>>>>> discussions, I have compiled some thoughts and ideas about stateless encoders
>>>>> support with various proposals. This is the result of a few years of interest
>>>>> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
>>>>> which turned out to have numerous design issues.
>>>>>
>>>>> I am now working on a H.264 encoder driver for Allwinner platforms (currently
>>>>> focusing on the V3/V3s), which already provides some usable bitstream and will
>>>>> be published soon.
>>>>>
>>>>> This is a very long email where I've tried to split things into distinct topics
>>>>> and explain a few concepts to make sure everyone is on the same page.
>>>>>
>>>>> # Bitstream Headers
>>>>>
>>>>> Stateless encoders typically do not generate all the bitstream headers and
>>>>> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
>>>>> headers). There's often some hardware block that makes bit-level writing to the
>>>>> destination buffer easier (deals with alignment, etc).
>>>>>
>>>>> The values of the bitstream headers must be in line with how the compressed
>>>>> data bitstream is generated and generally follow the codec specification.
>>>>> Some encoders might allow configuring all the fields found in the headers,
>>>>> others may only allow configuring a few or have specific constraints regarding
>>>>> which values are allowed.
>>>>>
>>>>> As a result, we cannot expect that any given encoder is able to produce frames
>>>>> for any set of headers. Reporting related constraints and limitations (beyond
>>>>> profile/level) seems quite difficult and error-prone.
>>>>>
>>>>> So it seems that keeping header generation in-kernel only (close to where the
>>>>> hardware is actually configured) is the safest approach.
>>>> This seems to match with what happened with the Hantro VP8 proof of concept. The
>>>> encoder does not produce the frame header, but also, it produces 2 encoded
>>>> buffers which cannot be made contiguous at the hardware level. This notion of
>>>> plane in coded data wasn't something that blended well with the rest of the API
>>>> and we didn't want to copy in the kernel while the userspace would also be
>>>> forced to copy to align the headers. Our conclusion was that it was best to
>>>> generate the headers and copy both segments before delivering to userspace. I
>>>> suspect this type of situation will be quite common.
>>> Makes sense! I guess the same will need to be done for Hantro H1 H.264 encoding
>>> (in my PoC the software-generated headers were crafted in userspace and didn't
>>> have to be part of the same buffer as the coded data).
>> We just need a method to indicate where the hardware could write its slice
>> data or compressed frame.
>> While we may decide which frames the current frame should reference,
>> (some) hardware may discard our decision and pick whichever reference
>> picture set uses fewer bits. Unless the codec supports a fill-up method,
>> this could leave a gap between the header and the frame data.
> I think I would need a bit more context to understand this case, especially
> what the hardware could decide to discard.
> 
I know the Hantro can't do this, but such a design is not unusual. The HW 
could tell us that no CU ended up inter-predicting from one of the previous 
reconstructed frames, in which case it is not necessary to keep it in the RPS.
> My understanding is that the VP8 encoder needs to write part of the header
> separately from the coded data and uses distinct address registers for the two.
I don't think Hantro H1 would do that.
> So the approach is to move the hw-generated headers and coded data before
> returning to userspace.
> 
>>>>> # Codec Features
>>>>>
>>>>> Codecs have many variable features that can be enabled or not and specific
>>>>> configuration fields that can take various values. There is usually some
>>>>> top-level indication of profile/level that restricts what can be used.
>>>>>
>>>>> This is a very similar situation to stateful encoding, where codec-specific
>>>>> controls are used to report and set profile/level and configure these aspects.
>>>>> A particularly nice thing about it is that we can reuse these existing controls
>>>>> and add new ones in the future for features that are not yet covered.
>>>>>
>>>>> This approach feels more flexible than designing new structures with a selected
>>>>> set of parameters (that could match the existing controls) for each codec.
>>>> Though, reading more into this email, we still have a fair amount of controls
>>>> to design and add, probably some compound controls too?
>>> Yeah definitely. My point here is merely that we should reuse existing control
>>> for general codec features, but I don't think we'll get around introducing new
>>> ones for stateless-specific parts.
>>>
>> Things like profile, level or tier could be reused. It makes no sense to
>> expose those vendor-specific features.
>> Besides, profile, level or tier are usually stored in the sequence header
>> or uncompressed header; the hardware doesn't care about that.
>>
>> I think we should go with the vendor registers buffer way that I always
>> suggested. There are many encoding tools that a codec offers, and hardware
>> variants may not support or use them all. The context switching between
>> userspace and kernel would drive you mad with so many controls.
> I am strongly against this approach, instead I think we need to keep all
> vendor-specific parts in the kernel driver and provide a clean unified userspace
> API.
> 
We are driving away vendor participation. Besides, the current design is 
a performance bottleneck.
> Also I think V4L2 has way to set multiple controls at once, so the
> userspace/kernel context switching is rather minimal and within reasonable
> expectations. Of course it will never be as efficient as userspace mapping the
> hardware registers in virtual memory but there are so many problems with this
> approach that it's really not worth it.
> 
I am not talking about mapping the registers to userspace.
Userspace would generate a register set for the current frame, while 
the kernel would fill that register set with buffer addresses and trigger 
the hardware to apply the registers.

Generating a register set from controls, or even filling a partial slice 
header, costs many resources.

And what we try to define may not fit real hardware designs: we could only 
cover what most hardware would require, but vendors don't have to follow 
that. Besides, a codec spec could be updated even after it has been 
released for a while.

>>>>> # Reference and Reconstruction Management
>>>>>
>>>>> With stateless encoding, we need to tell the hardware which frames need to be
>>>>> used as references for encoding the current frame and make sure we have
>>>>> these references available as decoded frames in memory.
>>>>>
>>>>> Regardless of references, stateless encoders typically need some memory space to
>>>>> write the decoded (known as reconstructed) frame while it's being encoded.
>>>>>
>>>>> One question here is how many slots for decoded pictures should be allocated
>>>>> by the driver when starting to stream. There is usually a maximum number of
>>>>> reference frames that can be used at a time, although perhaps there is a use
>>>>> case for keeping more around and alternating between them for future references.
>>>>>
>>>>> Another question is how the driver should keep track of which frame will be used
>>>>> as a reference in the future and which one can be evicted from the pool of
>>>>> decoded pictures if it's not going to be used anymore.
>>>>>
>>>>> A restrictive approach would be to let the driver alone manage that, similarly
>>>>> to how stateful encoders behave. However it might provide extra flexibility
>>>>> (and memory gain) to allow userspace to configure the maximum number of possible
>>>>> reference frames. In that case it becomes necessary to indicate if a given
>>>>> frame will be used as a reference in the future (maybe using a buffer flag)
>>>>> and to indicate which previous reference frames (probably to be identified with
>>>>> the matching output buffer's timestamp) should be used for the current encode.
>>>>> This could be done with a new dedicated control (as a variable-sized array of
>>>>> timestamps). Note that userspace would have to update it for every frame or the
>>>>> reference frames will remain the same for future encodes.
>>>>>
>>>>> The driver will then make sure to keep the reconstructed buffer around, in one
>>>>> of the slots. When there's no slot left, the driver will drop the oldest
>>>>> reference it has (maybe with a bounce buffer to still allow it to be used as a
>>>>> reference for the current encode).
>>>>>
>>>>> With this behavior defined in the uAPI spec, userspace will also be able to
>>>>> keep track of which previous frame is no longer allowed as a reference.
>>>> If we want, we could mirror the stateless decoders here. During the decoding, we
>>>> pass a "dpb" or a reference list, which represent all the active references.
>>>> These do not have to be used by the current frame, but the driver is allowed to
>>>> use this list to cleanup and free unused memory (or reuse in case it has a fixed
>>>> slot model, like mtk vcodec).
>>>>
>>>> On top of this, we add a list of references to be used for producing the current
>>>> frame. Usually, the picture references are indices into the dpb/reference list
>>>> of timestamps. This makes validation easier. We'll have to define how many
>>>> references can be used, I think, since unlike decoders, encoders don't have to
>>>> fully implement levels and profiles.
>>> So that would be a very explicit description instead of expecting drivers to
>>> do the maintenance and userspace to figure out which frame was evicted from
>>> the list. So yeah this feels more robust!
>>>
>>> Regarding the number of reference frames, I think we need to specify both
>>> how many references can be used at a time (number of hardware slots) and how
>>> many total references can be in the reference list (number of rec buffers to
>>> keep around).
>>>
>>> We could also decide that making the current frame part of the global reference
>>> list is a way to indicate that its reconstruction buffer must be kept around,
>>> or we could have a separate way to indicate that. I lean towards the former
>>> since it would put all reference-related things in one place and avoid coming
>>> up with a new buffer flag or such.
>>>
>>> Also we would probably still need to do some validation driver-side to make
>>> sure that userspace doesn't put references in the list that were not marked
>>> as such when encoded (and for which the reconstruction buffer may have been
>>> recycled already).
>>>
>> The DPB is the only thing we need to decide an API for here, given the vendor
>> register-buffer approach. We need the driver to translate the buffer references
>> into addresses that the hardware can use in the right registers.
>>
>> The major problem is how to export the reconstruction buffer, which has been
>> hidden for many years.
>> This could be discussed in another thread, like the V4L2 ext buffer API.
> Following my previous point, I am also strongly against exposing the
> reconstruction buffer to userspace.
> 
Android hates people allocating a huge amount of memory without 
userspace's (Android's core system) awareness.

Whether a reconstruction frame will be used as a long-term reference (or 
golden frame) is completely up to userspace's decision. For example, when 
encoding part of SVC layer 1, we may not reference a frame in layer 
0; should we let the hardware discard it? Later, we may decide to 
reference it again.

Besides, I don't like the timestamp way of referring to a buffer here: one 
input graphics buffer could produce multiple reconstruction buffers (with 
different coding options), which is common in the SVC case.
>>>>> # Frame Types
>>>>>
>>>>> Stateless encoder drivers will typically instruct the hardware to encode either
>>>>> an intra-coded or an inter-coded frame. While a stream composed only of a single
>>>>> intra-coded frame followed by only inter-coded frames is possible, it's
>>>>> generally not desirable as it is not very robust against data loss and makes
>>>>> seeking difficult.
>>>> Let's avoid this generalization in our document and design. In RTP streaming,
>>>> like WebRTC or SIP, it is desirable to use an open GOP (with nothing other than
>>>> P frames all the time, except the very first one). The FORCE_KEY_FRAME is meant
>>>> to allow handling RTP PLI (and other similar feedback). It's quite rare that an
>>>> application would mix closed GOP and FORCE_KEY_FRAME, but it is allowed though.
>>>> What I've seen the most is that the FORCE_KEY_FRAME would just start a new GOP,
>>>> following size and period from this new point.
>>> Okay fair enough, thanks for the details!
>>>
>>>>> As a result, the frame type is usually decided based on a given GOP size
>>>>> (the frequency at which a new intra-coded frame is produced) while intra-coded
>>>>> frames can also be explicitly requested on demand. Stateful encoders implement
>>>>> these through dedicated controls:
>>>>> - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
>>>>> - V4L2_CID_MPEG_VIDEO_GOP_SIZE
>>>>> - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
>>>>>
>>>>> It seems that reusing them would be possible, which would let the driver decide
>>>>> of the particular frame type.
>>>>>
>>>>> However it makes the reference frame management a bit trickier since reference
>>>>> frames might be requested from userspace for a frame that ends up being
>>>>> intra-coded. We can either allow this and silently ignore the info or expect
>>>>> that userspace keeps track of the GOP index and not send references on the first
>>>>> frame.
>>>>>
>>>>> In some codecs, there's also a notion of barrier key-frames (IDR frames in
>>>>> H.264) that strictly forbid using any past reference beyond the frame.
>>>>> There seems to be an assumption that the GOP start uses this kind of frame
>>>>> (and not any intra-coded frame), while the force key frame control does not
>>>>> particularly specify it.
>>>>>
>>>>> In that case we should flush the list of references and userspace should no
>>>>> longer provide references to them for future frames. This puts a requirement on
>>>>> userspace to keep track of GOP start in order to know when to flush its
>>>>> reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
>>>>> could also indicate a general intra-coded frame that is not a barrier.
>>>>>
>>>>> So another possibility would be for userspace to explicitly indicate which
>>>>> frame type to use (in a codec-specific way) and act accordingly, leaving any
>>>>> notion of GOP up to userspace. I feel like this might be the easiest approach
>>>>> while giving an extra degree of control to userspace.
>>>> I also lean toward this approach ...
>>>>
>>>>> # Rate Control
>>>>>
>>>>> Another important feature of encoders is the ability to control the amount of
>>>>> data produced following different rate control strategies. Stateful encoders
>>>>> typically do this in-firmware and expose controls for selecting the strategy
>>>>> and associated targets.
>>>>>
>>>>> It seems desirable to support both automatic and manual rate-control to
>>>>> userspace.
>>>>>
>>>>> Automatic control would be implemented kernel-side (with algos possibly shared
>>>>> across drivers) and reuse existing stateful controls. The advantage is
>>>>> simplicity (userspace does not need to carry its own rate-control
>>>>> implementation) and to ensure that there is a built-in mechanism for common
>>>>> strategies available for every driver (no mandatory dependency on a proprietary
>>>>> userspace stack). There may also be extra statistics or controls available to
>>>>> the driver that allow finer-grain control.
>>>> Though not controlling the GOP (or having no GOP) might require a bit more work
>>>> on the driver side. Today, we do have queues of requests, queues of buffers,
>>>> etc. But it is still quite difficult to do lookahead in these queues. That is
>>>> only useful if the rate control algorithm can use future frame types (like
>>>> keyframes) to make decisions. That could be me pushing too far here though.
>>> Yes, I agree the interaction between userspace GOP control and kernel-side
>>> rate-control might be quite tricky without any indication of what the next
>>> frame types will be.
>>>
>>> Maybe we could only allow explicit frame type configuration when using manual
>>> rate-control and have kernel-side GOP management when in-kernel rc is used
>>> (and we can allow it with manual rate-control too). I like having this option
>>> because it allows for simple userspace implementations.
>>>
>>> Note that this could perhaps also be added as an optional feature
>>> for stateful encoders since some of them seem to be able to instruct the
>>> firmware what frame type to use (in addition to directly controlling QP).
>>> There's also a good chance that this feature is not available when using
>>> a firmware-backed rc algorithm.
>>>
>>>>> Manual control allows userspace to get creative and requires the ability to set
>>>>> the quantization parameter (QP) directly for each frame (the controls already
>>>>> exist, as many stateful encoders also support this).
>>>>>
>>>>> # Regions of Interest
>>>>>
>>>>> Regions of interest (ROIs) allow specifying sub-regions of the frame that should
>>>>> be prioritized for quality. Stateless encoders typically support a limited
>>>>> number and allow setting specific QP values for these regions.
>>>>>
>>>>> While the QP value should be used directly in manual rate-control, we probably
>>>>> want to have some "level of importance" setting for kernel-side rate-control,
>>>>> along with the dimensions/position of each ROI. This could be expressed with
>>>>> a new structure containing all these elements and presented as a variable-sized
>>>>> array control with as many elements as the hardware can support.
>>>> Do you see any difference in ROI for stateful and stateless? This looks like a
>>>> feature we could combine. Also, ROI exists for cameras too; I'd probably try to
>>>> keep them separate though.
>>> I feel like the stateful/stateless behavior should be the same, so that could be
>>> a shared control too. Also we could use a QP delta which would apply to both
>>> manual and in-kernel rate-control, but maybe that's too low-level in the latter
>>> case (not very obvious when a relevant delta could be when userspace has no idea
>>> of the current frame-wide QP value).
>>>
>>>> This is a very good overview of the hard work ahead of us. Looking forward on
>>>> this journey and your Allwinner driver.
>>> Thanks a lot for your input!
>>>
>>> Honestly I was expecting that it would be more difficult than decoding, but it
>>> turns out it might not be the case.
>>>
>> Such rate control or quality reporting would be completely vendor-specific.
>>
>> We just need a method that lets the driver report those encoding statistics to
>> the userspace.
> Returning the encoded bitstream size is perfectly generic and available to
> every encoder. Maybe we could also return some average QP value since that
> seems quite common. Other than that the rest should be kept in-kernel so we
> can have a generic API.
> 
You are just throwing away the tools that the hardware could offer.
> Also it seems that the Hantro H1 specific mechanism (checkpoint-based) is not
> necessarily a lot better than regular QP-wide settings.
> 
Macroblock-level QP control in Hantro H1 is very useful. For FOSS, those 
vendor-specific statistics or controls may not be necessary, but a real 
product is not that simple.
> Cheers,
> 
> Paul
> 
> -- 
> Paul Kocialkowski, Bootlin
> Embedded Linux and kernel engineering
> https://bootlin.com
> 

-- 
Hsia-Jun(Randy) Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-11 18:18 ` Nicolas Dufresne
  2023-07-12 14:07   ` Paul Kocialkowski
@ 2023-07-26  8:18   ` Hans Verkuil
  2023-08-09 14:43     ` Paul Kocialkowski
  1 sibling, 1 reply; 29+ messages in thread
From: Hans Verkuil @ 2023-07-26  8:18 UTC (permalink / raw)
  To: Nicolas Dufresne, Paul Kocialkowski, linux-kernel, linux-media,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter
  Cc: Jernej Škrabec, Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

On 11/07/2023 20:18, Nicolas Dufresne wrote:
> Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
>> Hi everyone!
>>
>> After various discussions following Andrzej's talk at EOSS, feedback from the
>> Media Summit (which I could not attend unfortunately) and various direct
>> discussions, I have compiled some thoughts and ideas about stateless encoders
>> support with various proposals. This is the result of a few years of interest
>> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
>> which turned out to have numerous design issues.
>>
>> I am now working on a H.264 encoder driver for Allwinner platforms (currently
>> focusing on the V3/V3s), which already provides some usable bitstream and will
>> be published soon.
>>
>> This is a very long email where I've tried to split things into distinct topics
>> and explain a few concepts to make sure everyone is on the same page.
>>
>> # Bitstream Headers
>>
>> Stateless encoders typically do not generate all the bitstream headers and
>> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
>> headers). There's often some hardware block that makes bit-level writing to the
>> destination buffer easier (deals with alignment, etc).
>>
>> The values of the bitstream headers must be in line with how the compressed
>> data bitstream is generated and generally follow the codec specification.
>> Some encoders might allow configuring all the fields found in the headers,
>> others may only allow configuring a few or have specific constraints regarding
>> which values are allowed.
>>
>> As a result, we cannot expect that any given encoder is able to produce frames
>> for any set of headers. Reporting related constraints and limitations (beyond
>> profile/level) seems quite difficult and error-prone.
>>
>> So it seems that keeping header generation in-kernel only (close to where the
>> hardware is actually configured) is the safest approach.
> 
> This seems to match with what happened with the Hantro VP8 proof of concept. The
> encoder does not produce the frame header, but also, it produces 2 encoded
> buffers which cannot be made contiguous at the hardware level. This notion of
> plane in coded data wasn't something that blended well with the rest of the API
> and we didn't want to copy in the kernel while the userspace would also be
> forced to copy to align the headers. Our conclusion was that it was best to
> generate the headers and copy both segments before delivering to userspace. I
> suspect this type of situation will be quite common.
> 
>>
>> # Codec Features
>>
>> Codecs have many variable features that can be enabled or not and specific
>> configuration fields that can take various values. There is usually some
>> top-level indication of profile/level that restricts what can be used.
>>
>> This is a very similar situation to stateful encoding, where codec-specific
>> controls are used to report and set profile/level and configure these aspects.
>> A particularly nice thing about it is that we can reuse these existing controls
>> and add new ones in the future for features that are not yet covered.
>>
>> This approach feels more flexible than designing new structures with a selected
>> set of parameters (that could match the existing controls) for each codec.
> 
> Though, reading more into these emails, we still have a fair amount of controls
> to design and add, probably some compound controls too?

I expect that for stateless encoders support for read-only requests will be needed:

https://patchwork.linuxtv.org/project/linux-media/list/?series=5647

I worked on that in the past together with dynamic control arrays. The dynamic
array part was merged, but the read-only request part wasn't (there was never a
driver that actually needed it).

I don't know if that series still applies, but if there is a need for it then I
can rebase it and post an RFCv3.

Regards,

	Hans

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-26  2:49         ` Hsia-Jun Li
@ 2023-07-26 19:53           ` Nicolas Dufresne
  2023-07-27  2:45             ` Hsia-Jun Li
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-07-26 19:53 UTC (permalink / raw)
  To: Hsia-Jun Li, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Hi,

Le mercredi 26 juillet 2023 à 10:49 +0800, Hsia-Jun Li a écrit :
> > I am strongly against this approach, instead I think we need to keep all
> > vendor-specific parts in the kernel driver and provide a clean unified userspace
> > API.
> > 
> We are driving away vendor participation. Besides, the current design is 
> a performance bottleneck.

I know you have been hammering this argument for many many years. But in
concrete situations, we have conducted tests, and we outperform vendor stacks
that hit hardware registers directly with stateless CODECs. Also, Paul's
proposal is that fine-grained, close-to-the-metal tuning of the encoding
process should end up in the Linux kernel, so that it can benefit from the
natural hard real-time advantage of a hard IRQ. Just like anything else, we
will find a lot of common methods and shareable code, which will benefit
security and quality, very unlike what we normally get from per-vendor BSPs.
The strategy is the same as for everything else in Linux: vendors will adopt it
if there is a clear benefit. And better quality, ease of use, and a good
collection of mature userspace software are what make the difference. It does
take time of course.

regards,
Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-25  9:09     ` Paul Kocialkowski
@ 2023-07-26 20:02       ` Nicolas Dufresne
  0 siblings, 0 replies; 29+ messages in thread
From: Nicolas Dufresne @ 2023-07-26 20:02 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Michael Grzeschik, linux-kernel, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni

Le mardi 25 juillet 2023 à 11:09 +0200, Paul Kocialkowski a écrit :
> Hi Nicolas,
> 
> On Mon 24 Jul 23, 10:03, Nicolas Dufresne wrote:
> > Le vendredi 21 juillet 2023 à 20:19 +0200, Michael Grzeschik a écrit :
> > > > As a result, we cannot expect that any given encoder is able to produce frames
> > > > for any set of headers. Reporting related constraints and limitations (beyond
> > > > profile/level) seems quite difficult and error-prone.
> > > > 
> > > > So it seems that keeping header generation in-kernel only (close to where the
> > > > hardware is actually configured) is the safest approach.
> > > 
> > > For the case with the rkvenc, the headers are also not created by the
> > > kernel driver. Instead we use the gst_h264_bit_writer_sps/pps functions
> > > that are part of the codecparsers module.
> > 
> > One level of granularity we can add is split headers (like SPS/PPS) and
> > slice/frame headers.
> 
> Do you mean asking the driver to return a buffer with only SPS/PPS and then
> return another buffer with the slice/frame header?
> 
> Looks like there's already a control for it: V4L2_CID_MPEG_VIDEO_HEADER_MODE
> which takes either
> - V4L2_MPEG_VIDEO_HEADER_MODE_SEPARATE: looks like what you're suggesting
> - V4L2_MPEG_VIDEO_HEADER_MODE_JOINED_WITH_1ST_FRAME: usual case
> 
> So that could certainly be supported to easily allow userspace to stuff extra
> NALUs in-between.

Good point, indeed.

> 
> > It remains that in some cases, like HEVC, when the slice
> > header is byte aligned, it can be nice to be able to handle it at application
> > side in order to avoid limiting SVC support (and other creative features) by our
> > API/abstraction limitations.
> 
> Do you see something in the headers that we expect the kernel to generate that
> would need specific changes to support features like SVC?

Getting the kernel to set the layer IDs would just add extra indirection, unless
we have a full SVC configuration. That being said, since we mention HEVC, these
IDs can be modified in-place as they use a fixed number of bytes. If you can
split the headers apart, generating per-layer headers in the application makes a
lot of sense.

Traditionally, slice headers are made by stateless accelerators, but not the
SPS/PPS and friends.

> 
> From what I can see there's a svc_extension_flag that's only set for specific
> NALUs (prefix_nal_unit/lice_layer_extension) so these could be inserted by
> userspace.
> 
> Also I'm not very knowledgeable about SVC so it's not very clear to me if it's
> possible to take an encoder that doesn't support SVC and turn the resulting
> stream into something SVC-ready by adding extra NAL units or if the encoder
> should be a lot more involved.

You can use any encoder to create temporal SVC. It's only about the
referencing pattern, made so you can reduce the framerate (usually
dividing by 2).

For spatial layers, the encoder needs scaling capabilities. I'm not totally sure
how multi-view works, but it is most likely just using the left eye as reference
(never having an I frame for the second eye).

> 
> Also do you know if we have stateful codecs supporting SVC?

We don't at the moment; they all produce headers with the layer ID hardcoded to
0 as far as I'm aware. The general plan (if it had continued) might have been to
offer a menu-based control, where drivers could offer a list of preset SVC
patterns, mimicking what browsers need:

https://www.w3.org/TR/webrtc-svc/

> 
> > I think a certain level of "per CODEC" reasoning is
> > also needed. Just like, I would not want to have to ask the kernel to generate
> > user data SEI and other in-band data.
> 
> Yeah it looks like there is definitely a need for adding extra NALUs from
> userspace without passing that data to the kernel.
> 
> Cheers,
> 
> Paul
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-26 19:53           ` Nicolas Dufresne
@ 2023-07-27  2:45             ` Hsia-Jun Li
  2023-07-27 17:10               ` Nicolas Dufresne
  0 siblings, 1 reply; 29+ messages in thread
From: Hsia-Jun Li @ 2023-07-27  2:45 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, Paul Kocialkowski, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni



On 7/27/23 03:53, Nicolas Dufresne wrote:
> 
> Hi,
> 
> Le mercredi 26 juillet 2023 à 10:49 +0800, Hsia-Jun Li a écrit :
>>> I am strongly against this approach, instead I think we need to keep all
>>> vendor-specific parts in the kernel driver and provide a clean unified userspace
>>> API.
>>>
>> We are driving away vendor participation. Besides, the current design is
>> a performance bottleneck.
> 
> I know you have been hammering this argument for many many years. But in
> concrete situation, we have conducted tests, and we out perform vendors stacks
> that directly hit into hardware register with stateless CODEC. Also, Paul's
> proposal, is that fine grain / highly close to metal tuning of the encoding
> process should endup in the Linux kernel, so that it can benefit from the
> natural hard real-time advantage of a hard IRQ. Just like anything else, we will
In a real case, especially in EDR/DVR and NVR systems, re-encoding can happen 
occasionally. The important thing is to feed the encoding statistics back to the 
controller (userspace); userspace then decides the future operation (whether to 
re-encode this or not).

> find a lot of common methods and shareable code which will benefit in security
The security for a vendor would only mean the protection of its intellectual 
property. Also, userspace and the HAL are isolated in Android. Security and 
quality are not a problem here; you can't even run unverified code.
Or we just define an interface that only FOSS would use.
> and quality, which is very unlike what we normally get from per vendor BSP. The
> strategy is the same as everything else in Linux, vendor will adpot it if there
> is a clear benefit. And better quality, ease of use, good collection of mature
Any vendor that would like to implement a DRM (digital rights, security) video 
pipeline would not even think of this. There are not many vendors that just 
sell plain video codec hardware.

In such a case, we can't even intervene in its memory management; they may 
even drop the V4L2 framework.

Somebody may ask why a vendor would want a stateless codec when it could have 
a dedicated core running firmware. It is simple: if you compare an ARM 
Cortex-R/M core to an ARM application core, which one performs better? A 
remote processor can make the memory model (cache coherency) more complex. 
Besides, it is about the cost.
> userspace software is what makes the difference. It does takes time of course.

Anyway, leaving aside the registers and controls part, I think I can provide 
input on the buffer management part here.

Please DO ***NOT*** make a standard that occupies a lot of memory behind 
userspace's back, or a standard where the user has to handle reconstruction 
buffer holding with a strange mechanism (I mean reconstruction buffer 
lifetimes being managed manually by userspace).
> 
> regards,
> Nicolas

-- 
Hsia-Jun(Randy) Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-27  2:45             ` Hsia-Jun Li
@ 2023-07-27 17:10               ` Nicolas Dufresne
  0 siblings, 0 replies; 29+ messages in thread
From: Nicolas Dufresne @ 2023-07-27 17:10 UTC (permalink / raw)
  To: Hsia-Jun Li
  Cc: linux-kernel, Paul Kocialkowski, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni

Le jeudi 27 juillet 2023 à 10:45 +0800, Hsia-Jun Li a écrit :
> 
> On 7/27/23 03:53, Nicolas Dufresne wrote:
> > 
> > Hi,
> > 
> > Le mercredi 26 juillet 2023 à 10:49 +0800, Hsia-Jun Li a écrit :
> > > > I am strongly against this approach, instead I think we need to keep all
> > > > vendor-specific parts in the kernel driver and provide a clean unified userspace
> > > > API.
> > > > 
> > > We are driving away vendor participation. Besides, the current design is
> > > a performance bottleneck.
> > 
> > 

. . . 

> Or we just define an interface that only FOSS would use.

We explicitly favour FOSS and make APIs that guarantee you can use the driver
with FOSS. This is not something we do in secret; it is fundamental to being a
GPL project. On the DRM side, where the API is a lot more flexible, they
explicitly reject drivers without an actual FOSS user. We don't strictly have to
do that in V4L2, because the API is done at a higher level. But if we were to
come up with a lower-level abstraction, we'd certainly have these rules.

. . .
> 

> 
> Please DO ***NOT*** make a standard that occupied many memory behinds 
> usersace and a standard that user has to handle the reconstruction 
> buffer holding with a strange mechanism(I mean reconstruction buffer 
> lifetime would be manged by userspace manually).

In all fairness, people have limited time and build on top of existing
infrastructure. The reason reconstruction buffers won't be exposed is really
simple to understand: we don't have APIs in the current framework to support all
the allocations happening in codec drivers. If we could not progress without
that, I'm sure finding a solution would become a priority. But the truth is that
we can live without it, and we are aiming to move forward without it.

We can certainly start a thread on the subject, I even have plenty of ideas how
to introduce these without throwing away all the existing stuff. But only if
there is a clear intention to actually implement it. We have plenty on our plate
and exposing reconstruction buffers can certainly wait.

regards,
Nicolas
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-26  8:18   ` Hans Verkuil
@ 2023-08-09 14:43     ` Paul Kocialkowski
  2023-08-09 17:24       ` Andrzej Pietrasiewicz
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-08-09 14:43 UTC (permalink / raw)
  To: Hans Verkuil
  Cc: Nicolas Dufresne, linux-kernel, linux-media, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 5041 bytes --]

Hi Hans,

On Wed 26 Jul 23, 10:18, Hans Verkuil wrote:
> On 11/07/2023 20:18, Nicolas Dufresne wrote:
> > Le mardi 11 juillet 2023 à 19:12 +0200, Paul Kocialkowski a écrit :
> >> Hi everyone!
> >>
> >> After various discussions following Andrzej's talk at EOSS, feedback from the
> >> Media Summit (which I could not attend unfortunately) and various direct
> >> discussions, I have compiled some thoughts and ideas about stateless encoders
> >> support with various proposals. This is the result of a few years of interest
> >> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
> >> which turned out to have numerous design issues.
> >>
> >> I am now working on a H.264 encoder driver for Allwinner platforms (currently
> >> focusing on the V3/V3s), which already provides some usable bitstream and will
> >> be published soon.
> >>
> >> This is a very long email where I've tried to split things into distinct topics
> >> and explain a few concepts to make sure everyone is on the same page.
> >>
> >> # Bitstream Headers
> >>
> >> Stateless encoders typically do not generate all the bitstream headers and
> >> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
> >> headers). There's often some hardware block that makes bit-level writing to the
> >> destination buffer easier (deals with alignment, etc).
> >>
> >> The values of the bitstream headers must be in line with how the compressed
> >> data bitstream is generated and generally follow the codec specification.
> >> Some encoders might allow configuring all the fields found in the headers,
> >> others may only allow configuring a few or have specific constraints regarding
> >> which values are allowed.
> >>
> >> As a result, we cannot expect that any given encoder is able to produce frames
> >> for any set of headers. Reporting related constraints and limitations (beyond
> >> profile/level) seems quite difficult and error-prone.
> >>
> >> So it seems that keeping header generation in-kernel only (close to where the
> >> hardware is actually configured) is the safest approach.
> > 
> > This seems to match with what happened with the Hantro VP8 proof of concept. The
> > encoder does not produce the frame header, but also, it produces 2 encoded
> > buffers which cannot be made contiguous at the hardware level. This notion of
> > plane in coded data wasn't something that blended well with the rest of the API
> > and we didn't want to copy in the kernel while the userspace would also be
> > forced to copy to align the headers. Our conclusion was that it was best to
> > generate the headers and copy both segments before delivering to userspace. I
> > suspect this type of situation will be quite common.
> > 
> >>
> >> # Codec Features
> >>
> >> Codecs have many variable features that can be enabled or not and specific
> >> configuration fields that can take various values. There is usually some
> >> top-level indication of profile/level that restricts what can be used.
> >>
> >> This is a very similar situation to stateful encoding, where codec-specific
> >> controls are used to report and set profile/level and configure these aspects.
> >> A particularly nice thing about it is that we can reuse these existing controls
> >> and add new ones in the future for features that are not yet covered.
> >>
> >> This approach feels more flexible than designing new structures with a selected
> >> set of parameters (that could match the existing controls) for each codec.
> > 
> > Though, reading more into these emails, we still have a fair amount of controls
> > to design and add, probably some compound controls too?
> 
> I expect that for stateless encoders support for read-only requests will be needed:
> 
> https://patchwork.linuxtv.org/project/linux-media/list/?series=5647
> 
> I worked on that in the past together with dynamic control arrays. The dynamic
> array part was merged, but the read-only request part wasn't (there was never a
> driver that actually needed it).
> 
> I don't know if that series still applies, but if there is a need for it then I
> can rebase it and post an RFCv3.

So if I understand this correctly (from a quick look), this would be to allow
stateless encoder drivers to attach a particular control value to a specific
returned frame?

I guess this would be a good match to return statistics about the encoded frame.
However that would probably be expressed in a hardware-specific way so it
seems preferable to not expose this to userspace and handle it in-kernel
instead.

What's really important for userspace to know (in order to do user-side
rate-control, which we definitely want to support) is the resulting bitstream
size. This is already available with bytesused.

So all in all I think we're good with the current status of request support.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-09 14:43     ` Paul Kocialkowski
@ 2023-08-09 17:24       ` Andrzej Pietrasiewicz
  0 siblings, 0 replies; 29+ messages in thread
From: Andrzej Pietrasiewicz @ 2023-08-09 17:24 UTC (permalink / raw)
  To: Paul Kocialkowski, Hans Verkuil
  Cc: Nicolas Dufresne, linux-kernel, linux-media, Sakari Ailus,
	Michael Tretter, Jernej Škrabec, Chen-Yu Tsai,
	Samuel Holland, Thomas Petazzoni

Hi Paul & Hans,

On 9.08.2023 at 16:43, Paul Kocialkowski wrote:
> Hi Hans,
> 
> On Wed 26 Jul 23, 10:18, Hans Verkuil wrote:
>> On 11/07/2023 20:18, Nicolas Dufresne wrote:
>>> On Tuesday, 11 July 2023 at 19:12 +0200, Paul Kocialkowski wrote:
>>>> Hi everyone!
>>>>
>>>> After various discussions following Andrzej's talk at EOSS, feedback from the
>>>> Media Summit (which I could not attend unfortunately) and various direct
>>>> discussions, I have compiled some thoughts and ideas about stateless encoders
>>>> support with various proposals. This is the result of a few years of interest
>>>> in the topic, after working on a PoC for the Hantro H1 using the hantro driver,
>>>> which turned out to have numerous design issues.
>>>>
>>>> I am now working on a H.264 encoder driver for Allwinner platforms (currently
>>>> focusing on the V3/V3s), which already provides some usable bitstream and will
>>>> be published soon.
>>>>
>>>> This is a very long email where I've tried to split things into distinct topics
>>>> and explain a few concepts to make sure everyone is on the same page.
>>>>
>>>> # Bitstream Headers
>>>>
>>>> Stateless encoders typically do not generate all the bitstream headers and
>>>> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
>>>> headers). There's often some hardware block that makes bit-level writing to the
>>>> destination buffer easier (deals with alignment, etc).
>>>>
>>>> The values of the bitstream headers must be in line with how the compressed
>>>> data bitstream is generated and generally follow the codec specification.
>>>> Some encoders might allow configuring all the fields found in the headers,
>>>> others may only allow configuring a few or have specific constraints regarding
>>>> which values are allowed.
>>>>
>>>> As a result, we cannot expect that any given encoder is able to produce frames
>>>> for any set of headers. Reporting related constraints and limitations (beyond
>>>> profile/level) seems quite difficult and error-prone.
>>>>
>>>> So it seems that keeping header generation in-kernel only (close to where the
>>>> hardware is actually configured) is the safest approach.
>>>
>>> This seems to match with what happened with the Hantro VP8 proof of concept. The
>>> encoder does not produce the frame header, but also, it produces 2 encoded
>>> buffers which cannot be made contiguous at the hardware level. This notion of
>>> plane in coded data wasn't something that blended well with the rest of the API
>>> and we didn't want to copy in the kernel while the userspace would also be
>>> forced to copy to align the headers. Our conclusion was that it was best to
>>> generate the headers and copy both segments before delivering to userspace. I
>>> suspect this type of situation will be quite common.
>>>
>>>>
>>>> # Codec Features
>>>>
>>>> Codecs have many variable features that can be enabled or not and specific
>>>> configuration fields that can take various values. There is usually some
>>>> top-level indication of profile/level that restricts what can be used.
>>>>
>>>> This is a very similar situation to stateful encoding, where codec-specific
>>>> controls are used to report and set profile/level and configure these aspects.
>>>> A particularly nice thing about it is that we can reuse these existing controls
>>>> and add new ones in the future for features that are not yet covered.
>>>>
>>>> This approach feels more flexible than designing new structures with a selected
>>>> set of parameters (that could match the existing controls) for each codec.
>>>
>>> Though, reading more into these emails, we still have a fair amount of controls
>>> to design and add, probably some compound controls too?
>>
>> I expect that for stateless encoders, support for read-only requests will be needed:
>>
>> https://patchwork.linuxtv.org/project/linux-media/list/?series=5647
>>
>> I worked on that in the past together with dynamic control arrays. The dynamic
>> array part was merged, but the read-only request part wasn't (there was never a
>> driver that actually needed it).
>>
>> I don't know if that series still applies, but if there is a need for it then I
>> can rebase it and post an RFCv3.
> 
> So if I understand this correctly (from a quick look), this would be to allow
> stateless encoder drivers to attach a particular control value to a specific
> returned frame?
> 
> I guess this would be a good match for returning statistics about the encoded
> frame. However, these would probably be expressed in a hardware-specific way,
> so it seems preferable not to expose them to userspace and to handle them
> in-kernel instead.
> 
> What's really important for userspace to know (in order to do user-side
> rate-control, which we definitely want to support) is the resulting bitstream
> size. This is already available with bytesused.
> 
> So all in all I think we're good with the current status of request support.

Yup, I agree. Initially, while working on VP8 encoding, we introduced
(read-only) requests on the capture queue, but they turned out not to be useful
in this context and we removed them.

Regards.

Andrzej

> 
> Cheers,
> 
> Paul
> 



* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-07-11 17:12 Stateless Encoding uAPI Discussion and Proposal Paul Kocialkowski
  2023-07-11 18:18 ` Nicolas Dufresne
  2023-07-21 18:19 ` Michael Grzeschik
@ 2023-08-10 13:44 ` Paul Kocialkowski
  2023-08-10 14:34   ` Nicolas Dufresne
  2 siblings, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-08-10 13:44 UTC (permalink / raw)
  To: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Nicolas Dufresne, Andrzej Pietrasiewicz, Michael Tretter
  Cc: Jernej Škrabec, Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 13253 bytes --]

Hi folks,

On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> I am now working on a H.264 encoder driver for Allwinner platforms (currently
> focusing on the V3/V3s), which already provides some usable bitstream and will
> be published soon.

So I wanted to share an update on my side since I've been making progress on
the H.264 encoding work for Allwinner platforms. At this point the code supports
IDR, I and P frames, with a single reference. It also supports GOP management
(both closed and open, with an IDR or I frame interval and explicit keyframe
requests) but uses QP controls and does not yet provide rate control. I hope to
implement rate control before we can make a first public release of the code.

One of the main topics of concern now is how reference frames should be managed
and how it should interact with kernel-side GOP management and rate control.

Leaving GOP management to the kernel side implies having it decide which frame
should be IDR, I or P (and B for encoders that can support it), while keeping
the possibility to request a keyframe (IDR) and configure the GOP size. It seems
to me that this is already a good balance: it gives userspace a decent level of
control without requiring it to specify the frame type explicitly for each
frame or to maintain GOP state itself.

Requesting the frame type explicitly seems more fragile, as many situations
would be invalid (e.g. requesting a P frame at the beginning of the stream) and
it generally requires userspace to know a lot about the codec's assumptions.
Also, for B frames the decision would need to be consistent with the fact that
a following frame (in display order) must be submitted earlier than the current
frame, and the kernel would need to be informed so that the picture order count
(display order indication) can be maintained. This is not impossible or out of
reach, but it brings a lot of complexity for little advantage.

Leaving the decision to the kernel side with some hints (whether to force a
keyframe, whether to allow B frames) seems a lot easier, especially for B frames,
since the kernel could just receive frames in order and decide to hold one so
that it can use the next submitted frame as a forward reference for the upcoming
B frame. This requires flushing support, but that is already well in place for
stateful encoders.

The next topic of interest is reference management. It seems pretty clear that
the decision of whether a frame should be a reference or not always needs to be
taken when encoding that frame. In H.264 the nal_ref_idc slice header element
indicates whether a frame is marked as reference or not. IDR frames can
additionally be marked as long-term reference (if I understood correctly, the
frame will stay in the reference picture list until the next IDR frame).
Frames that are marked as reference are added to the l0/l1 lists implicitly
that way and are evicted mostly depending on the number of reference slots
available, or when a new GOP is started.

With the frame type decided by the kernel, it becomes nearly impossible for
userspace to keep track of the reference lists. Userspace would at least need
to know when an IDR frame is produced to flush the reference lists. In addition
it looks like most hardware doesn't have a way to explicitly discard previous
frames that were marked as reference from being used as reference for next
frames. All in all this means that we should expect little control over the
reference frames list.

As a result my updated proposal would be to have userspace only indicate whether
a submitted frame should be marked as a reference or not instead of submitting
an explicit list of previous buffers that should be used as reference, which
would be impossible to honor in many cases.

Additional information gathered:
- It seems likely that the Allwinner Video Engine only supports one reference
  frame. There's a register for specifying the rec buffer of a second one but
  I have never seen the proprietary blob use it. It might be as easy as
  specifying a non-zero address there but it might also be ignored or require
  some undocumented bit to use more than one reference. I haven't made any
  attempt at using it yet.
- Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
  do not support VP8 encode (despite Allwinner's proprietary blob having an
  API for it). The only platform that advertises it is the A80 and this might
  actually be a VP8-only Hantro H1. It seems that the API they developed in the
  library stuck around even if no other platform can use it.

Sorry for the long email again, I'm trying to be a bit more explanatory than
just giving some bare conclusions that I drew on my own.

What do you think about these ideas?

Cheers,

Paul

> 
> This is a very long email where I've tried to split things into distinct topics
> and explain a few concepts to make sure everyone is on the same page.
> 
> # Bitstream Headers
> 
> Stateless encoders typically do not generate all the bitstream headers and
> sometimes no header at all (e.g. Allwinner encoder does not even produce slice
> headers). There's often some hardware block that makes bit-level writing to the
> destination buffer easier (deals with alignment, etc).
> 
> The values of the bitstream headers must be in line with how the compressed
> data bitstream is generated and generally follow the codec specification.
> Some encoders might allow configuring all the fields found in the headers,
> others may only allow configuring a few or have specific constraints regarding
> which values are allowed.
> 
> As a result, we cannot expect that any given encoder is able to produce frames
> for any set of headers. Reporting related constraints and limitations (beyond
> profile/level) seems quite difficult and error-prone.
> 
> So it seems that keeping header generation in-kernel only (close to where the
> hardware is actually configured) is the safest approach.
> 
> # Codec Features
> 
> Codecs have many variable features that can be enabled or not and specific
> configuration fields that can take various values. There is usually some
> top-level indication of profile/level that restricts what can be used.
> 
> This is a very similar situation to stateful encoding, where codec-specific
> controls are used to report and set profile/level and configure these aspects.
> A particularly nice thing about it is that we can reuse these existing controls
> and add new ones in the future for features that are not yet covered.
> 
> This approach feels more flexible than designing new structures with a selected
> set of parameters (that could match the existing controls) for each codec.
> 
> # Reference and Reconstruction Management
> 
> With stateless encoding, we need to tell the hardware which frames need to be
> used as references for encoding the current frame and make sure we have
> these references available as decoded frames in memory.
> 
> Regardless of references, stateless encoders typically need some memory space to
> write the decoded (known as reconstructed) frame while it's being encoded.
> 
> One question here is how many slots for decoded pictures should be allocated
> by the driver when starting to stream. There is usually a maximum number of
> reference frames that can be used at a time, although perhaps there is a use
> case for keeping more around and alternating between them for future references.
> 
> Another question is how the driver should keep track of which frame will be used
> as a reference in the future and which one can be evicted from the pool of
> decoded pictures if it's not going to be used anymore.
> 
> A restrictive approach would be to let the driver alone manage that, similarly
> to how stateful encoders behave. However it might provide extra flexibility
> (and memory gain) to allow userspace to configure the maximum number of possible
> reference frames. In that case it becomes necessary to indicate if a given
> frame will be used as a reference in the future (maybe using a buffer flag)
> and to indicate which previous reference frames (probably to be identified with
> the matching output buffer's timestamp) should be used for the current encode.
> This could be done with a new dedicated control (as a variable-sized array of
> timestamps). Note that userspace would have to update it for every frame or the
> reference frames will remain the same for future encodes.
> 
> The driver will then make sure to keep the reconstructed buffer around, in one
> of the slots. When there's no slot left, the driver will drop the oldest
> reference it has (maybe with a bounce buffer to still allow it to be used as a
> reference for the current encode).
> 
> With this behavior defined in the uAPI spec, userspace will also be able to
> keep track of which previous frame is no longer allowed as a reference.
> 
> # Frame Types
> 
> Stateless encoder drivers will typically instruct the hardware to encode either
> an intra-coded or an inter-coded frame. While a stream composed only of a single
> intra-coded frame followed by only inter-coded frames is possible, it's
> generally not desirable as it is not very robust against data loss and makes
> seeking difficult.
> 
> As a result, the frame type is usually decided based on a given GOP size
> (the frequency at which a new intra-coded frame is produced) while intra-coded
> frames can also be explicitly requested. Stateful encoders implement
> these through dedicated controls:
> - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> 
> It seems that reusing them would be possible, which would let the driver decide
> on the particular frame type.
> 
> However it makes the reference frame management a bit trickier since reference
> frames might be requested from userspace for a frame that ends up being
> intra-coded. We can either allow this and silently ignore the info, or expect
> that userspace keeps track of the GOP index and does not send references on the
> first frame.
> 
> In some codecs, there's also a notion of barrier key-frames (IDR frames in
> H.264) that strictly forbid using any past reference beyond the frame.
> There seems to be an assumption that the GOP start uses this kind of frame
> (and not any intra-coded frame), while the force key frame control does not
> particularly specify it.
> 
> In that case we should flush the list of references and userspace should no
> longer provide references to them for future frames. This puts a requirement on
> userspace to keep track of GOP start in order to know when to flush its
> reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> could also indicate a general intra-coded frame that is not a barrier.
> 
> So another possibility would be for userspace to explicitly indicate which
> frame type to use (in a codec-specific way) and act accordingly, leaving any
> notion of GOP up to userspace. I feel like this might be the easiest approach
> while giving an extra degree of control to userspace.
> 
> # Rate Control
> 
> Another important feature of encoders is the ability to control the amount of
> data produced following different rate control strategies. Stateful encoders
> typically do this in-firmware and expose controls for selecting the strategy
> and associated targets.
> 
> It seems desirable to support both automatic and manual rate-control to
> userspace.
> 
> Automatic control would be implemented kernel-side (with algos possibly shared
> across drivers) and reuse existing stateful controls. The advantage is
> simplicity (userspace does not need to carry its own rate-control
> implementation) and to ensure that there is a built-in mechanism for common
> strategies available for every driver (no mandatory dependency on a proprietary
> userspace stack). There may also be extra statistics or controls available to
> the driver that allow finer-grain control.
> 
> Manual control allows userspace to get creative and requires the ability to set
> the quantization parameter (QP) directly for each frame (controls for this
> already exist, as many stateful encoders also support it).
> 
> # Regions of Interest
> 
> Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> be prioritized for quality. Stateless encoders typically support a limited
> number and allow setting specific QP values for these regions.
> 
> While the QP value should be used directly in manual rate-control, we probably
> want to have some "level of importance" setting for kernel-side rate-control,
> along with the dimensions/position of each ROI. This could be expressed with
> a new structure containing all these elements and presented as a variable-sized
> array control with as many elements as the hardware can support.
> 
> -- 
> Paul Kocialkowski, Bootlin
> Embedded Linux and kernel engineering
> https://bootlin.com



-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com



* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-10 13:44 ` Paul Kocialkowski
@ 2023-08-10 14:34   ` Nicolas Dufresne
  2023-08-11 20:08     ` Paul Kocialkowski
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-08-10 14:34 UTC (permalink / raw)
  To: Paul Kocialkowski, linux-kernel, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter
  Cc: Jernej Škrabec, Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

On Thursday, 10 August 2023 at 15:44 +0200, Paul Kocialkowski wrote:
> Hi folks,
> 
> On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > focusing on the V3/V3s), which already provides some usable bitstream and will
> > be published soon.
> 
> So I wanted to share an update on my side since I've been making progress on
> the H.264 encoding work for Allwinner platforms. At this point the code supports
> IDR, I and P frames, with a single reference. It also supports GOP management
> (both closed and open, with an IDR or I frame interval and explicit keyframe
> requests) but uses QP controls and does not yet provide rate control. I hope to
> implement rate control before we can make a first public release of the code.

Just a reminder that we will code-review the API first; the supporting
implementation will just be a companion. So in this context, the sooner the
better for an RFC here.

> 
> One of the main topics of concern now is how reference frames should be managed
> and how it should interact with kernel-side GOP management and rate control.

Maybe we need to have a discussion about kernel-side GOP management first?
While I think kernel-side rate control is unavoidable, I don't think stateless
encoders should have kernel-side GOP management.

> 
> Leaving GOP management to the kernel side implies having it decide which frame
> should be IDR, I or P (and B for encoders that can support it), while keeping
> the possibility to request a keyframe (IDR) and configure the GOP size. It seems
> to me that this is already a good balance: it gives userspace a decent level of
> control without requiring it to specify the frame type explicitly for each
> frame or to maintain GOP state itself.

My expectation for a stateless encoder is to have to specify the frame type and
the associated references if the type requires them.

> 
> Requesting the frame type explicitly seems more fragile, as many situations
> would be invalid (e.g. requesting a P frame at the beginning of the stream) and
> it generally requires userspace to know a lot about the codec's assumptions.
> Also, for B frames the decision would need to be consistent with the fact that
> a following frame (in display order) must be submitted earlier than the current
> frame, and the kernel would need to be informed so that the picture order count
> (display order indication) can be maintained. This is not impossible or out of
> reach, but it brings a lot of complexity for little advantage.

We have had a lot more consistent results over the last decade with stateless
hardware codecs, in contrast to stateful ones, where we end up with wide
variation in behaviour. This applies to Chromium, GStreamer and any active user
of VA encoders, really. I'm strongly in favour of a stateless reference API kept
out of the Linux kernel.

> 
> Leaving the decision to the kernel side with some hints (whether to force a
> keyframe, whether to allow B frames) seems a lot easier, especially for B frames,
> since the kernel could just receive frames in order and decide to hold one so
> that it can use the next submitted frame as a forward reference for the upcoming
> B frame. This requires flushing support, but that is already well in place for
> stateful encoders.

No, it's a lot harder for users. The placement of keyframes should be bound to
various image analyses and streaming conditions, like scene change detection and
network traffic, but also, I strictly don't want to depend on the Linux kernel
when it's time to implement a custom reference tree. In general, stateful
encoders are never up to the game of modern RTP features and other fancy robust
referencing models. I overall have to disagree with your proposed approach. I
believe we have to create a stateless encoder interface and not completely
abstract this hardware behind our existing stateful interface. We should take
advantage of the nature of the hardware to make simpler and safer drivers.

> 
> The next topic of interest is reference management. It seems pretty clear that
> the decision of whether a frame should be a reference or not always needs to be
> taken when encoding that frame. In H.264 the nal_ref_idc slice header element
> indicates whether a frame is marked as reference or not. IDR frames can
> additionally be marked as long-term reference (if I understood correctly, the
> frame will stay in the reference picture list until the next IDR frame).

This is incorrect. Any frame can be marked as a long-term reference, it does
not matter what type it is. From what I recall, marking a long-term reference in
the bitstream uses an explicit index, so there are no specific rules on which
one gets evicted. Long-term references are of course limited, as they occupy
space in the DPB. Also, each codec has different DPB semantics. For H.264, the
DPB can run in two modes. The first is a simple FIFO: in this case, any frame
you encode and want to keep as a reference is pushed into the DPB (which has a
fixed size, minus the long-term references). If full, the oldest frame is
removed. It is not bound to IDR or GOP. Though, an IDR will implicitly cause the
decoder to evict everything (including long-term references).

The second mode uses the memory management control operations. This is a series
of instructions that the encoder can send to the decoder. The specification is
quite complex; it is a common source of bugs in decoders, and an area where
stateless hardware codecs perform more consistently in general. Through these
commands, the encoder ensures that the decoder's DPB representation stays in
sync.

> Frames that are marked as reference are added to the l0/l1 lists implicitly
> that way and are evicted mostly depending on the number of reference slots
> available, or when a new GOP is started.

Be aware that "slots" is a hardware implementation detail. I think the term can
be used for any MPEG codec, but be careful, since slots in the AV1 specification
have a completely different meaning. Generalizing the notion of slots will
create confusion.

> 
> With the frame type decided by the kernel, it becomes nearly impossible for
> userspace to keep track of the reference lists. Userspace would at least need
> to know when an IDR frame is produced to flush the reference lists. In addition
> it looks like most hardware doesn't have a way to explicitly discard previous
> frames that were marked as reference from being used as reference for next
> frames. All in all this means that we should expect little control over the
> reference frames list.
> 
> As a result my updated proposal would be to have userspace only indicate whether
> a submitted frame should be marked as a reference or not instead of submitting
> an explicit list of previous buffers that should be used as reference, which
> would be impossible to honor in many cases.
> 
> Additional information gathered:
> - It seems likely that the Allwinner Video Engine only supports one reference
>   frame. There's a register for specifying the rec buffer of a second one but
>   I have never seen the proprietary blob use it. It might be as easy as
>   specifying a non-zero address there but it might also be ignored or require
>   some undocumented bit to use more than one reference. I haven't made any
>   attempt at using it yet.

There is something in that fact that makes me think of the Hantro H1. The
Hantro H1 also has a second reference, but no one ever uses it. It's on our
todo list to actually give this a look.

> - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
>   do not support VP8 encode (despite Allwinner's proprietary blob having an
>   API for it). The only platform that advertises it is the A80 and this might
>   actually be a VP8-only Hantro H1. It seems that the API they developed in the
>   library stuck around even if no other platform can use it.

Thanks for letting us know. Our assumption is that a second hardware design is
unlikely, as Google was giving it away for free to any hardware maker that
wanted it.

> 
> Sorry for the long email again, I'm trying to be a bit more explanatory than
> just giving some bare conclusions that I drew on my own.
> 
> What do you think about these ideas?

In general, we diverge on the direction we want the interface to take. What you
seem to describe now is just a normal stateful encoder interface, with
everything needed to drive the stateless hardware implemented in the Linux
kernel. There is no parsing or other unsafety in encoders, so I don't have a
strict no-go argument against that, but for me, it means much more complex
drivers and less flexibility. The VA model has been working great for us in the
past, giving us the ability to implement new features, or even slightly off-spec
features, while the Linux kernel might not be the right place for these
experimental methods.

Personally, I would rather discuss your uAPI RFC though; I think a lot of
other devs here would like to see what you have drafted.

Nicolas

> 
> Cheers,
> 
> Paul
> 
> > 
> > This is a very long email where I've tried to split things into distinct topics
> > and explain a few concepts to make sure everyone is on the same page.
> > 
> > # Bitstream Headers
> > 
> > Stateless encoders typically do not generate all the bitstream headers and
> > sometimes no header at all (e.g. Allwinner encoder does not even produce slice
> > headers). There's often some hardware block that makes bit-level writing to the
> > destination buffer easier (deals with alignment, etc).
> > 
> > The values of the bitstream headers must be in line with how the compressed
> > data bitstream is generated and generally follow the codec specification.
> > Some encoders might allow configuring all the fields found in the headers,
> > others may only allow configuring a few or have specific constraints regarding
> > which values are allowed.
> > 
> > As a result, we cannot expect that any given encoder is able to produce frames
> > for any set of headers. Reporting related constraints and limitations (beyond
> > profile/level) seems quite difficult and error-prone.
> > 
> > So it seems that keeping header generation in-kernel only (close to where the
> > hardware is actually configured) is the safest approach.
> > 
> > # Codec Features
> > 
> > Codecs have many variable features that can be enabled or not and specific
> > configuration fields that can take various values. There is usually some
> > top-level indication of profile/level that restricts what can be used.
> > 
> > This is a very similar situation to stateful encoding, where codec-specific
> > controls are used to report and set profile/level and configure these aspects.
> > A particularly nice thing about it is that we can reuse these existing controls
> > and add new ones in the future for features that are not yet covered.
> > 
> > This approach feels more flexible than designing new structures with a selected
> > set of parameters (that could match the existing controls) for each codec.
> > 
> > # Reference and Reconstruction Management
> > 
> > With stateless encoding, we need to tell the hardware which frames need to be
> > used as references for encoding the current frame and make sure we have
> > these references available as decoded frames in memory.
> > 
> > Regardless of references, stateless encoders typically need some memory space to
> > write the decoded (known as reconstructed) frame while it's being encoded.
> > 
> > One question here is how many slots for decoded pictures should be allocated
> > by the driver when starting to stream. There is usually a maximum number of
> > reference frames that can be used at a time, although perhaps there is a use
> > case for keeping more around and alternating between them for future references.
> > 
> > Another question is how the driver should keep track of which frame will be used
> > as a reference in the future and which one can be evicted from the pool of
> > decoded pictures if it's not going to be used anymore.
> > 
> > A restrictive approach would be to let the driver alone manage that, similarly
> > to how stateful encoders behave. However it might provide extra flexibility
> > (and memory gain) to allow userspace to configure the maximum number of possible
> > reference frames. In that case it becomes necessary to indicate if a given
> > frame will be used as a reference in the future (maybe using a buffer flag)
> > and to indicate which previous reference frames (probably to be identified with
> > the matching output buffer's timestamp) should be used for the current encode.
> > This could be done with a new dedicated control (as a variable-sized array of
> > timestamps). Note that userspace would have to update it for every frame or the
> > reference frames will remain the same for future encodes.
> > 
> > The driver will then make sure to keep the reconstructed buffer around, in one
> > of the slots. When there's no slot left, the driver will drop the oldest
> > reference it has (maybe with a bounce buffer to still allow it to be used as a
> > reference for the current encode).
> > 
> > With this behavior defined in the uAPI spec, userspace will also be able to
> > keep track of which previous frame is no longer allowed as a reference.
> > 
> > # Frame Types
> > 
> > Stateless encoder drivers will typically instruct the hardware to encode either
> > an intra-coded or an inter-coded frame. While a stream composed only of a single
> > intra-coded frame followed by only inter-coded frames is possible, it's
> > generally not desirable as it is not very robust against data loss and makes
> > seeking difficult.
> > 
> > As a result, the frame type is usually decided based on a given GOP size
> > (the frequency at which a new intra-coded frame is produced) while intra-coded
> > frames can also be explicitly requested on demand. Stateful encoders implement
> > these through dedicated controls:
> > - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> > - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> > - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> > 
> > It seems that reusing them would be possible, which would let the driver decide
> > on the particular frame type.
> > 
> > However it makes the reference frame management a bit trickier since reference
> > frames might be requested from userspace for a frame that ends up being
> > intra-coded. We can either allow this and silently ignore the info or expect
> > that userspace keeps track of the GOP index and not send references on the first
> > frame.
> > 
> > In some codecs, there's also a notion of barrier key-frames (IDR frames in
> > H.264) that strictly forbid using any past reference beyond the frame.
> > There seems to be an assumption that the GOP start uses this kind of frame
> > (and not any intra-coded frame), while the force key frame control does not
> > particularly specify it.
> > 
> > In that case we should flush the list of references and userspace should no
> > longer provide references to them for future frames. This puts a requirement on
> > userspace to keep track of GOP start in order to know when to flush its
> > reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> > could also indicate a general intra-coded frame that is not a barrier.
> > 
> > So another possibility would be for userspace to explicitly indicate which
> > frame type to use (in a codec-specific way) and act accordingly, leaving any
> > notion of GOP up to userspace. I feel like this might be the easiest approach
> > while giving an extra degree of control to userspace.
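[Editorial sketch] For illustration, the userspace-driven GOP alternative could be as simple as the following C sketch; the frame type names and state layout are made up, not actual uAPI values.

```c
#include <assert.h>

enum frame_type { FRAME_IDR, FRAME_I, FRAME_P };

/* Userspace keeps the GOP state and picks the type for each submitted
 * frame, flushing its own reference list whenever it emits an IDR. */
struct gop_state {
	unsigned int gop_size; /* distance between IDR frames */
	unsigned int index;    /* position within the current GOP */
};

static enum frame_type next_frame_type(struct gop_state *gop, int force_key)
{
	if (force_key || gop->index == 0 || gop->index >= gop->gop_size) {
		gop->index = 1;
		return FRAME_IDR; /* barrier: references must be flushed */
	}
	gop->index++;
	return FRAME_P;
}
```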
> > 
> > # Rate Control
> > 
> > Another important feature of encoders is the ability to control the amount of
> > data produced following different rate control strategies. Stateful encoders
> > typically do this in-firmware and expose controls for selecting the strategy
> > and associated targets.
> > 
> > It seems desirable to support both automatic and manual rate-control to
> > userspace.
> > 
> > Automatic control would be implemented kernel-side (with algos possibly shared
> > across drivers) and reuse existing stateful controls. The advantage is
> > simplicity (userspace does not need to carry its own rate-control
> > implementation) and to ensure that there is a built-in mechanism for common
> > strategies available for every driver (no mandatory dependency on a proprietary
> > userspace stack). There may also be extra statistics or controls available to
> > the driver that allow finer-grain control.
> > 
> > Manual control allows userspace to get creative and requires the ability to set
> > the quantization parameter (QP) directly for each frame (controls already exist,
> > as many stateful encoders also support it).
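[Editorial sketch] As a toy example of the kind of manual strategy userspace could run on top of a per-frame QP control; the thresholds and lower clamp are arbitrary assumptions, only the 51 upper bound comes from H.264.

```c
#include <assert.h>

/* Naive proportional rate control: nudge the next frame's QP up when the
 * last encoded frame overshot its bit budget, down when it undershot. */
static int next_qp(int qp, unsigned int frame_bits, unsigned int target_bits)
{
	if (frame_bits > target_bits + target_bits / 10)
		qp++; /* too big: quantize more coarsely */
	else if (frame_bits + target_bits / 10 < target_bits)
		qp--; /* too small: spend more bits */

	if (qp < 10)
		qp = 10; /* arbitrary lower clamp */
	if (qp > 51)
		qp = 51; /* H.264 QP upper bound */
	return qp;
}
```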
> > 
> > # Regions of Interest
> > 
> > Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> > be prioritized for quality. Stateless encoders typically support a limited
> > number and allow setting specific QP values for these regions.
> > 
> > While the QP value should be used directly in manual rate-control, we probably
> > want to have some "level of importance" setting for kernel-side rate-control,
> > along with the dimensions/position of each ROI. This could be expressed with
> > a new structure containing all these elements and presented as a variable-sized
> > array control with as many elements as the hardware can support.
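[Editorial sketch] One array element of such a control might look like the struct below; the name and fields are hypothetical, not an existing uAPI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ROI entry for a variable-sized array control: with manual
 * rate control `qp` would apply directly, with kernel-side rate control
 * the driver would interpret `priority` instead. */
struct v4l2_enc_roi {
	uint32_t left, top;     /* position in pixels */
	uint32_t width, height; /* dimensions in pixels */
	int32_t  qp;            /* absolute QP (manual rate control) */
	uint32_t priority;      /* relative importance (kernel rate control) */
};
```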
> > 
> > -- 
> > Paul Kocialkowski, Bootlin
> > Embedded Linux and kernel engineering
> > https://bootlin.com
> 
> 
> 



* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-10 14:34   ` Nicolas Dufresne
@ 2023-08-11 20:08     ` Paul Kocialkowski
  2023-08-21 15:13       ` Nicolas Dufresne
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-08-11 20:08 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 21754 bytes --]

Hi Nicolas,

On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
> Le jeudi 10 août 2023 à 15:44 +0200, Paul Kocialkowski a écrit :
> > Hi folks,
> > 
> > On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> > > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > > focusing on the V3/V3s), which already provides some usable bitstream and will
> > > be published soon.
> > 
> > So I wanted to share an update on my side since I've been making progress on
> > the H.264 encoding work for Allwinner platforms. At this point the code supports
> > IDR, I and P frames, with a single reference. It also supports GOP (both closed
> > and open with IDR or I frame interval and explicit keyframe request) but uses
> > QP controls and does not yet provide rate control. I hope to be able to
> > implement rate-control before we can make a first public release of the code.
> 
> Just a reminder that we will code review the API first, the supporting
> implementation will just be companion. So in this context, the sooner the better
> for an RFC here.

I definitely want to have some proposal that is (even vaguely) agreed upon
before proposing patches for mainline, even at the stage of RFC.

While I already have working results at this point, the API that is used is
very basic and just reuses controls from stateful encoders, with no extra
addition. Various assumptions are made in the kernel and there is no real
reference management, since the previous frame is always expected to be used
as the only reference.

We plan to make a public release at some point in the near future which shows
these working results, but it will not be a base for our discussion here yet.

> > One of the main topics of concern now is how reference frames should be managed
> > and how it should interact with kernel-side GOP management and rate control.
> 
> Maybe we need to have a discussion about kernel-side GOP management first?
> While I think kernel-side rate control is unavoidable, I don't think stateless
> encoders should have kernel-side GOP management.

I don't have strong opinions about this. The rationale for my proposal is that
kernel-side rate control will be quite difficult to operate without knowledge
of the period at which intra/inter frames are produced. Maybe there are known
methods to handle this, but I have the impression that most rate control
implementations use the GOP size as a parameter.

More generally I think an expectation behind rate control is to be able to
decide at which time a specific frame type is produced. This is not possible if
the decision is entirely up to userspace.

> > Leaving GOP management to the kernel-side implies having it decide which frame
> > should be IDR, I or P (and B for encoders that can support it), while keeping
> > the possibility to request a keyframe (IDR) and configure GOP size. Now it seems
> > to me that this is already a good balance between giving userspace a decent
> > level of control while not having to specify the frame type explicitly for each
> > frame or maintain a GOP in userspace.
> 
> My expectation for stateless encoders is to have to specify the frame type and
> the associated references if the type requires it.
> 
> > 
> > Requesting the frame type explicitly seems more fragile as many situations will
> > be invalid (e.g. requesting a P frame at the beginning of the stream, etc) and
> > it generally requires userspace to know a lot about what the codec assumptions
> > are. Also for B frames the decision would need to be consistent with the fact
> > that a following frame (in display order) would need to be submitted earlier
> > than the current frame and inform the kernel so that the picture order count
> > (display order indication) can be maintained. This is not impossible or out of
> > reach, but it brings a lot of complexity for little advantage.
> 
> We have had a lot more consistent results over the last decade with stateless
> hardware codecs, in contrast to stateful ones where we end up with wide
> variation in behaviour. This applies to Chromium, GStreamer and any active
> users of VA encoders really. I'm strongly in favour of a stateless reference
> API out of the Linux kernel.

Okay, I understand the lower level of control makes it possible to get much better
results than opaque firmware-driven encoders and it would be a shame to not
leverage this possibility with an API that is too restrictive.

However I do think it should be possible to operate the encoder without a lot
of codec-specific supporting code from userspace. This is also why I like having
kernel-side rate control (among other reasons).

> > Leaving the decision to the kernel side with some hints (whether to force a
> > keyframe, whether to allow B frames) seems a lot easier, especially for B frames
> > since the kernel could just receive frames in-order and decide to hold one
> > so that it can use the next frame submitted as a forward reference for this
> > upcoming B frame. This requires flushing support but it's already well in place
> > for stateful encoders.
> 
> No, it's a lot harder for users. The placement of keyframes should be bound to
> various image analyses and streaming conditions like scene change detection and
> network traffic. Also, I strictly don't want to depend on the Linux kernel
> when it's time to implement a custom reference tree. In general, stateful
> encoders are never up to the game of modern RTP features and other fancy
> robust referencing models.

That is a fair point.

> I overall have to disagree with your proposed approach. I believe we have to
> create a stateless encoder interface and not completely abstract this hardware
> behind our existing stateful interface. We should take advantage of the nature
> of the hardware to make simpler and safer drivers.

Understood.

> > The next topic of interest is reference management. It seems pretty clear that
> > the decision of whether a frame should be a reference or not always needs to be
> > taken when encoding that frame. In H.264 the nal_ref_idc slice header element
> > indicates whether a frame is marked as reference or not. IDR frames can
> > additionally be marked as long-term reference (if I understood correctly, the
> > frame will stay in the reference picture list until the next IDR frame).
> 
> This is incorrect. Any frame can be marked as a long-term reference, it does
> not matter what type it is. From what I recall, marking of long terms in the
> bitstream uses an explicit index, so there are no specific rules on which one
> gets evicted. Long terms are of course limited as they occupy space in the DPB.
> Also, each codec has different DPB semantics. For H.264, the DPB can run in two
> modes. The first is a simple FIFO: in this case, any frame you encode and want
> to keep as a reference is pushed into the DPB (which has a fixed size, minus
> the long terms). If full, the oldest frame is removed. It is not bound to IDR
> or GOP. Though, an IDR will implicitly cause the decoder to evict everything
> (including long terms).
> 
> The second mode uses the memory management commands. This is a series of
> instructions that the encoder can send to the decoder. The specification is
> quite complex; it is a common source of bugs in decoders and a place where
> stateless hardware codecs perform more consistently in general. Through the
> commands, the encoder ensures that the decoder's DPB representation stays in
> sync.

This is also what I understand from repeated reading of the spec and thanks for
the summary write-up!

My assumption was that it would be preferable to operate in the simple fifo
mode since the memory management commands need to be added to the bitstream
headers and require coordination from the kernel. Like you said it seems complex
and error-prone.

But maybe this mechanism could be used to allow any particular reference frame
configuration, opening the way for userspace to fully decide what the reference
buffer lists are? Also it would be good to know if such mechanisms are generally
present in codecs or if most of them have an implicit reference list that cannot
be modified.

> > Frames that are marked as reference are added to the l0/l1 lists implicitly
> > that way and are evicted mostly depending on the number of reference slots
> > available, or when a new GOP is started.
> 
> Be aware that "slots" is a hardware implementation detail. I think it can be
> used for any MPEG codec, but be careful since slots in the AV1 specification
> have a completely different meaning. Generalization of slots will create
> confusion.
>
> > 
> > With the frame type decided by the kernel, it becomes nearly impossible for
> > userspace to keep track of the reference lists. Userspace would at least need
> > to know when an IDR frame is produced to flush the reference lists. In addition
> > it looks like most hardware doesn't have a way to explicitly discard previous
> > frames that were marked as reference from being used as reference for next
> > frames. All in all this means that we should expect little control over the
> > reference frames list.
> > 
> > As a result my updated proposal would be to have userspace only indicate whether
> > a submitted frame should be marked as a reference or not instead of submitting
> > an explicit list of previous buffers that should be used as reference, which
> > would be impossible to honor in many cases.
> > 
> > Additional information gathered:
> > - It seems likely that the Allwinner Video Engine only supports one reference
> >   frame. There's a register for specifying the rec buffer of a second one but
> >   I have never seen the proprietary blob use it. It might be as easy as
> >   specifying a non-zero address there but it might also be ignored or require
> >   some undocumented bit to use more than one reference. I haven't made any
> >   attempt at using it yet.
> 
> There is something in that fact that makes me think of the Hantro H1. The
> Hantro H1 also has a second reference, but no one ever uses it. We have it on
> our todo list to actually give this a look.

Having looked at both register layouts, I would tend to think both designs
are distinct. It's still unclear where Allwinner's video engine comes from:
perhaps they made it in-house, perhaps some obscure Chinese design house made it
for them or it could be known hardware with a modified register layout.

I would also be interested to know if the H1 can do more than one reference!

> > - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
> >   do not support VP8 encode (despite Allwinner's proprietary blob having an
> >   API for it). The only platform that advertises it is the A80 and this might
> >   actually be a VP8-only Hantro H1. It seems that the API they developed in the
> >   library stuck around even if no other platform can use it.
> 
> Thanks for letting us know. Our assumption is that a second hardware design is
> unlikely as Google was giving it for free to any hardware makers that wanted it.
> 
> > 
> > Sorry for the long email again, I'm trying to be a bit more explanatory than
> > just giving some bare conclusions that I drew on my own.
> > 
> > What do you think about these ideas?
> 
> In general, we diverge on the direction we want the interface to be. What you
> seem to describe now is just a normal stateful encoder interface with everything
> needed to drive the stateless hardware implemented in the Linux kernel. There is
> no parsing or other unsafety in encoders, so I don't have a strict no-go
> argument for that, but for me, it means much more complex drivers and lesser
> flexibility. The VA model have been working great for us in the past, giving us
> the ability to implement new feature, or even slightly of spec features. While,
> the Linux kernel might not be the right place for these experimental methods.

VA seems too low-level for our case here, as it seems to expect full control
over more or less each bitstream parameter that will be produced.

I think we have to find some middle-ground that is not as limiting as stateful
encoders but not as low-level as VA.

> Personally, I would rather discuss around your uAPI RFC though, I think a lot of
> other devs here would like to see what you have drafted.

Hehe I wish I had some advanced proposal here but my implementation is quite
simplified compared to what we have to plan for mainline.

Cheers,

Paul

> Nicolas
> 
> > 
> > Cheers,
> > 
> > Paul
> > 
> > > 
> > > This is a very long email where I've tried to split things into distinct topics
> > > and explain a few concepts to make sure everyone is on the same page.
> > > 
> > > # Bitstream Headers
> > > 
> > > Stateless encoders typically do not generate all the bitstream headers and
> > > sometimes no header at all (e.g. Allwinner encoder does not even produce slice
> > > headers). There's often some hardware block that makes bit-level writing to the
> > > destination buffer easier (deals with alignment, etc).
> > > 
> > > The values of the bitstream headers must be in line with how the compressed
> > > data bitstream is generated and generally follow the codec specification.
> > > Some encoders might allow configuring all the fields found in the headers,
> > > others may only allow configuring a few or have specific constraints regarding
> > > which values are allowed.
> > > 
> > > As a result, we cannot expect that any given encoder is able to produce frames
> > > for any set of headers. Reporting related constraints and limitations (beyond
> > > profile/level) seems quite difficult and error-prone.
> > > 
> > > So it seems that keeping header generation in-kernel only (close to where the
> > > hardware is actually configured) is the safest approach.
> > > 
> > > # Codec Features
> > > 
> > > Codecs have many variable features that can be enabled or not and specific
> > > configuration fields that can take various values. There is usually some
> > > top-level indication of profile/level that restricts what can be used.
> > > 
> > > This is a very similar situation to stateful encoding, where codec-specific
> > > controls are used to report and set profile/level and configure these aspects.
> > > A particularly nice thing about it is that we can reuse these existing controls
> > > and add new ones in the future for features that are not yet covered.
> > > 
> > > This approach feels more flexible than designing new structures with a selected
> > > set of parameters (that could match the existing controls) for each codec.
> > > 
> > > # Reference and Reconstruction Management
> > > 
> > > With stateless encoding, we need to tell the hardware which frames need to be
> > > used as references for encoding the current frame and make sure we have
> > > these references available as decoded frames in memory.
> > > 
> > > Regardless of references, stateless encoders typically need some memory space to
> > > write the decoded (known as reconstructed) frame while it's being encoded.
> > > 
> > > One question here is how many slots for decoded pictures should be allocated
> > > by the driver when starting to stream. There is usually a maximum number of
> > > reference frames that can be used at a time, although perhaps there is a use
> > > case for keeping more around and alternating between them for future references.
> > > 
> > > Another question is how the driver should keep track of which frame will be used
> > > as a reference in the future and which one can be evicted from the pool of
> > > decoded pictures if it's not going to be used anymore.
> > > 
> > > A restrictive approach would be to let the driver alone manage that, similarly
> > > to how stateful encoders behave. However it might provide extra flexibility
> > > (and memory gain) to allow userspace to configure the maximum number of possible
> > > reference frames. In that case it becomes necessary to indicate if a given
> > > frame will be used as a reference in the future (maybe using a buffer flag)
> > > and to indicate which previous reference frames (probably to be identified with
> > > the matching output buffer's timestamp) should be used for the current encode.
> > > This could be done with a new dedicated control (as a variable-sized array of
> > > timestamps). Note that userspace would have to update it for every frame or the
> > > reference frames will remain the same for future encodes.
> > > 
> > > The driver will then make sure to keep the reconstructed buffer around, in one
> > > of the slots. When there's no slot left, the driver will drop the oldest
> > > reference it has (maybe with a bounce buffer to still allow it to be used as a
> > > reference for the current encode).
> > > 
> > > With this behavior defined in the uAPI spec, userspace will also be able to
> > > keep track of which previous frame is no longer allowed as a reference.
> > > 
> > > # Frame Types
> > > 
> > > Stateless encoder drivers will typically instruct the hardware to encode either
> > > an intra-coded or an inter-coded frame. While a stream composed only of a single
> > > intra-coded frame followed by only inter-coded frames is possible, it's
> > > generally not desirable as it is not very robust against data loss and makes
> > > seeking difficult.
> > > 
> > > As a result, the frame type is usually decided based on a given GOP size
> > > (the frequency at which a new intra-coded frame is produced) while intra-coded
> > > frames can also be explicitly requested on demand. Stateful encoders implement
> > > these through dedicated controls:
> > > - V4L2_CID_MPEG_VIDEO_FORCE_KEY_FRAME
> > > - V4L2_CID_MPEG_VIDEO_GOP_SIZE
> > > - V4L2_CID_MPEG_VIDEO_H264_I_PERIOD
> > > 
> > > It seems that reusing them would be possible, which would let the driver decide
> > > on the particular frame type.
> > > 
> > > However it makes the reference frame management a bit trickier since reference
> > > frames might be requested from userspace for a frame that ends up being
> > > intra-coded. We can either allow this and silently ignore the info or expect
> > > that userspace keeps track of the GOP index and not send references on the first
> > > frame.
> > > 
> > > In some codecs, there's also a notion of barrier key-frames (IDR frames in
> > > H.264) that strictly forbid using any past reference beyond the frame.
> > > There seems to be an assumption that the GOP start uses this kind of frame
> > > (and not any intra-coded frame), while the force key frame control does not
> > > particularly specify it.
> > > 
> > > In that case we should flush the list of references and userspace should no
> > > longer provide references to them for future frames. This puts a requirement on
> > > userspace to keep track of GOP start in order to know when to flush its
> > > reference list. It could also check if V4L2_BUF_FLAG_KEYFRAME is set, but this
> > > could also indicate a general intra-coded frame that is not a barrier.
> > > 
> > > So another possibility would be for userspace to explicitly indicate which
> > > frame type to use (in a codec-specific way) and act accordingly, leaving any
> > > notion of GOP up to userspace. I feel like this might be the easiest approach
> > > while giving an extra degree of control to userspace.
> > > 
> > > # Rate Control
> > > 
> > > Another important feature of encoders is the ability to control the amount of
> > > data produced following different rate control strategies. Stateful encoders
> > > typically do this in-firmware and expose controls for selecting the strategy
> > > and associated targets.
> > > 
> > > It seems desirable to support both automatic and manual rate-control to
> > > userspace.
> > > 
> > > Automatic control would be implemented kernel-side (with algos possibly shared
> > > across drivers) and reuse existing stateful controls. The advantage is
> > > simplicity (userspace does not need to carry its own rate-control
> > > implementation) and to ensure that there is a built-in mechanism for common
> > > strategies available for every driver (no mandatory dependency on a proprietary
> > > userspace stack). There may also be extra statistics or controls available to
> > > the driver that allow finer-grain control.
> > > 
> > > Manual control allows userspace to get creative and requires the ability to set
> > > the quantization parameter (QP) directly for each frame (controls already exist,
> > > as many stateful encoders also support it).
> > > 
> > > # Regions of Interest
> > > 
> > > Regions of interest (ROIs) allow specifying sub-regions of the frame that should
> > > be prioritized for quality. Stateless encoders typically support a limited
> > > number and allow setting specific QP values for these regions.
> > > 
> > > While the QP value should be used directly in manual rate-control, we probably
> > > want to have some "level of importance" setting for kernel-side rate-control,
> > > along with the dimensions/position of each ROI. This could be expressed with
> > > a new structure containing all these elements and presented as a variable-sized
> > > array control with as many elements as the hardware can support.
> > > 
> > > -- 
> > > Paul Kocialkowski, Bootlin
> > > Embedded Linux and kernel engineering
> > > https://bootlin.com
> > 
> > 
> > 
> 

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-11 20:08     ` Paul Kocialkowski
@ 2023-08-21 15:13       ` Nicolas Dufresne
  2023-08-22  8:30         ` Hsia-Jun Li
  2023-08-23  8:05         ` Paul Kocialkowski
  0 siblings, 2 replies; 29+ messages in thread
From: Nicolas Dufresne @ 2023-08-21 15:13 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Hello again,

I was away last week.

Le vendredi 11 août 2023 à 22:08 +0200, Paul Kocialkowski a écrit :
> Hi Nicolas,
> 
> On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
> > Le jeudi 10 août 2023 à 15:44 +0200, Paul Kocialkowski a écrit :
> > > Hi folks,
> > > 
> > > On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> > > > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > > > focusing on the V3/V3s), which already provides some usable bitstream and will
> > > > be published soon.
> > > 
> > > So I wanted to share an update on my side since I've been making progress on
> > > the H.264 encoding work for Allwinner platforms. At this point the code supports
> > > IDR, I and P frames, with a single reference. It also supports GOP (both closed
> > > and open with IDR or I frame interval and explicit keyframe request) but uses
> > > QP controls and does not yet provide rate control. I hope to be able to
> > > implement rate-control before we can make a first public release of the code.
> > 
> > Just a reminder that we will code review the API first, the supporting
> > implementation will just be companion. So in this context, the sooner the better
> > for an RFC here.
> 
> I definitely want to have some proposal that is (even vaguely) agreed upon
> before proposing patches for mainline, even at the stage of RFC.
> 
> While I already have working results at this point, the API that is used is
> very basic and just reuses controls from stateful encoders, with no extra
> addition. Various assumptions are made in the kernel and there is no real
> reference management, since the previous frame is always expected to be used
> as the only reference.

One thing we are looking at these days, and that isn't currently controllable in
the stateful interface, is RTP RPSI (reference picture selection indication).
This is feedback that a remote decoder sends when a reference picture has been
decoded. In short, even if only one reference is used, we'd like the reference
to change only when we have received the acknowledgement that the new one has
been reconstructed on the other side.
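[Editorial sketch] A minimal sketch of that policy, with made-up names and frame numbers standing in for pictures: keep encoding against the last acknowledged reference and only switch once the remote decoder confirms a newer one.

```c
#include <assert.h>
#include <stdint.h>

/* RPSI-style reference selection: `pending` is the most recent frame we
 * marked as a reference, `acked` is the newest one the remote decoder has
 * confirmed. Encoding always references `acked`. */
struct rpsi_state {
	uint64_t acked;
	uint64_t pending;
};

static uint64_t rpsi_reference(const struct rpsi_state *s)
{
	return s->acked;
}

static void rpsi_on_ack(struct rpsi_state *s, uint64_t frame)
{
	if (frame == s->pending)
		s->acked = s->pending; /* safe to switch references now */
}
```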

I'm not super keen on having to modify the Linux kernel specially for this
feature, especially since similar APIs offer it at a lower level (VA, D3D12, and
probably future APIs).

> 
> We plan to make a public release at some point in the near future which shows
> these working results, but it will not be a base for our discussion here yet.
> 
> > > One of the main topics of concern now is how reference frames should be managed
> > > and how it should interact with kernel-side GOP management and rate control.
> > 
> > Maybe we need to have a discussion about kernel-side GOP management first?
> > While I think kernel-side rate control is unavoidable, I don't think stateless
> > encoders should have kernel-side GOP management.
> 
> I don't have strong opinions about this. The rationale for my proposal is that
> kernel-side rate control will be quite difficult to operate without knowledge
> of the period at which intra/inter frames are produced. Maybe there are known
> methods to handle this, but I have the impression that most rate control
> implementations use the GOP size as a parameter.
> 
> More generally I think an expectation behind rate control is to be able to
> decide at which time a specific frame type is produced. This is not possible if
> the decision is entirely up to userspace.

In television (and YouTube) streaming, the GOP size is just fixed, and you deal
with it. In fact, I have never seen the GOP or picture pattern being modified by
the rate control. In general, high-end rate controls will follow an HRD
specification. The rate controls will require information that represents
constraints; this is not limited to the rate. In H.264/HEVC, the level and
profile will play a role, but you could also add the VBV size and probably more.
I have never read the HRD specification completely.

In cable streaming notably, the RC job is to monitor the amount of bits over a
period of time (the window). This window is defined by the streaming hardware's
buffering capabilities. The best thing at this point is to start reading through
HRD specifications and open-source rate control implementations (notably x264).
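[Editorial sketch] The window monitoring described here is essentially a leaky-bucket (VBV) model; a minimal sketch under assumed units, with all sizes in bits and the drain expressed per frame interval.

```c
#include <assert.h>

/* Leaky-bucket VBV check: the bucket drains at the channel bitrate and
 * fills with each encoded frame; a frame that would overflow the buffer
 * violates the HRD constraint. */
struct vbv {
	long size;     /* buffer size in bits */
	long fullness; /* current occupancy in bits */
	long drain;    /* bits drained per frame interval */
};

/* Returns 1 if the frame fits the HRD constraint, 0 if it overflows. */
static int vbv_push_frame(struct vbv *vbv, long frame_bits)
{
	vbv->fullness -= vbv->drain;
	if (vbv->fullness < 0)
		vbv->fullness = 0;
	vbv->fullness += frame_bits;
	if (vbv->fullness > vbv->size) {
		vbv->fullness = vbv->size;
		return 0;
	}
	return 1;
}
```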

I think overall we can live with adding hints where needed, and if the GOP
information is an appropriate hint, then we can just reuse the existing control.

> 
> > > Leaving GOP management to the kernel-side implies having it decide which frame
> > > should be IDR, I or P (and B for encoders that can support it), while keeping
> > > the possibility to request a keyframe (IDR) and configure GOP size. Now it seems
> > > to me that this is already a good balance between giving userspace a decent
> > > level of control while not having to specify the frame type explicitly for each
> > > frame or maintain a GOP in userspace.
> > 
> > My expectation for stateless encoder is to have to specify the frame type and
> > the associate references if the type requires it.

Ack. For us, this is also why we would require requests (unlike the stateful
encoder), as we have per-frame information to carry, and requests explicitly
attach that information to the frame.
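As a sketch of what attaching per-frame information via a request looks like at the uAPI level, the helper below fills simplified stand-ins for the real v4l2_ext_controls and v4l2_buffer structures. The trimmed struct definitions are illustrative only; the two constants match the current kernel uAPI values:

```c
#include <stdint.h>

/* Simplified stand-ins for structures from <linux/videodev2.h>;
 * field names follow the kernel uAPI but the structs are trimmed. */
#define V4L2_CTRL_WHICH_REQUEST_VAL	0x0f010000
#define V4L2_BUF_FLAG_REQUEST_FD	0x00800000

struct v4l2_ext_controls_min {
	uint32_t which;
	uint32_t count;
	int32_t request_fd;
};

struct v4l2_buffer_min {
	uint32_t index;
	uint32_t flags;
	int32_t request_fd;
};

/* Associate a control set and a queued buffer with one request fd, the
 * way VIDIOC_S_EXT_CTRLS and VIDIOC_QBUF expect it to be done before
 * MEDIA_REQUEST_IOC_QUEUE is issued on req_fd. */
static void attach_to_request(struct v4l2_ext_controls_min *ctrls,
			      struct v4l2_buffer_min *buf, int req_fd)
{
	ctrls->which = V4L2_CTRL_WHICH_REQUEST_VAL;
	ctrls->request_fd = req_fd;
	buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
	buf->request_fd = req_fd;
}
```

The surrounding flow would be: allocate the request with MEDIA_IOC_REQUEST_ALLOC on the media device, set the per-frame controls and queue the buffer with the fields filled as above, then queue the request with MEDIA_REQUEST_IOC_QUEUE.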

> > 
> > > 
> > > Requesting the frame type explicitly seems more fragile as many situations will
> > > be invalid (e.g. requesting a P frame at the beginning of the stream, etc) and
> > > it generally requires userspace to know a lot about what the codec assumptions
> > > are. Also for B frames the decision would need to be consistent with the fact
> > > that a following frame (in display order) would need to be submitted earlier
> > > than the current frame and inform the kernel so that the picture order count
> > > (display order indication) can be maintained. This is not impossible or out of
> > > reach, but it brings a lot of complexity for little advantage.
> > 
> > We have had a lot more consistent results over the last decade with stateless
> > hardware codecs, in contrast to stateful ones where we end up with wide
> > variation in behaviour. This applies to Chromium, GStreamer and any active
> > users of VA encoders really. I'm strongly in favour of a stateless reference
> > API, kept out of the Linux kernel.
> 
> Okay I understand the lower level of control make it possible to get much better
> results than opaque firmware-driven encoders and it would be a shame to not
> leverage this possibility with an API that is too restrictive.
> 
> However I do think it should be possible to operate the encoder without a lot
> of codec-specific supporting code from userspace. This is also why I like having
> kernel-side rate control (among other reasons).

Ack. We need a compromise here.


[...]

> 
> > > The next topic of interest is reference management. It seems pretty clear that
> > > the decision of whether a frame should be a reference or not always needs to be
> > > taken when encoding that frame. In H.264 the nal_ref_idc slice header element
> > > indicates whether a frame is marked as reference or not. IDR frames can
> > > additionally be marked as long-term reference (if I understood correctly, the
> > > frame will stay in the reference picture list until the next IDR frame).
> > 
> > This is incorrect. Any frame can be marked as a long-term reference; it does
> > not matter what type it is. From what I recall, marking of long-term
> > references in the bitstream uses an explicit index, so there are no specific
> > rules on which one gets evicted. Long-term references are of course limited,
> > as they occupy space in the DPB. Also, each codec has different DPB
> > semantics. For H.264, the DPB can run in two modes. The first is a simple
> > FIFO: any frame you encode and want to keep as a reference is pushed into the
> > DPB (which has a fixed size, minus the long-term references). If full, the
> > oldest frame is removed. It is not bound to IDR or GOP, though an IDR will
> > implicitly cause the decoder to evict everything (including long-term
> > references).
> > 
> > The second mode uses the memory management commands. This is a series of
> > instructions that the encoder can send to the decoder. The specification is
> > quite complex, it is a common source of bugs in decoders, and an area where
> > stateless hardware codecs generally perform more consistently. Through the
> > commands, the encoder ensures that the decoder's DPB representation stays in
> > sync.
> 
> This is also what I understand from repeated reading of the spec and thanks for
> the summary write-up!
> 
> My assumption was that it would be preferable to operate in the simple fifo
> mode since the memory management commands need to be added to the bitstream
> headers and require coordination from the kernel. Like you said it seems complex
> and error-prone.
> 
> But maybe this mechanism could be used to allow any particular reference frame
> configuration, opening the way for userspace to fully decide what the reference
> buffer lists are? Also it would be good to know if such mechanisms are generally
> present in codecs or if most of them have an implicit reference list that cannot
> be modified.

Of course, the subject is much more relevant when there are encoders with more
than one reference. But you are correct: what the commands do is allow changing,
adding or removing any reference from the list (random modification), as long as
the result fits within the codec constraints (like the DPB size, notably). This
is the only way one can implement temporal SVC reference patterns, robust
reference trees or RTP RPSI. Note that long-term references also exist, and are
less complex than these commands.
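The simple FIFO (sliding-window) DPB mode described earlier can be sketched as follows; the fixed size and abstract frame identifiers are illustrative, and long-term references are omitted for brevity:

```c
#include <stddef.h>

/* Sketch of the "simple FIFO" H.264 DPB behaviour (sliding-window
 * marking). Frame identifiers are abstract; long-term handling omitted. */
#define DPB_SIZE 4

struct dpb {
	int frames[DPB_SIZE];	/* frame_num of short-term references */
	size_t count;
};

/* Push a new short-term reference; evict the oldest when full. */
static void dpb_push(struct dpb *d, int frame_num)
{
	if (d->count == DPB_SIZE) {
		for (size_t i = 1; i < DPB_SIZE; i++)
			d->frames[i - 1] = d->frames[i];
		d->count--;
	}
	d->frames[d->count++] = frame_num;
}

/* An IDR implicitly flushes everything. */
static void dpb_idr_flush(struct dpb *d)
{
	d->count = 0;
}
```

The memory management commands discussed above exist precisely because this implicit eviction order is sometimes not what the encoder wants.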

I think this raises a big question, and I never checked how this works with,
let's say, VA. Shall we let the driver resolve the changes into commands? (VP8
has something similar, while VP9 and AV1 use refresh flags, which are trivial
to compute.) I believe I'll have to investigate this further.

> > 
[...]

> > > Addition information gathered:
> > > - It seems likely that the Allwinner Video Engine only supports one reference
> > >   frame. There's a register for specifying the rec buffer of a second one but
> > >   I have never seen the proprietary blob use it. It might be as easy as
> > >   specifying a non-zero address there but it might also be ignored or require
> > >   some undocumented bit to use more than one reference. I haven't made any
> > >   attempt at using it yet.
> > 
> > There is something in that fact that makes me think of the Hantro H1. The
> > Hantro H1 also has a second reference, but no one ever uses it. It is on our
> > todo list to actually give this a look.
> 
> Having looked at both register layouts, I would tend to think both designs
> are distinct. It's still unclear where Allwinner's video engine comes from:
> perhaps they made it in-house, perhaps some obscure Chinese design house made it
> for them or it could be known hardware with a modified register layout.

Ack,
> 
> I would also be interested to know if the H1 can do more than one reference!

From what we have in our pretty thin documentation, references are being
"searched" for fuzzy matches and motion. So when you pass two references to the
encoder, the encoder will search equally in both. I suspect it does a lot
more than that, and saves some information in the auxiliary buffers that exist
per reference, but this isn't documented and I'm not specialized enough really.

From a usage perspective, all you have to do is give it access to the reference
picture data (reconstructed image and auxiliary data). The result is compressed
macroblock data that may refer to these. We don't really know if it is used, but
we do assume it is and place it in the reference list. This is of course the
normal thing to do, especially when using a reference FIFO.

In theory, you could implement multiple references with hardware that only
supports one. A technique could be to compress the image multiple times and
keep the "best" one for the current configuration. Though, a proper multi-pass
encoder would avoid the bandwidth overhead of compressing and writing the
temporary results.
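The brute-force multi-pass idea could look like this; cost here is simply the coded size, where a real encoder would use a rate-distortion cost (the helper name is hypothetical):

```c
#include <stddef.h>

/* Encode the frame once per candidate reference and keep the cheapest
 * result. "Cost" is just the coded size here; a real encoder would use
 * a rate-distortion cost. Hypothetical helper, sketch only. */
static size_t pick_best_pass(const size_t *coded_sizes, size_t n_passes)
{
	size_t best = 0;

	for (size_t i = 1; i < n_passes; i++)
		if (coded_sizes[i] < coded_sizes[best])
			best = i;
	return best;
}
```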

> 
> > > - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
> > >   do not support VP8 encode (despite Allwinner's proprietary blob having an
> > >   API for it). The only platform that advertises it is the A80 and this might
> > >   actually be a VP8-only Hantro H1. It seems that the API they developed in the
> > >   library stuck around even if no other platform can use it.
> > 
> > Thanks for letting us know. Our assumption is that a second hardware design is
> > unlikely, as Google was giving it away for free to any hardware maker that wanted it.
> > 
> > > 
> > > Sorry for the long email again, I'm trying to be a bit more explanatory than
> > > just giving some bare conclusions that I drew on my own.
> > > 
> > > What do you think about these ideas?
> > 
> > In general, we diverge on the direction we want the interface to take. What
> > you seem to describe now is just a normal stateful encoder interface, with
> > everything needed to drive the stateless hardware implemented in the Linux
> > kernel. There is no parsing or other unsafety in encoders, so I don't have a
> > strict no-go argument for that, but for me, it means much more complex drivers
> > and less flexibility. The VA model has been working great for us in the past,
> > giving us the ability to implement new features, or even slightly off-spec
> > features, while the Linux kernel might not be the right place for these
> > experimental methods.
> 
> VA seems too low-level for our case here, as it seems to expect full control
> over more or less each bitstream parameter that will be produced.
> 
> I think we have to find some middle-ground that is not as limiting as stateful
> encoders but not as low-level as VA.
> 
> > Personally, I would rather discuss around your uAPI RFC though, I think a lot of
> > other devs here would like to see what you have drafted.
> 
> Hehe I wish I had some advanced proposal here but my implementation is quite
> simplified compared to what we have to plan for mainline.

No worries, let's do that later then. On our side, we have similar limitations,
since we need to have something working before we can spend more time turning
it into something upstreamable. So we have "something" for VP8, we'll do
"something" for H.264, and from there we should be able to iterate. But having
the opportunity to iterate on more capable hardware would clearly help
understand the bigger picture.

cheers,
Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-21 15:13       ` Nicolas Dufresne
@ 2023-08-22  8:30         ` Hsia-Jun Li
  2023-08-22 20:31           ` Nicolas Dufresne
  2023-08-23  8:05         ` Paul Kocialkowski
  1 sibling, 1 reply; 29+ messages in thread
From: Hsia-Jun Li @ 2023-08-22  8:30 UTC (permalink / raw)
  To: Nicolas Dufresne, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni



On 8/21/23 23:13, Nicolas Dufresne wrote:
> 
> 
> Hello again,
> 
> I've been away last week.
> 
> Le vendredi 11 août 2023 à 22:08 +0200, Paul Kocialkowski a écrit :
>> Hi Nicolas,
>>
>> On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
>>> Le jeudi 10 août 2023 à 15:44 +0200, Paul Kocialkowski a écrit :
>>>> Hi folks,
>>>>
>>>> On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
>>>>> I am now working on a H.264 encoder driver for Allwinner platforms (currently
>>>>> focusing on the V3/V3s), which already provides some usable bitstream and will
>>>>> be published soon.
>>>>
>>>> So I wanted to share an update on my side since I've been making progress on
>>>> the H.264 encoding work for Allwinner platforms. At this point the code supports
>>>> IDR, I and P frames, with a single reference. It also supports GOP (both closed
>>>> and open with IDR or I frame interval and explicit keyframe request) but uses
>>>> QP controls and does not yet provide rate control. I hope to be able to
>>>> implement rate-control before we can make a first public release of the code.
>>>
>>> Just a reminder that we will code review the API first, the supporting
>>> implementation will just be companion. So in this context, the sooner the better
>>> for an RFC here.
>>
>> I definitely want to have some proposal that is (even vaguely) agreed upon
>> before proposing patches for mainline, even at the stage of RFC.
>>
>> While I already have working results at this point, the API that is used is
>> very basic and just reuses controls from stateful encoders, with no extra
>> addition. Various assumptions are made in the kernel and there is no real
>> reference management, since the previous frame is always expected to be used
>> as the only reference.
> 
> One thing we are looking at these days, which isn't currently controllable in
> the stateful interface, is RTP RPSI (reference picture selection indication).
> This is feedback that a remote decoder sends when a reference picture has been
> decoded. In short, even if only one reference is used, we'd like the reference
> to change only once we have received the acknowledgement that the new one has
> been reconstructed on the other side.
> 
> I'm not super keen on having to modify the Linux kernel specially for this
> feature, especially since similar APIs offer it at a lower level (VA, D3D12,
> and probably future APIs).
> 
>>
>> We plan to make a public release at some point in the near future which shows
>> these working results, but it will not be a base for our discussion here yet.
>>
>>>> One of the main topics of concern now is how reference frames should be managed
>>>> and how it should interact with kernel-side GOP management and rate control.
>>>
>>> Maybe we need to have a discussion about kernel side GOP management first ?
>>> While I think kernel side rate control is un-avoidable, I don't think stateless
>>> encoder should have kernel side GOP management.
>>
>> I don't have strong opinions about this. The rationale for my proposal is that
>> kernel-side rate control will be quite difficult to operate without knowledge
>> of the period at which intra/inter frames are produced. Maybe there are known
>> methods to handle this, but I have the impression that most rate control
>> implementations use the GOP size as a parameter.
>>
>> More generally I think an expectation behind rate control is to be able to
>> decide at which time a specific frame type is produced. This is not possible if
>> the decision is entirely up to userspace.
> 
> In Television (and Youtube) streaming, the GOP size is just fixed, and you deal
> with it. In fact, I never seen GOP or picture pattern being modified by the rate
> control. In general, the high end rate controls will follow an HRD
> specification. The rate controls will require information that represent
> constraints, this is not limited to the rate. In H.264/HEVC, the level and
> profile will play a role. But you could also add the VBV size and probably more.
> I have never read the HRD specification completely.
> 
> In cable streaming notably, the RC job is to monitor the about of bits over a
> period of time (the window). This window is defined by the streaming hardware
> buffering capabilities. Best at this point is to start reading through HRD
> specifications, and open source rate control implementation (notably x264).
> 
> I think overall, we can live with adding hints were needed, and if the gop
> information is appropriate hint, then we can just reuse the existing control.
> 
Why do we still care about GOP here? The hardware has no idea about GOP at
all. Although in codecs like HEVC the NALU headers of IDR and intra pictures
differ, there is no difference in the hardware coding configuration; the NALU
header is usually generated by userspace.

Whether a future encode should regard the current encoded picture as an IDR
is completely decided by userspace.
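For example, the single-byte H.264 NAL unit header that userspace would prepend to hardware-produced slice data is straightforward to build from the H.264 syntax (forbidden_zero_bit, nal_ref_idc, nal_unit_type):

```c
#include <stdint.h>

/* H.264 NAL unit header (ITU-T H.264, 7.3.1), one byte:
 * forbidden_zero_bit (1) | nal_ref_idc (2) | nal_unit_type (5). */
#define NAL_TYPE_NON_IDR_SLICE	1
#define NAL_TYPE_IDR_SLICE	5

static uint8_t nal_header(uint8_t nal_ref_idc, uint8_t nal_unit_type)
{
	return (uint8_t)(((nal_ref_idc & 0x3) << 5) | (nal_unit_type & 0x1f));
}
```

An IDR slice used as a reference would get the familiar 0x65 byte, a non-IDR reference slice 0x41.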
>>
>>>> Leaving GOP management to the kernel-side implies having it decide which frame
>>>> should be IDR, I or P (and B for encoders that can support it), while keeping
>>>> the possibility to request a keyframe (IDR) and configure GOP size. Now it seems
>>>> to me that this is already a good balance between giving userspace a decent
>>>> level of control while not having to specify the frame type explicitly for each
>>>> frame or maintain a GOP in userspace.
>>>
>>> My expectation for stateless encoder is to have to specify the frame type and
>>> the associate references if the type requires it.
> 
> Ack. For us, this is also why we would require requests (unlike statful
> encoder), as we have per frame information to carry, and requests explicitly
> attach the information to the frame.
> 
>>>
>>>>
>>>> Requesting the frame type explicitly seems more fragile as many situations will
>>>> be invalid (e.g. requesting a P frame at the beginning of the stream, etc) and
>>>> it generally requires userspace to know a lot about what the codec assumptions
>>>> are. Also for B frames the decision would need to be consistent with the fact
>>>> that a following frame (in display order) would need to be submitted earlier
>>>> than the current frame and inform the kernel so that the picture order count
>>>> (display order indication) can be maintained. This is not impossible or out of
>>>> reach, but it brings a lot of complexity for little advantage.
>>>
>>> We have had a lot more consistent results over the last decade with stateless
>>> hardware codecs in contrast to stateful where we endup with wide variation in
>>> behaviour. This applies to Chromium, GStreamer and any active users of VA
>>> encoders really. I'm strongly in favour for stateless reference API out of the
>>> Linux kernel.
>>
>> Okay I understand the lower level of control make it possible to get much better
>> results than opaque firmware-driven encoders and it would be a shame to not
>> leverage this possibility with an API that is too restrictive.
>>
>> However I do think it should be possible to operate the encoder without a lot
>> of codec-specific supporting code from userspace. This is also why I like having
>> kernel-side rate control (among other reasons).
> 
> Ack. We need a compromise here.
> 
> 
> [...]
> 
>>
>>>> The next topic of interest is reference management. It seems pretty clear that
>>>> the decision of whether a frame should be a reference or not always needs to be
>>>> taken when encoding that frame. In H.264 the nal_ref_idc slice header element
>>>> indicates whether a frame is marked as reference or not. IDR frames can
>>>> additionally be marked as long-term reference (if I understood correctly, the
>>>> frame will stay in the reference picture list until the next IDR frame).
>>>
>>> This is incorrect. Any frames can be marked as long term reference, it does not
>>> matter what type they are. From what I recall, marking of the long term in the
>>> bitstream is using a explicit IDX, so there is no specific rules on which one
>>> get evicted. Long term of course are limited as they occupy space in the DPB.
>>> Also, Each CODEC have different DPB semantic. For H.264, the DPB can run in two
>>> modes. The first is a simple fifo, in this case, any frame you encode and want
>>> to keep as reference is pushed into the DPB (which has a fixed size minus the
>>> long term). If full, the oldest frame is removed. It is not bound to IDR or GOP.
>>> Though, an IDR will implicitly cause the decoder to evict everything (including
>>> long term).
>>>
>>> The second mode uses the memory management commands. This is a series if
>>> instruction that the encoder can send to the decoder. The specification is quite
>>> complex, it is a common source of bugs in decoders and a place were stateless
>>> hardware codecs performs more consistently in general. Through the commands, the
>>> encoder ensure that the decoder dpb representation stay on sync.
>>
>> This is also what I understand from repeated reading of the spec and thanks for
>> the summary write-up!
>>
>> My assumption was that it would be preferable to operate in the simple fifo
>> mode since the memory management commands need to be added to the bitstream
>> headers and require coordination from the kernel. Like you said it seems complex
>> and error-prone.
>>
>> But maybe this mechanism could be used to allow any particular reference frame
>> configuration, opening the way for userspace to fully decide what the reference
>> buffer lists are? Also it would be good to know if such mechanisms are generally
>> present in codecs or if most of them have an implicit reference list that cannot
>> be modified.
> 
> Of course, the subject is much more relevant when there is encoders with more
> then 1 reference. But you are correct, what the commands do, is allow to change,
> add or remove any reference from the list (random modification), as long as they
> fit in the codec contraints (like the DPB size notably). This is the only way
> one can implement temporal SVC reference pattern, robust reference trees or RTP
> RPSI. Note that long term reference also exists, and are less complex then these
> commands.
> 

If userspace could manage the lifetime of reconstruction
buffers (assignment, referencing), we wouldn't need a command here.

It is just a matter of designing another request API control
structure to select which buffers would be used for list0 and list1.
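One possible shape for such a control, purely hypothetical and only meant to anchor the discussion: userspace identifies the reconstruction buffers by timestamp (as stateless decoders already do) for list0/list1, and the driver validates the counts against the hardware limits:

```c
#include <stdint.h>

/* Hypothetical per-request control payload: userspace names the
 * reconstruction buffers (by capture-queue timestamp, the convention
 * stateless decoders already use) making up list0/list1 for this frame. */
#define ENC_MAX_REFS 2

struct enc_reference_lists {
	uint64_t list0_ts[ENC_MAX_REFS];
	uint64_t list1_ts[ENC_MAX_REFS];
	uint8_t num_list0;
	uint8_t num_list1;
};

/* Driver-side validation: reject lists exceeding what the hardware reports. */
static int enc_refs_valid(const struct enc_reference_lists *r,
			  unsigned int hw_max_refs)
{
	return r->num_list0 <= hw_max_refs && r->num_list1 <= hw_max_refs &&
	       r->num_list0 <= ENC_MAX_REFS && r->num_list1 <= ENC_MAX_REFS;
}
```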
> I this raises a big question, and I never checked how this worked with let's say
> VA. Shall we let the driver resolve the changes into commands (VP8 have
> something similar, while VP9 and AV1 are refresh flags, which are just trivial
> to compute). I believe I'll have to investigate this further.
> 
>>>
> [...]
> 
>>>> Addition information gathered:
>>>> - It seems likely that the Allwinner Video Engine only supports one reference
>>>>    frame. There's a register for specifying the rec buffer of a second one but
>>>>    I have never seen the proprietary blob use it. It might be as easy as
>>>>    specifying a non-zero address there but it might also be ignored or require
>>>>    some undocumented bit to use more than one reference. I haven't made any
>>>>    attempt at using it yet.
>>>
>>> There is something in that fact that makes me think of Hantro H1. Hantro H1 also
>>> have a second reference, but non one ever use it. We have on our todo to
>>> actually give this a look.
>>
>> Having looked at both register layouts, I would tend to think both designs
>> are distinct. It's still unclear where Allwinner's video engine comes from:
>> perhaps they made it in-house, perhaps some obscure Chinese design house made it
>> for them or it could be known hardware with a modified register layout.
> 
> Ack,
>>
>> I would also be interested to know if the H1 can do more than one reference!
> 
>  From what we have in our pretty thin documentation, references are being
> "searched" for fuzzy match and motion. So when you pass 2 references to the
> encoder, then the encoder will search equally in both. I suspect it does a lot
> more then that, and saves some information in the auxiliary buffers that exist
> per reference, but this isn't documented and I'm not specialized enough really.
> 
>  From usage perspective, all you have to do is give it access to the references
> picture data (reconstructed image and auxiliary data). The result is compressed
> macroblock data that may refer to these. We don't really know if it is used, but
> we do assume it is and place it in the reference list. This is of course normal
> thing to do, specially when using a reference fifo.
> 
> In theory, you could implement multiple reference with a HW that only supports
> 1. A technique could be to compress the image multiple time, and keep the "best"
> one for the current configuration. Though, a proper multi-pass encoder would
> avoid the bandwidth overhead of compressing and writing the temporary result.
> 
>>
>>>> - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
>>>>    do not support VP8 encode (despite Allwinner's proprietary blob having an
>>>>    API for it). The only platform that advertises it is the A80 and this might
>>>>    actually be a VP8-only Hantro H1. It seems that the API they developed in the
>>>>    library stuck around even if no other platform can use it.
>>>
>>> Thanks for letting us know. Our assumption is that a second hardware design is
>>> unlikely as Google was giving it for free to any hardware makers that wanted it.
>>>
>>>>
>>>> Sorry for the long email again, I'm trying to be a bit more explanatory than
>>>> just giving some bare conclusions that I drew on my own.
>>>>
>>>> What do you think about these ideas?
>>>
>>> In general, we diverge on the direction we want the interface to be. What you
>>> seem to describe now is just a normal stateful encoder interface with everything
>>> needed to drive the stateless hardware implemented in the Linux kernel. There is
>>> no parsing or other unsafety in encoders, so I don't have a strict no-go
>>> argument for that, but for me, it means much more complex drivers and lesser
>>> flexibility. The VA model have been working great for us in the past, giving us
>>> the ability to implement new feature, or even slightly of spec features. While,
>>> the Linux kernel might not be the right place for these experimental methods.
>>
>> VA seems too low-level for our case here, as it seems to expect full control
>> over more or less each bitstream parameter that will be produced.
>>
>> I think we have to find some middle-ground that is not as limiting as stateful
>> encoders but not as low-level as VA.
>>
>>> Personally, I would rather discuss around your uAPI RFC though, I think a lot of
>>> other devs here would like to see what you have drafted.
>>
>> Hehe I wish I had some advanced proposal here but my implementation is quite
>> simplified compared to what we have to plan for mainline.
> 
> No worries, let's do that later then. On our side, we have similar limitation,
> since we have to have something working before we can spend more time in turning
> it into something upstream. So we have "something" for VP8, we'll do "something"
> for H.264, from there we should be able to iterate. But having the opportunity
> to iterate over a more capable hardware would clearly help understand the bigger
> picture.
> 
> cheers,
> Nicolas

-- 
Hsia-Jun(Randy) Li


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-22  8:30         ` Hsia-Jun Li
@ 2023-08-22 20:31           ` Nicolas Dufresne
  2023-08-23  3:04             ` Hsia-Jun Li
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-08-22 20:31 UTC (permalink / raw)
  To: Hsia-Jun Li, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Hi,
> 

[...]

> > In cable streaming notably, the RC job is to monitor the about of bits over a
> > period of time (the window). This window is defined by the streaming hardware
> > buffering capabilities. Best at this point is to start reading through HRD
> > specifications, and open source rate control implementation (notably x264).
> > 
> > I think overall, we can live with adding hints were needed, and if the gop
> > information is appropriate hint, then we can just reuse the existing control.
> > 
> Why we still care about GOP here. Hardware have no idea about GOP at 
> all. Although in codec likes HEVC, IDR and intra pictures's nalu header 
> is different, there is not different in the hardware coding 
> configration. NALU header is generated by the userspace usually.
> 
> While future encoding would regard the current encoded picture as an IDR 
> is completed decided by the userspace.

The discussion was around having a basic RC algorithm in the kernel driver,
possibly making use of hardware-specific features without actually exposing them
all to userspace. So assuming we do that:

Paul's concern is that, for best results, an RC algorithm could use knowledge of
keyframe placement to preserve bucket space (possibly using the last keyframe
size as a hint). Exposing the GOP structure in some form allows "prediction", so
the adaptation can look ahead at the future budget without introducing latency.
There is an alternative, which is to require ahead-of-time queuing of encode
requests. But this does introduce latency, since the way it works in V4L2 today,
the picture needs to be filled by the time we request an encode.

Though, if we drop the GOP structure and favour this approach, the latency could
be regained later by introducing fence-based streaming. The technique would be
for a video source (like a capture driver) to pass dmabufs that aren't filled
yet, but have a companion fence. This would allow queuing requests ahead of
time, and all we need is enough pre-allocation to accommodate the desired
lookahead. The only issue is that this perhaps violates the fundamental
expectation of "short-term" delivery of fences. But fences can also fail, I
think, in case the capture was stopped.

We can certainly move forward with this as a future solution, or just not
implement lookahead-aware RC algorithms for now, to avoid the huge task this
involves (and possibly patents?)

[...]
> > 

> > Of course, the subject is much more relevant when there is encoders with more
> > then 1 reference. But you are correct, what the commands do, is allow to change,
> > add or remove any reference from the list (random modification), as long as they
> > fit in the codec contraints (like the DPB size notably). This is the only way
> > one can implement temporal SVC reference pattern, robust reference trees or RTP
> > RPSI. Note that long term reference also exists, and are less complex then these
> > commands.
> > 
> 
> If we the userspace could manage the lifetime of reconstruction 
> buffers(assignment, reference), we don't need a command here.

Sorry if I created confusion, my comments were about something specific to H.264
coding: the commands are a compressed form for the reference lists. This
information is coded in the slice header and enabled through
adaptive_ref_pic_marking_mode_flag.

It was suggested so far to leave H.264 slice header writing to the driver. This
is motivated by the H.264 slice header not being byte-aligned in size, so the
slice_data() is hard to combine with it. Also, some hardware actually produces
the slice header. This needs actual hardware interface analysis, because an
H.264 slice header is worth nothing if it cannot instruct the decoder how to
maintain the desired reference state.

I think this aspect should probably not be generalized to all codecs, since the
packing semantics can differ largely. When the codec header is indeed
byte-aligned, it can easily be separated out and combined by the application,
improving the application's flexibility and reducing the kernel API complexity.
> 
> It is just a problem of how to design another request API control 
> structure to select which buffers would be used for list0, list1.
> > I this raises a big question, and I never checked how this worked with let's say
> > VA. Shall we let the driver resolve the changes into commands (VP8 have
> > something similar, while VP9 and AV1 are refresh flags, which are just trivial
> > to compute). I believe I'll have to investigate this further.
> > 
> > > > 
> > [...]

regards,
Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-22 20:31           ` Nicolas Dufresne
@ 2023-08-23  3:04             ` Hsia-Jun Li
  2023-08-30 15:10               ` Nicolas Dufresne
  2023-08-30 15:18               ` Nicolas Dufresne
  0 siblings, 2 replies; 29+ messages in thread
From: Hsia-Jun Li @ 2023-08-23  3:04 UTC (permalink / raw)
  To: Nicolas Dufresne, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni



On 8/23/23 04:31, Nicolas Dufresne wrote:
> 
> 
> Hi,
>>
> 
> [...]
> 
>>> In cable streaming notably, the RC job is to monitor the amount of bits over a
>>> period of time (the window). This window is defined by the streaming hardware
>>> buffering capabilities. Best at this point is to start reading through HRD
>>> specifications, and open source rate control implementation (notably x264).
>>>
>>> I think overall, we can live with adding hints where needed, and if the GOP
>>> information is an appropriate hint, then we can just reuse the existing control.
>>>
>> Why do we still care about GOP here? Hardware has no idea about GOP at
>> all. Although in codecs like HEVC the NALU headers of IDR and intra
>> pictures differ, there is no difference in the hardware coding
>> configuration. The NALU header is usually generated by userspace.
>>
>> Whether future encoding regards the current encoded picture as an IDR
>> is completely decided by userspace.
> 
> The discussion was around having a basic RC algorithm in the kernel driver,
What I am wondering is: who would use a basic RC algorithm in the kernel?
We would be designing a toy algorithm that all hardware could use, while
introducing a complex structure for userspace to work with it.

Vendors would need to try to fit their model into an interface with
limited functionality.
> possibly making use of hardware specific features without actually exposing it
> all to userspace. So assuming we do that:
> 
> Paul's concern is that for best results, an RC algorithm could use knowledge of
> keyframe placement to preserve bucket space (possibly using the last keyframe
> size as a hint). Exposing the GOP structure in some form allows "prediction", so
> the adaptation can look ahead at the future budget without introducing latency. There
> is an alternative, which is to require ahead-of-time queuing of encode requests.
It sounds like fixed-bitrate RC. Then this RC algorithm would be in
charge of selecting the reference frames?

Suppose we are talking about the Hantro H1, which people here are familiar with.
An intra frame would usually cost the most hardware time to encode and
contribute a lot to the size of a GOP (at a fixed bitrate).

If we ignored inter frames, that would lead to bad image quality.
One case here is deciding whether to use a previous intra frame as
the reference or just the last frame.
Userspace should be able to decide when to request an intra frame or
re-encode the current inter frame as an intra frame.
> But this does introduce latency since, the way it works in V4L2 today, we need
> the picture to be filled by the time we request an encode.
> 
> Though, if we drop the GOP structure and favour this approach, the latency could
> be regained later by introducing fence-based streaming. The technique would be for
> a video source (like a capture driver) to pass dmabufs that aren't filled yet,
> but have a companion fence. This would allow queuing requests ahead of time, and
> all we need is enough pre-allocation to accommodate the desired look-ahead. Only
> issue is that perhaps this violates the fundamental "short term" delivery of
> fences. But fences can also fail I think, in case the capture was stopped.
> 
I don't think it would help. Fences are a DRM/GPU thing, where there is no queue.
Even with a fence, would the video sink tell us the motion delta here?
> We can certainly move forward with this as a future solution, or just not
> implement a future-aware RC algorithm, in order to avoid the huge task this
> involves (and possibly patents?)
> 
I think we should not restrict how userspace (the vendor) operates the 
hardware.
> [...]
>>>
> 
>>> Of course, the subject is much more relevant when there are encoders with more
>>> than one reference. But you are correct: what the commands do is allow changing,
>>> adding or removing any reference from the list (random modification), as long as they
>>> fit in the codec constraints (like the DPB size notably). This is the only way
>>> one can implement temporal SVC reference patterns, robust reference trees or RTP
>>> RPSI. Note that long-term references also exist, and are less complex than these
>>> commands.
>>>
>>
>> If userspace could manage the lifetime of reconstruction
>> buffers (assignment, referencing), we wouldn't need a command here.
> 
> Sorry if I created confusion, the comment was about something specific to H.264
> coding. It's a compressed form for the reference lists. This information is coded
> in the slice header and enabled through adaptive_ref_pic_marking_mode_flag.
> 
> So far it was suggested to leave H.264 slice header writing to the driver. This
> is motivated by the H.264 slice header not being byte-aligned in size, which makes the
H.264 and H.265 have byte alignment in the NALU. You don't need the
skip-bits feature that can be found in the H1.

> slice_data() hard to combine. Also, some hardware actually produces the
> slice_header(). This needs actual hardware interface analysis, because an H.264
> slice header is worth nothing if it cannot instruct the decoder how to maintain
> the desired reference state.
> 
I don't even think we should write the slice header into the CAPTURE 
buffer, which would cause a cache problem. Usually the slice header 
would only be written when the slice data is copied out.
It is much easier for a userspace wrapper to handle this.

> I think this aspect should probably not be generalized to all codecs, since the
> packing semantics can differ greatly. When the codec header is indeed byte
> aligned, it can easily be separated out and combined by the application, improving
> application flexibility and reducing the kernel API complexity.
>>
>> It is just a problem of how to design another request API control
>> structure to select which buffers would be used for list0, list1.
>>> I think this raises a big question, and I never checked how this worked with let's say
>>> VA. Shall we let the driver resolve the changes into commands (VP8 have
>>> something similar, while VP9 and AV1 are refresh flags, which are just trivial
>>> to compute). I believe I'll have to investigate this further.
>>>
>>>>>
>>> [...]
> 
> regards,
> Nicolas

-- 
Hsia-Jun(Randy) Li


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-21 15:13       ` Nicolas Dufresne
  2023-08-22  8:30         ` Hsia-Jun Li
@ 2023-08-23  8:05         ` Paul Kocialkowski
  2023-11-15 13:19           ` Paul Kocialkowski
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Kocialkowski @ 2023-08-23  8:05 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 17348 bytes --]

Hi Nicolas,

On Mon 21 Aug 23, 11:13, Nicolas Dufresne wrote:
> Hello again,
> 
> I've been away last week.
> 
> Le vendredi 11 août 2023 à 22:08 +0200, Paul Kocialkowski a écrit :
> > Hi Nicolas,
> > 
> > On Thu 10 Aug 23, 10:34, Nicolas Dufresne wrote:
> > > Le jeudi 10 août 2023 à 15:44 +0200, Paul Kocialkowski a écrit :
> > > > Hi folks,
> > > > 
> > > > On Tue 11 Jul 23, 19:12, Paul Kocialkowski wrote:
> > > > > I am now working on a H.264 encoder driver for Allwinner platforms (currently
> > > > > focusing on the V3/V3s), which already provides some usable bitstream and will
> > > > > be published soon.
> > > > 
> > > > So I wanted to share an update on my side since I've been making progress on
> > > > the H.264 encoding work for Allwinner platforms. At this point the code supports
> > > > IDR, I and P frames, with a single reference. It also supports GOP (both closed
> > > > and open with IDR or I frame interval and explicit keyframe request) but uses
> > > > QP controls and does not yet provide rate control. I hope to be able to
> > > > implement rate-control before we can make a first public release of the code.
> > > 
> > > Just a reminder that we will code-review the API first; the supporting
> > > implementation will just be a companion. So in this context, the sooner the better
> > > for an RFC here.
> > 
> > I definitely want to have some proposal that is (even vaguely) agreed upon
> > before proposing patches for mainline, even at the stage of RFC.
> > 
> > While I already have working results at this point, the API that is used is
> > very basic and just reuses controls from stateful encoders, with no extra
> > addition. Various assumptions are made in the kernel and there is no real
> > reference management, since the previous frame is always expected to be used
> > as the only reference.
> 
> One thing we are looking at these days, and that isn't currently controllable in the
> stateful interface, is RTP RPSI (reference picture selection indication). This is
> feedback that a remote decoder sends when a reference picture has been decoded.
> In short, even if only 1 reference is used, we'd like the reference to change
> only when we have received the acknowledgement that the new one has been
> reconstructed on the other side.
> 
> I'm not super keen on having to modify the Linux kernel specially for this
> feature. Especially since similar APIs offer it at a lower level (VA, D3D12, and
> probably future APIs).

Yeah I understand this is the kind of feature that the API should not prevent
implementing.

> > We plan to make a public release at some point in the near future which shows
> > these working results, but it will not be a base for our discussion here yet.
> > 
> > > > One of the main topics of concern now is how reference frames should be managed
> > > > and how it should interact with kernel-side GOP management and rate control.
> > > 
> > > Maybe we need to have a discussion about kernel-side GOP management first?
> > > While I think kernel-side rate control is unavoidable, I don't think stateless
> > > encoders should have kernel-side GOP management.
> > 
> > I don't have strong opinions about this. The rationale for my proposal is that
> > kernel-side rate control will be quite difficult to operate without knowledge
> > of the period at which intra/inter frames are produced. Maybe there are known
> > methods to handle this, but I have the impression that most rate control
> > implementations use the GOP size as a parameter.
> > 
> > More generally I think an expectation behind rate control is to be able to
> > decide at which time a specific frame type is produced. This is not possible if
> > the decision is entirely up to userspace.
> 
> In television (and YouTube) streaming, the GOP size is just fixed, and you deal
> with it. In fact, I have never seen the GOP or picture pattern being modified by the
> rate control.

Sure but my point is rather that rate control has to have some knowledge of what
the GOP size is and what frame type comes next. Not to say that it has to make
that decision, but I believe it has to be aware of it.

> In general, the high-end rate controls will follow an HRD
> specification. The rate controls will require information that represents
> constraints; this is not limited to the rate. In H.264/HEVC, the level and
> profile will play a role. But you could also add the VBV size and probably more.
> I have never read the HRD specification completely.

That is good to know, especially since I do not have much knowledge of
high-end rate control.

> In cable streaming notably, the RC job is to monitor the amount of bits over a
> period of time (the window). This window is defined by the streaming hardware
> buffering capabilities. Best at this point is to start reading through HRD
> specifications, and open source rate control implementation (notably x264).

Yes I will certainly take a look in those directions when starting the work on
rate control.
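For illustration, the bits-over-a-window idea can be sketched as a toy leaky-bucket (VBV-like) model. This is an assumption-heavy toy, nothing like a real HRD-conformant rate control such as x264's:

```python
# Toy leaky-bucket (VBV-like) model: bits drain at the target rate each frame
# period, every encoded frame deposits its size, and a naive controller nudges
# QP to keep the buffer fullness centered. Illustrative only; real HRD-based
# rate control is far more involved.
class LeakyBucket:
    def __init__(self, bitrate_bps, fps, buffer_bits):
        self.drain_per_frame = bitrate_bps / fps  # bits removed per frame period
        self.capacity = buffer_bits
        self.fullness = buffer_bits / 2           # start half full

    def push_frame(self, frame_bits):
        """Account for one encoded frame; return (fullness, violated)."""
        self.fullness += frame_bits - self.drain_per_frame
        violated = self.fullness > self.capacity or self.fullness < 0
        self.fullness = min(max(self.fullness, 0), self.capacity)
        return self.fullness, violated

    def qp_adjust(self):
        """Naive QP nudge: raise QP when too full, lower it when too empty."""
        ratio = self.fullness / self.capacity
        if ratio > 0.75:
            return 2
        if ratio < 0.25:
            return -2
        return 0
```

A large intra frame that overshoots the window shows up immediately as a constraint violation, which is why the RC wants to know when one is coming.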

> I think overall, we can live with adding hints where needed, and if the GOP
> information is an appropriate hint, then we can just reuse the existing control.

I think it would be necessary for the kernel to have the GOP size hint to do
rate-control, but user-side could still decide on frame types that do not match
the hint and mess with rate-control. Otherwise, it's not very different from
having the kernel decide on the frame type itself.

I guess I'm still not super convinced that it makes sense to have both
user-selectable frame type/references and kernel-side rate control.

Maybe one option would be to have two operating modes:
- "manual" mode, where user-side decides on the frame type, references and
  per-frame QP, without kernel-side rate control; This would be a purely
  stateless approach;
- "automatic" mode, where user-side decides on the GOP size, rate-control
  approach and kernel-side implements rate control to decide on the frame type,
  references and QP; This would be purely stateful.

The more I think about it, the more it feels like mixing the two into a unified
single approach would be messy and unclear. But maybe I'm wrong.
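To make the proposed split concrete, here is a purely hypothetical sketch of what each mode would need from userspace. None of these names are real or proposed V4L2 controls; the sketch only illustrates the division of responsibilities:

```python
# Purely hypothetical sketch of the two operating modes suggested above.
# None of these names exist in the V4L2 uAPI; this only makes the split of
# responsibilities concrete.
from dataclasses import dataclass
from enum import Enum

class FrameType(Enum):
    IDR = 0
    I = 1
    P = 2

@dataclass
class ManualFrameParams:
    """'Manual' mode: userspace decides everything, per frame (stateless)."""
    frame_type: FrameType
    qp: int
    references: list  # reconstruction buffers to use as references

@dataclass
class AutomaticConfig:
    """'Automatic' mode: global knobs only, kernel runs RC (stateful)."""
    gop_size: int
    bitrate_bps: int
    force_keyframe: bool = False  # one-shot trigger, like a keyframe request
```

In manual mode the per-frame structure would travel with each request; in automatic mode only the global configuration would change, occasionally.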

> > > > Leaving GOP management to the kernel-side implies having it decide which frame
> > > > should be IDR, I or P (and B for encoders that can support it), while keeping
> > > > the possibility to request a keyframe (IDR) and configure GOP size. Now it seems
> > > > to me that this is already a good balance between giving userspace a decent
> > > > level of control while not having to specify the frame type explicitly for each
> > > > frame or maintain a GOP in userspace.
> > > 
> > > My expectation for a stateless encoder is to have to specify the frame type and
> > > the associated references if the type requires them.
> 
> Ack. For us, this is also why we would require requests (unlike stateful
> encoders), as we have per-frame information to carry, and requests explicitly
> attach the information to the frame.
>
> > > 
> > > > 
> > > > Requesting the frame type explicitly seems more fragile as many situations will
> > > > be invalid (e.g. requesting a P frame at the beginning of the stream, etc) and
> > > > it generally requires userspace to know a lot about what the codec assumptions
> > > > are. Also for B frames the decision would need to be consistent with the fact
> > > > that a following frame (in display order) would need to be submitted earlier
> > > > than the current frame and inform the kernel so that the picture order count
> > > > (display order indication) can be maintained. This is not impossible or out of
> > > > reach, but it brings a lot of complexity for little advantage.
> > > 
> > > We have had a lot more consistent results over the last decade with stateless
> > > hardware codecs, in contrast to stateful ones where we end up with wide variations
> > > in behaviour. This applies to Chromium, GStreamer and any active users of VA
> > > encoders really. I'm strongly in favour of a stateless reference API out of the
> > > Linux kernel.
> > 
> > Okay I understand the lower level of control make it possible to get much better
> > results than opaque firmware-driven encoders and it would be a shame to not
> > leverage this possibility with an API that is too restrictive.
> > 
> > However I do think it should be possible to operate the encoder without a lot
> > of codec-specific supporting code from userspace. This is also why I like having
> > kernel-side rate control (among other reasons).
> 
> Ack. We need a compromise here.
> 
> 
> [...]
> 
> > 
> > > > The next topic of interest is reference management. It seems pretty clear that
> > > > the decision of whether a frame should be a reference or not always needs to be
> > > > taken when encoding that frame. In H.264 the nal_ref_idc slice header element
> > > > indicates whether a frame is marked as reference or not. IDR frames can
> > > > additionally be marked as long-term reference (if I understood correctly, the
> > > > frame will stay in the reference picture list until the next IDR frame).
> > > 
> > > This is incorrect. Any frame can be marked as a long-term reference; it does not
> > > matter what type it is. From what I recall, marking of long-term references in the
> > > bitstream uses an explicit index, so there are no specific rules on which one
> > > gets evicted. Long-term references are of course limited, as they occupy space in the DPB. 
> > > Also, each codec has different DPB semantics. For H.264, the DPB can run in two
> > > modes. The first is a simple FIFO: in this case, any frame you encode and want
> > > to keep as a reference is pushed into the DPB (which has a fixed size, minus the
> > > long-term references). If full, the oldest frame is removed. It is not bound to IDR or GOP.
> > > Though, an IDR will implicitly cause the decoder to evict everything (including
> > > long-term references).
> > > 
> > > The second mode uses the memory management commands. This is a series of
> > > instructions that the encoder can send to the decoder. The specification is quite
> > > complex, it is a common source of bugs in decoders, and a place where stateless
> > > hardware codecs perform more consistently in general. Through the commands, the
> > > encoder ensures that the decoder's DPB representation stays in sync.
> > 
> > This is also what I understand from repeated reading of the spec and thanks for
> > the summary write-up!
> > 
> > My assumption was that it would be preferable to operate in the simple fifo
> > mode since the memory management commands need to be added to the bitstream
> > headers and require coordination from the kernel. Like you said it seems complex
> > and error-prone.
> > 
> > But maybe this mechanism could be used to allow any particular reference frame
> > configuration, opening the way for userspace to fully decide what the reference
> > buffer lists are? Also it would be good to know if such mechanisms are generally
> > present in codecs or if most of them have an implicit reference list that cannot
> > be modified.
> 
> Of course, the subject is much more relevant when there are encoders with more
> than one reference. But you are correct: what the commands do is allow changing,
> adding or removing any reference from the list (random modification), as long as they
> fit in the codec constraints (like the DPB size notably). This is the only way
> one can implement temporal SVC reference patterns, robust reference trees or RTP
> RPSI. Note that long-term references also exist, and are less complex than these
> commands.
> 
> I think this raises a big question, and I never checked how this worked with let's say
> VA. Shall we let the driver resolve the changes into commands (VP8 have
> something similar, while VP9 and AV1 are refresh flags, which are just trivial
> to compute). I believe I'll have to investigate this further.

I kind of assumed it would be up to the kernel to do that translation, but maybe
it also makes sense to submit the commands directly from userspace?

It's not very clear to me what's best here.
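One way to picture the kernel-side translation, under the assumption that H.264's default sliding-window behaviour is the baseline: explicit memory-management commands are only needed when the reference list userspace requests diverges from what the FIFO would have produced anyway. A rough, illustrative sketch:

```python
# Illustrative sketch of driver-side resolution of reference-list changes.
# H.264's default DPB behaviour is a sliding window (FIFO), so explicit
# memory-management commands would only be needed when the requested list
# can't be reached by FIFO behaviour alone. Not from any real driver.
def sliding_window_update(dpb, new_ref, dpb_size):
    """Default H.264 behaviour: append the new reference, evict the oldest."""
    dpb = dpb + [new_ref]
    if len(dpb) > dpb_size:
        dpb = dpb[1:]  # oldest short-term reference falls out of the window
    return dpb

def needs_explicit_commands(requested_dpb, dpb, new_ref, dpb_size):
    """True when the requested list diverges from FIFO behaviour."""
    return requested_dpb != sliding_window_update(dpb, new_ref, dpb_size)
```

For VP8/VP9/AV1 the equivalent step is even simpler, since the refresh flags are a direct function of which slots the new frame replaces.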

> 
> > > 
> [...]
> 
> > > > Additional information gathered:
> > > > - It seems likely that the Allwinner Video Engine only supports one reference
> > > >   frame. There's a register for specifying the rec buffer of a second one but
> > > >   I have never seen the proprietary blob use it. It might be as easy as
> > > >   specifying a non-zero address there but it might also be ignored or require
> > > >   some undocumented bit to use more than one reference. I haven't made any
> > > >   attempt at using it yet.
> > > 
> > > There is something in that fact that makes me think of the Hantro H1. The Hantro H1
> > > also has a second reference, but no one ever uses it. It is on our todo list to
> > > actually give this a look.
> > 
> > Having looked at both register layouts, I would tend to think both designs
> > are distinct. It's still unclear where Allwinner's video engine comes from:
> > perhaps they made it in-house, perhaps some obscure Chinese design house made it
> > for them or it could be known hardware with a modified register layout.
> 
> Ack,
> > 
> > I would also be interested to know if the H1 can do more than one reference!
> 
> From what we have in our pretty thin documentation, references are "searched"
> for fuzzy matches and motion. So when you pass 2 references to the
> encoder, the encoder will search equally in both. I suspect it does a lot
> more than that, and saves some information in the auxiliary buffers that exist
> per reference, but this isn't documented and I'm not specialized enough really.
> 
> From a usage perspective, all you have to do is give it access to the reference
> picture data (reconstructed image and auxiliary data). The result is compressed
> macroblock data that may refer to these. We don't really know if it is used, but
> we do assume it is and place it in the reference list. This is of course the normal
> thing to do, especially when using a reference FIFO.
> 
> In theory, you could implement multiple references with hardware that only supports
> one. A technique could be to compress the image multiple times, and keep the "best"
> one for the current configuration. Though, a proper multi-pass encoder would
> avoid the bandwidth overhead of compressing and writing the temporary result.
>
> > 
> > > > - Contrary to what I said after Andrzej's talk at EOSS, most Allwinner platforms
> > > >   do not support VP8 encode (despite Allwinner's proprietary blob having an
> > > >   API for it). The only platform that advertises it is the A80 and this might
> > > >   actually be a VP8-only Hantro H1. It seems that the API they developed in the
> > > >   library stuck around even if no other platform can use it.
> > > 
> > > Thanks for letting us know. Our assumption is that a second hardware design is
> > > unlikely, as Google was giving it away for free to any hardware maker that wanted it.
> > > 
> > > > 
> > > > Sorry for the long email again, I'm trying to be a bit more explanatory than
> > > > just giving some bare conclusions that I drew on my own.
> > > > 
> > > > What do you think about these ideas?
> > > 
> > > In general, we diverge on the direction we want the interface to take. What you
> > > seem to describe now is just a normal stateful encoder interface, with everything
> > > needed to drive the stateless hardware implemented in the Linux kernel. There is
> > > no parsing or other unsafety in encoders, so I don't have a strict no-go
> > > argument for that, but for me, it means much more complex drivers and less
> > > flexibility. The VA model has been working great for us in the past, giving us
> > > the ability to implement new features, or even slightly off-spec features, while
> > > the Linux kernel might not be the right place for these experimental methods.
> > 
> > VA seems too low-level for our case here, as it seems to expect full control
> > over more or less each bitstream parameter that will be produced.
> > 
> > I think we have to find some middle-ground that is not as limiting as stateful
> > encoders but not as low-level as VA.
> > 
> > > Personally, I would rather discuss around your uAPI RFC though, I think a lot of
> > > other devs here would like to see what you have drafted.
> > 
> > Hehe I wish I had some advanced proposal here but my implementation is quite
> > simplified compared to what we have to plan for mainline.
> 
> No worries, let's do that later then. On our side, we have similar limitations,
> since we have to have something working before we can spend more time turning
> it into something upstream. So we have "something" for VP8, we'll do "something"
> for H.264, and from there we should be able to iterate. But having the opportunity
> to iterate over more capable hardware would clearly help understand the bigger
> picture.

Absolutely, it seems a bit difficult to care for cases we cannot really test
yet. Unfortunately I'm not aware of such hardware being around either.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-23  3:04             ` Hsia-Jun Li
@ 2023-08-30 15:10               ` Nicolas Dufresne
  2023-08-30 16:51                 ` Randy Li
  2023-08-30 15:18               ` Nicolas Dufresne
  1 sibling, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-08-30 15:10 UTC (permalink / raw)
  To: Hsia-Jun Li, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Le mercredi 23 août 2023 à 11:04 +0800, Hsia-Jun Li a écrit :
> > Though, if we drop the GOP structure and favour this approach, the latency could
> > be regained later by introducing fence-based streaming. The technique would be for
> > a video source (like a capture driver) to pass dmabufs that aren't filled yet,
> > but have a companion fence. This would allow queuing requests ahead of time, and
> > all we need is enough pre-allocation to accommodate the desired look-ahead. Only
> > issue is that perhaps this violates the fundamental "short term" delivery of
> > fences. But fences can also fail I think, in case the capture was stopped.
> > 
> I don't think it would help. Fences are a DRM/GPU thing, where there is no queue.
> Even with a fence, would the video sink tell us the motion delta here?

It helps with the latency, since the encoder can start its search and analysis as
soon as frames are available, instead of waiting until all N frames are available
(refer to the MIN_BUFFERS_FOR_* controls used when lookahead is needed).

> > We can certainly move forward with this as a future solution, or just not
> > implement a future-aware RC algorithm, in order to avoid the huge task this
> > involves (and possibly patents?)
> > 
> I think we should not restrict how userspace (the vendor) operates the 
> hardware.

Omitting is not restricting. Vendors have to learn to be community members and
propose/add the tools and APIs they need to support their features. We cannot
fix vendors in this regard; those who jump over that fence are winning.

Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-23  3:04             ` Hsia-Jun Li
  2023-08-30 15:10               ` Nicolas Dufresne
@ 2023-08-30 15:18               ` Nicolas Dufresne
  2023-08-31  9:32                 ` Hsia-Jun Li
  1 sibling, 1 reply; 29+ messages in thread
From: Nicolas Dufresne @ 2023-08-30 15:18 UTC (permalink / raw)
  To: Hsia-Jun Li, Paul Kocialkowski
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

Le mercredi 23 août 2023 à 11:04 +0800, Hsia-Jun Li a écrit :
> > So far it was suggested to leave H.264 slice header writing to the driver. This
> > is motivated by the H.264 slice header not being byte-aligned in size, which makes the
> H.264 and H.265 have byte alignment in the NALU. You don't need the
> skip-bits feature that can be found in the H1.

As you said so, I rechecked the H.264 grammar.

...
  slice_header( )
  slice_data( )
...

There are a lot of variable-size items in the slice_header() syntax and no padding
bits. And no padding at the start of any of the slice_data() types. So no, the
slice_header() syntax in H.264 is not byte aligned like you are claiming here.
It's important to be super accurate about these things, as errors will be made
otherwise. Please always double check.
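A quick way to convince oneself: much of slice_header() is ue(v) (Exp-Golomb) coded, and every ue(v) code has an odd bit length (1, 3, 5, ...), so a header built from such fields only ends on a byte boundary by coincidence. The field values below are arbitrary, for illustration only:

```python
# Bit-length arithmetic for unsigned Exp-Golomb (ue(v)) codes, as used in
# H.264 slice_header() fields. Illustrative; field values are arbitrary.
def ue_bits(value):
    """Bit length of the unsigned Exp-Golomb code for `value`."""
    code_len = (value + 1).bit_length()
    return 2 * code_len - 1  # leading zeros + '1' marker + info bits

def header_bit_length(field_values):
    """Total bits consumed by a sequence of ue(v)-coded fields."""
    return sum(ue_bits(v) for v in field_values)
```

Three small fields already give 9 bits, one past a byte boundary, so slice_data() would have to start mid-byte.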

Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-30 15:10               ` Nicolas Dufresne
@ 2023-08-30 16:51                 ` Randy Li
  0 siblings, 0 replies; 29+ messages in thread
From: Randy Li @ 2023-08-30 16:51 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, linux-media, Paul Kocialkowski, Hsia-Jun Li,
	Hans Verkuil, Sakari Ailus, Andrzej Pietrasiewicz,
	Michael Tretter, Jernej Škrabec, Chen-Yu Tsai,
	Samuel Holland, Thomas Petazzoni


On 2023/8/30 23:10, Nicolas Dufresne wrote:
>
> Le mercredi 23 août 2023 à 11:04 +0800, Hsia-Jun Li a écrit :
>>> Though, if we drop the GOP structure and favour this approach, the latency could
>>> be regained later by introducing fence-based streaming. The technique would be for
>>> a video source (like a capture driver) to pass dmabufs that aren't filled yet,
>>> but have a companion fence. This would allow queuing requests ahead of time, and
>>> all we need is enough pre-allocation to accommodate the desired look-ahead. Only
>>> issue is that perhaps this violates the fundamental "short term" delivery of
>>> fences. But fences can also fail I think, in case the capture was stopped.
>>>
>> I don't think it would help. Fences are a DRM/GPU thing, where there is no queue.
>> Even with a fence, would the video sink tell us the motion delta here?
> It helps with the latency since the encoder can start its search and analyzes as
> soon as frames are available, instead of until you have all N frames available
> (refer to the MIN_BUFFER_FOR controls used when lookahead is needed).

I think a fence in the GPU world is something attached to a per-frame 
buffer (IN_FENCE) or to the completion of rendering (OUT_FENCE).

So when we enqueue a buffer, what are we expecting from the fence?

I think in KMS you can't enqueue two buffers for the same plane; you 
have to wait for the OUT_FENCE.

>
>>> We can certainly move forward with this as a future solution, or just not
>>> implement a future-aware RC algorithm, in order to avoid the huge task this
>>> involves (and possibly patents?)
>>>
>> I think we should not restrict how userspace (the vendor) operates the
>> hardware.
> Omitting is not restricting. Vendors have to learn to be community members and
> propose/add the tools and APIs they need to support their features. We cannot
> fix vendors in this regard; those who jump over that fence are winning.

This is not about what vendors would do. I was thinking that we are planning 
how to manage the lifetime of the reconstruction buffers and reference 
selection based on a simple GOP model.
What is designed here would become a barrier for a vendor whose 
hardware has more capability than this.

All I want to do here is offer my ideas about how we could achieve 
open interfaces that could cover future needs.

Especially since it is hard to expand the V4L2 uAPIs.

>
> Nicolas


* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-30 15:18               ` Nicolas Dufresne
@ 2023-08-31  9:32                 ` Hsia-Jun Li
  0 siblings, 0 replies; 29+ messages in thread
From: Hsia-Jun Li @ 2023-08-31  9:32 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, linux-kernel, linux-media, Hans Verkuil,
	Sakari Ailus, Andrzej Pietrasiewicz, Michael Tretter,
	Jernej Škrabec, Chen-Yu Tsai, Samuel Holland,
	Thomas Petazzoni, ayaka



On 8/30/23 23:18, Nicolas Dufresne wrote:
> 
> Le mercredi 23 août 2023 à 11:04 +0800, Hsia-Jun Li a écrit :
>>> So far it was suggested to leave H.264 slice header writing to the driver. This
>>> is motivated by the H.264 slice header not being byte-aligned in size, which makes the
>> H.264 and H.265 have byte alignment in the NALU. You don't need the skip-bits
>> feature that can be found in the H1.
> 
> As you said so, I rechecked the H.264 grammar.
> 
> ...
>    slice_header( )
>    slice_data( )
> ...
> 
> There are a lot of variable-size items in the slice_header() syntax and no padding
> bits. And no padding at the start of any of the slice_data() types. So no, the
> slice_header() syntax in H.264 is not byte aligned like you are claiming here.
> It's important to be super accurate about these things, as errors will be made
> otherwise. Please always double check.
To summarize the IRC discussion:
H.264 and VP8 have no such byte-alignment padding bits.
H.265 does, in 7.3.6.1 (General slice segment header syntax).
Also, from 6.1 (Frame syntax) of AV1, I think frame_header_obu contains
everything that software should prepare for a stateless encoder.
VP9 also has trailing_bits() after uncompressed_header() (6.1 Frame
syntax), which meets the byte-alignment requirement.

We may suggest using the hardware write-back or bit-offset write
functions, which likely exist widely because of the non-byte-aligned
bitstream syntax of H.264 and VP8.

With such a hardware capability, we could save a cache operation
compared to doing the stitching in the kernel.
> 
> Nicolas

-- 
Hsia-Jun(Randy) Li

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Stateless Encoding uAPI Discussion and Proposal
  2023-08-23  8:05         ` Paul Kocialkowski
@ 2023-11-15 13:19           ` Paul Kocialkowski
  0 siblings, 0 replies; 29+ messages in thread
From: Paul Kocialkowski @ 2023-11-15 13:19 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: linux-kernel, linux-media, Hans Verkuil, Sakari Ailus,
	Andrzej Pietrasiewicz, Michael Tretter, Jernej Škrabec,
	Chen-Yu Tsai, Samuel Holland, Thomas Petazzoni

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

Hi folks,

Just a quick message on this thread to let you know that we have just published
the code for the H.264 encoding extension to cedrus for the V3/V3s/S3.

You can find more details in the dedicated blog post:
- https://bootlin.com/blog/open-source-linux-kernel-support-for-the-allwinner-v3-v3s-s3-h-264-video-encoder/

And the code is at:
- https://github.com/bootlin/linux/tree/cedrus/h264-encoding
- https://github.com/bootlin/v4l2-cedrus-enc-test

As announced, this doesn't really help advance our uAPI discussion here since
there is no rate-control yet and the stateful controls are reused for
controlling the encoding features (including things like GOP).

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-11-15 13:19 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-11 17:12 Stateless Encoding uAPI Discussion and Proposal Paul Kocialkowski
2023-07-11 18:18 ` Nicolas Dufresne
2023-07-12 14:07   ` Paul Kocialkowski
2023-07-25  3:33     ` Hsia-Jun Li
2023-07-25 12:15       ` Paul Kocialkowski
2023-07-26  2:49         ` Hsia-Jun Li
2023-07-26 19:53           ` Nicolas Dufresne
2023-07-27  2:45             ` Hsia-Jun Li
2023-07-27 17:10               ` Nicolas Dufresne
2023-07-26  8:18   ` Hans Verkuil
2023-08-09 14:43     ` Paul Kocialkowski
2023-08-09 17:24       ` Andrzej Pietrasiewicz
2023-07-21 18:19 ` Michael Grzeschik
2023-07-24 14:03   ` Nicolas Dufresne
2023-07-25  9:09     ` Paul Kocialkowski
2023-07-26 20:02       ` Nicolas Dufresne
2023-08-10 13:44 ` Paul Kocialkowski
2023-08-10 14:34   ` Nicolas Dufresne
2023-08-11 20:08     ` Paul Kocialkowski
2023-08-21 15:13       ` Nicolas Dufresne
2023-08-22  8:30         ` Hsia-Jun Li
2023-08-22 20:31           ` Nicolas Dufresne
2023-08-23  3:04             ` Hsia-Jun Li
2023-08-30 15:10               ` Nicolas Dufresne
2023-08-30 16:51                 ` Randy Li
2023-08-30 15:18               ` Nicolas Dufresne
2023-08-31  9:32                 ` Hsia-Jun Li
2023-08-23  8:05         ` Paul Kocialkowski
2023-11-15 13:19           ` Paul Kocialkowski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).