linux-media.vger.kernel.org archive mirror
* Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
@ 2019-05-15 10:09 Paul Kocialkowski
  2019-05-15 14:42 ` Nicolas Dufresne
                   ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-15 10:09 UTC (permalink / raw)
  To: Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Nicolas Dufresne, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Thierry Reding, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

Hi,

With the Rockchip stateless VPU driver in the works, we now have a
better idea of what the situation is like on platforms other than
Allwinner. This email shares my conclusions about the situation and how
we should update the MPEG-2, H.264 and H.265 controls accordingly.

- Per-slice decoding

We've discussed this one already[0] and Hans has submitted a patch[1]
to implement the required core bits. When we agree it looks good, we
should lift the restriction that all slices must be concatenated and
have them submitted as individual requests.

One question is what to do about other controls. I feel like it would
make sense to always pass all the required controls for decoding the
slice, including the ones that don't change across slices. But there
may be no particular advantage to this and only downsides. Not doing it
and relying on the "control cache" can work, but we need to specify
that only a single stream can be decoded per opened instance of the
v4l2 device. This is the assumption we're going with for handling
multi-slice anyway, so it shouldn't be an issue.

- Annex-B formats

I don't think we have really reached a conclusion on the pixel formats
we want to expose. The main issue is how to deal with codecs that need
the full slice NALU with start code, where the slice_header is
duplicated in raw bitstream form, while others are fine with just the
encoded slice data and the parsed slice header control.

My initial thinking was that we'd need 3 formats:
- One that takes only the compressed slice data (without raw slice
header and start code);
- One that takes both the NALU data (including start code, raw header
and compressed data) and slice header controls;
- One that takes the NALU data but no slice header.

But I no longer think the latter really makes sense in the context of
stateless video decoding.

A side-note: I think we should definitely have data offsets in every
case, so that implementations can just push the whole NALU regardless
of the format if they're lazy.
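
To illustrate, filling in a data offset on the OUTPUT queue would look
roughly like this (a minimal sketch against the multi-planar API; the
offset value and the surrounding setup are just placeholders):

#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Sketch: queue one slice on the OUTPUT queue, with data_offset
 * pointing past the start code so drivers that don't need it can
 * simply skip it. */
static int queue_slice(int video_fd, __u32 index, __u32 slice_size,
		       __u32 buffer_size)
{
	struct v4l2_plane plane = {
		.bytesused   = slice_size,	/* start code + NALU */
		.data_offset = 3,		/* skip the 00 00 01 start code */
		.length      = buffer_size,
	};
	struct v4l2_buffer buf = {
		.type     = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
		.memory   = V4L2_MEMORY_MMAP,
		.index    = index,
		.m.planes = &plane,
		.length   = 1,
	};

	return ioctl(video_fd, VIDIOC_QBUF, &buf);
}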

- Dropping the DPB concept in H.264/H.265

As far as I could understand, the decoded picture buffer (DPB) is a
concept that only makes sense relative to a decoder implementation. The
spec mentions how to manage it with the Hypothetical reference decoder
(Annex C), but that's about it.

What's really in the bitstream is the list of modified short-term and
long-term references, which is enough for every decoder.

For this reason, I strongly believe we should stop talking about DPB in
the controls and just pass these lists augmented with relevant
information for userspace.

I think it should be up to the driver to maintain a DPB and we could
have helpers for common cases. For instance, the rockchip decoder needs
to keep unused entries around[2] and cedrus has the same requirement
for H.264. However for cedrus/H.265, we don't need to do any book-
keeping in particular and can manage with the lists from the bitstream
directly.

- Using flags

The current MPEG-2 controls have lots of u8 values that can be
represented as flags. Using flags also helps with padding.
It's unlikely that we'll get more than 64 flags, so using a u64 by
default for that sounds fine (we definitely do want to keep some room
available and I don't think using 32 bits as a default is good enough).

I think H.264/HEVC per-control flags should also be moved to u64.
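
As a rough sketch of what that could look like (the names below are
made up for illustration, not actual uAPI):

/* Hypothetical example: a single u64 flags member replacing several
 * u8 values, with explicit padding to keep the layout stable. */
#define EXAMPLE_MPEG2_PIC_FLAG_TOP_FIELD_FIRST	(1ULL << 0)
#define EXAMPLE_MPEG2_PIC_FLAG_FRAME_PRED_DCT	(1ULL << 1)
#define EXAMPLE_MPEG2_PIC_FLAG_CONCEALMENT_MV	(1ULL << 2)

struct example_mpeg2_picture {
	__u64	flags;			/* combination of the flags above */
	__u8	picture_coding_type;
	__u8	reserved[7];		/* pad to a 64-bit boundary */
};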

- Clear split of controls and terminology

Some codecs have explicit NAL units that map naturally to controls:
e.g. slice header, pps, sps. I think we should stick to the bitstream
element names for those.

For H.264, that would suggest the following changes:
- renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
- killing v4l2_ctrl_h264_decode_param and having the reference lists
where they belong, which seems to be slice_header;

I'm up for preparing and submitting these control changes and updating
cedrus if they seem agreeable.

What do you think?

Cheers,

Paul

[0]: https://lkml.org/lkml/2019/3/6/82
[1]: https://patchwork.linuxtv.org/patch/55947/
[2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com



* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 10:09 Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support Paul Kocialkowski
@ 2019-05-15 14:42 ` Nicolas Dufresne
  2019-05-15 17:42   ` Paul Kocialkowski
  2019-05-23 21:04 ` Jonas Karlman
  2019-06-03 11:24 ` Thierry Reding
  2 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-15 14:42 UTC (permalink / raw)
  To: Paul Kocialkowski, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman


Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> With the Rockchip stateless VPU driver in the works, we now have a
> better idea of what the situation is like on platforms other than
> Allwinner. This email shares my conclusions about the situation and how
> we should update the MPEG-2, H.264 and H.265 controls accordingly.
> 
> - Per-slice decoding
> 
> We've discussed this one already[0] and Hans has submitted a patch[1]
> to implement the required core bits. When we agree it looks good, we
> should lift the restriction that all slices must be concatenated and
> have them submitted as individual requests.
> 
> One question is what to do about other controls. I feel like it would
> make sense to always pass all the required controls for decoding the
> slice, including the ones that don't change across slices. But there
> may be no particular advantage to this and only downsides. Not doing it
> and relying on the "control cache" can work, but we need to specify
> that only a single stream can be decoded per opened instance of the
> v4l2 device. This is the assumption we're going with for handling
> multi-slice anyway, so it shouldn't be an issue.

My opinion on this is that the m2m instance is a state, and the driver
should be responsible for doing time-division multiplexing across
multiple m2m instance jobs. Doing the time-division multiplexing in
userspace would require some sort of daemon to work properly across
processes. I also think the kernel is a better place for doing resource
access scheduling in general.

> 
> - Annex-B formats
> 
> I don't think we have really reached a conclusion on the pixel formats
> we want to expose. The main issue is how to deal with codecs that need
> the full slice NALU with start code, where the slice_header is
> duplicated in raw bitstream, when others are fine with just the encoded
> slice data and the parsed slice header control.
> 
> My initial thinking was that we'd need 3 formats:
> - One that only takes only the slice compressed data (without raw slice
> header and start code);
> - One that takes both the NALU data (including start code, raw header
> and compressed data) and slice header controls;
> - One that takes the NALU data but no slice header.
> 
> But I no longer think the latter really makes sense in the context of
> stateless video decoding.
> 
> A side-note: I think we should definitely have data offsets in every
> case, so that implementations can just push the whole NALU regardless
> of the format if they're lazy.

I realize that I didn't share our latest research on the subject. So a
slice in the original bitstream is formed of the following blocks
(simplified):

  [nal_header][nal_type][slice_header][slice]

nal_header:
This one is a header used to locate the start and the end of a NAL.
There are two standard forms. The first is the Annex B / start code
form: a sequence of 3 bytes, 0x00 0x00 0x01. You'll often see 4 bytes,
where the first byte is a leading 0 from the previous NAL's padding,
but that is also a totally valid start code. The second form is the AVC
form, notably used in the ISO MP4 container. It is simply the size of
the NAL. In this case you must keep your buffer aligned to NALs, as you
cannot scan from a random location.

nal_type:
It's a bit more than just the type, but it contains at least the NAL
type information. Its size differs between H.264 and HEVC, but in both
cases it is a whole number of bytes.

slice_header:
This contains per-slice parameters, like the modification lists to
apply to the references. This one has a size in bits, not in bytes.

slice:
I don't really know what is in it exactly, but this is the data used to
decode. This part has a special coding called anti-emulation, which
prevents a start code from appearing in it. This coding is present in
both forms, Annex B and AVC (in GStreamer and some reference manuals
they call Annex B the bytestream format).

So, what we notice is that what is currently passed through the Cedrus
driver is:
  [nal_type][slice_header][slice]

This matches what is being passed through VA-API. We can understand
that stripping off the slice_header would be hard, since its size is
in bits. Instead, we pass size and header_bit_size in slice_params.
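
In other words, the per-slice control roughly carries the following
(simplified; the actual field names in the staging header may differ):

/* Simplified view of the per-slice control: the buffer holds
 * [nal_type][slice_header][slice], and these two fields tell the
 * driver how much of it is the (bit-aligned) header part. */
struct example_h264_slice_params {
	__u32	size;			/* total slice size, in bytes */
	__u32	header_bit_size;	/* nal_type + slice_header, in bits */
	/* ... the parsed slice header fields follow ... */
};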

About Rockchip: the RK3288 is a Hantro G1 and has a bit called
start_code_e; when you turn this off, you don't need a start code. As a
side effect, the bitstream becomes identical. We now know that it works
with the ffmpeg branch implemented for cedrus.

Now, what's special about the Hantro G1 (also found on the i.MX8M) is
that it takes care of reading and executing the modification lists
found in the slice header for us. Mostly because I very much disliked
having to pass the p/b0/b1 parameters, Boris implemented in the driver
the transformation from the DPB entries into these p/b0/b1 lists. These
lists are standard; it's basically implementing 8.2.4.1 and 8.2.4.2,
and the following section is the execution of the modification list. As
this list is not modified, it only needs to be calculated per frame. As
a result, we don't need these new lists, and we can work with the same
H264_SLICE format as Cedrus is using.
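
For reference, the per-frame work for a P slice boils down to something
like the following (a simplified sketch of 8.2.4.2 in plain userspace C;
the 'ref' structure and ordering keys are made up for illustration):

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference entry, one per reference picture. */
struct ref {
	int  frame_num_wrap;	/* short-term ordering key */
	int  long_term_pic_num;	/* long-term ordering key */
	bool is_long_term;
};

static int cmp_short_term(const void *a, const void *b)
{
	/* short-term refs come first, by descending FrameNumWrap */
	return ((const struct ref *)b)->frame_num_wrap -
	       ((const struct ref *)a)->frame_num_wrap;
}

static int cmp_long_term(const void *a, const void *b)
{
	/* long-term refs follow, by ascending LongTermPicNum */
	return ((const struct ref *)a)->long_term_pic_num -
	       ((const struct ref *)b)->long_term_pic_num;
}

/* Build the initial (unmodified) P list from the references. */
static size_t build_p_list0(const struct ref *refs, size_t n,
			    struct ref *out)
{
	size_t count = 0, split, i;

	for (i = 0; i < n; i++)
		if (!refs[i].is_long_term)
			out[count++] = refs[i];
	qsort(out, count, sizeof(*out), cmp_short_term);

	split = count;
	for (i = 0; i < n; i++)
		if (refs[i].is_long_term)
			out[count++] = refs[i];
	qsort(out + split, count - split, sizeof(*out), cmp_long_term);

	return count;
}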

Now, this is just a start. For the RK3399, we have a different CODEC
design. This one does not have the start_code_e bit. What the IP does
is that you give it one or more slices per buffer, set up the params
and start decoding, and the decoder then returns the location of the
following NAL. So basically you could offload the scanning of start
codes to the HW. That being said, with the driver layer in between,
that would be amazingly inconvenient to use, and with the Boyer-Moore
algorithm it is pretty cheap to scan this type of start code on the
CPU. The feature this allows is to operate in frame mode, in which you
have 1 interrupt per frame. But it also supports slice mode, with an
interrupt per slice, which is what we decided to use.
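
For what it's worth, even a naive scanner (not Boyer-Moore) is only a
handful of lines:

#include <stddef.h>
#include <stdint.h>

/* Return the offset of the next 00 00 01 start code at or after 'pos',
 * or 'len' if there is none.  An extra leading zero (00 00 00 01) is
 * just padding from the previous NAL and is handled naturally. */
static size_t next_start_code(const uint8_t *buf, size_t len, size_t pos)
{
	while (pos + 3 <= len) {
		if (buf[pos] == 0x00 && buf[pos + 1] == 0x00 &&
		    buf[pos + 2] == 0x01)
			return pos;
		pos++;
	}
	return len;
}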

So in this case, we do strictly require the start code. Though, to me
this is not a great reason to make a new fourcc, so we will try to use
(data_offset = 3) in order to make some space for that start code, and
write it in the driver. This is to be continued; we will report back on
this later. This could have some side effects on the ability to import
buffers, but most userspace doesn't try to do zero-copy on the encoded
side and just copies anyway.

In my opinion, having a single format is a big deal, since userspace
will generally be developed for one specific HW and we would end up
with fragmented support. What we really want to achieve is a driver
interface which works across multiple HW, and I think this is quite
possible.

> 
> - Dropping the DPB concept in H.264/H.265
> 
> As far as I could understand, the decoded picture buffer (DPB) is a
> concept that only makes sense relative to a decoder implementation. The
> spec mentions how to manage it with the Hypothetical reference decoder
> (Annex C), but that's about it.
> 
> What's really in the bitstream is the list of modified short-term and
> long-term references, which is enough for every decoder.
> 
> For this reason, I strongly believe we should stop talking about DPB in
> the controls and just pass these lists agremented with relevant
> information for userspace.
> 
> I think it should be up to the driver to maintain a DPB and we could
> have helpers for common cases. For instance, the rockchip decoder needs
> to keep unused entries around[2] and cedrus has the same requirement
> for H.264. However for cedrus/H.265, we don't need to do any book-
> keeping in particular and can manage with the lists from the bitstream
> directly.

As discussed today, we still need to pass that list. It's being indexed
by the HW to retrieve the extra information we have collected about the
status of the reference frames. In the case of Hantro, which processes
the modification list from the slice header for us, we also need that
list to construct the unmodified list.

So the problem here is just a naming problem. That list is not really a
DPB. It is just the list of long-term/short-term references with the
status of these references. So maybe we could just rename it to
references/reference_entry?
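
Something along these lines, maybe (a purely hypothetical sketch to
illustrate the renaming; the field names are made up):

/* Hypothetical "reference" entry replacing the DPB-named one: the list
 * of short-term/long-term references with their status, which the HW
 * indexes to find the extra per-picture information. */
struct example_h264_reference {
	__u64	timestamp;		/* matches the capture buffer */
	__u32	frame_num;
	__s32	top_field_order_cnt;
	__s32	bottom_field_order_cnt;
	__u32	flags;			/* e.g. LONG_TERM, FIELD, ACTIVE */
};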

> 
> - Using flags
> 
> The current MPEG-2 controls have lots of u8 values that can be
> represented as flags. Using flags also helps with padding.
> It's unlikely that we'll get more than 64 flags, so using a u64 by
> default for that sounds fine (we definitely do want to keep some room
> available and I don't think using 32 bits as a default is good enough).
> 
> I think H.264/HEVC per-control flags should also be moved to u64.

Makes sense. I guess bitfields (member : 1) are not allowed in uAPI, right?

> 
> - Clear split of controls and terminology
> 
> Some codecs have explicit NAL units that are good fits to match as
> controls: e.g. slice header, pps, sps. I think we should stick to the
> bitstream element names for those.
> 
> For H.264, that would suggest the following changes:
> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;

Oops, I think you meant slice_params? decode_params matches the
information found in the SPS/PPS (combined?), while slice_params
matches the information extracted (and executed in the case of l0/l1)
from the slice headers. That being said, to me this name wasn't
confusing, since it's not just the slice header, and it's per slice.

> - killing v4l2_ctrl_h264_decode_param and having the reference lists
> where they belong, which seems to be slice_header;

The reference list is only updated by userspace (through its DPB) based
on the result of the last decoding step. I was very confused for a
moment until I realized that the lists in the slice_header are just a
list of modifications to apply to the reference list in order to
produce l0 and l1.

> 
> I'm up for preparing and submitting these control changes and updating
> cedrus if they seem agreeable.
> 
> What do you think?
> 
> Cheers,
> 
> Paul
> 
> [0]: https://lkml.org/lkml/2019/3/6/82
> [1]: https://patchwork.linuxtv.org/patch/55947/
> [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> 



* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 14:42 ` Nicolas Dufresne
@ 2019-05-15 17:42   ` Paul Kocialkowski
  2019-05-15 18:54     ` Nicolas Dufresne
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-15 17:42 UTC (permalink / raw)
  To: Nicolas Dufresne, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

Hi,

Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > With the Rockchip stateless VPU driver in the works, we now have a
> > better idea of what the situation is like on platforms other than
> > Allwinner. This email shares my conclusions about the situation and how
> > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > 
> > - Per-slice decoding
> > 
> > We've discussed this one already[0] and Hans has submitted a patch[1]
> > to implement the required core bits. When we agree it looks good, we
> > should lift the restriction that all slices must be concatenated and
> > have them submitted as individual requests.
> > 
> > One question is what to do about other controls. I feel like it would
> > make sense to always pass all the required controls for decoding the
> > slice, including the ones that don't change across slices. But there
> > may be no particular advantage to this and only downsides. Not doing it
> > and relying on the "control cache" can work, but we need to specify
> > that only a single stream can be decoded per opened instance of the
> > v4l2 device. This is the assumption we're going with for handling
> > multi-slice anyway, so it shouldn't be an issue.
> 
> My opinion on this is that the m2m instance is a state, and the driver
> should be responsible of doing time-division multiplexing across
> multiple m2m instance jobs. Doing the time-division multiplexing in
> userspace would require some sort of daemon to work properly across
> processes. I also think the kernel is better place for doing resource
> access scheduling in general.

I agree with that yes. We always have a single m2m context and specific
controls per opened device so keeping cached values works out well.

So maybe we shall explicitly require that the request with the first
slice for a frame also contains the per-frame controls.
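
With the request API, that would look roughly like this for the first
slice of a frame (just a sketch; the control IDs and payload structs
are the current staging ones and may still change):

#include <sys/ioctl.h>
#include <linux/videodev2.h>
#include <linux/media.h>
/* The H.264 control IDs and payload structs come from the staging
 * h264-ctrls.h header for now. */

/* Sketch: attach the per-frame controls (SPS/PPS) together with the
 * per-slice control to the request carrying the first slice of a
 * frame, then queue the request. */
static int submit_first_slice(int video_fd, int req_fd,
			      struct v4l2_ctrl_h264_sps *sps,
			      struct v4l2_ctrl_h264_pps *pps,
			      struct v4l2_ctrl_h264_slice_params *slice)
{
	struct v4l2_ext_control ctrls[] = {
		{ .id = V4L2_CID_MPEG_VIDEO_H264_SPS,
		  .ptr = sps, .size = sizeof(*sps) },
		{ .id = V4L2_CID_MPEG_VIDEO_H264_PPS,
		  .ptr = pps, .size = sizeof(*pps) },
		{ .id = V4L2_CID_MPEG_VIDEO_H264_SLICE_PARAMS,
		  .ptr = slice, .size = sizeof(*slice) },
	};
	struct v4l2_ext_controls ext = {
		.which      = V4L2_CTRL_WHICH_REQUEST_VAL,
		.request_fd = req_fd,
		.count      = 3,
		.controls   = ctrls,
	};

	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext))
		return -1;
	/* ... VIDIOC_QBUF the slice buffer with the same request_fd ... */
	return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE, NULL);
}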

> > - Annex-B formats
> > 
> > I don't think we have really reached a conclusion on the pixel formats
> > we want to expose. The main issue is how to deal with codecs that need
> > the full slice NALU with start code, where the slice_header is
> > duplicated in raw bitstream, when others are fine with just the encoded
> > slice data and the parsed slice header control.
> > 
> > My initial thinking was that we'd need 3 formats:
> > - One that only takes only the slice compressed data (without raw slice
> > header and start code);
> > - One that takes both the NALU data (including start code, raw header
> > and compressed data) and slice header controls;
> > - One that takes the NALU data but no slice header.
> > 
> > But I no longer think the latter really makes sense in the context of
> > stateless video decoding.
> > 
> > A side-note: I think we should definitely have data offsets in every
> > case, so that implementations can just push the whole NALU regardless
> > of the format if they're lazy.
> 
> I realize that I didn't share our latest research on the subject. So a
> slice in the original bitstream is formed of the following blocks
> (simplified):
> 
>   [nal_header][nal_type][slice_header][slice]

Thanks for the details!

> nal_header:
> This one is a header used to locate the start and the end of the of a
> NAL. There is two standard forms, the ANNEX B / start code, a sequence
> of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> would be a leading 0 from the previous NAL padding, but this is also
> totally valid start code. The second form is the AVC form, notably used
> in ISOMP4 container. It simply is the size of the NAL. You must keep
> your buffer aligned to NALs in this case as you cannot scan from random
> location.
> 
> nal_type:
> It's a bit more then just the type, but it contains at least the
> information of the nal type. This has different size on H.264 and HEVC
> but I know it's size is in bytes.
> 
> slice_header:
> This contains per slice parameters, like the modification lists to
> apply on the references. This one has a size in bits, not in bytes.
> 
> slice:
> I don't really know what is in it exactly, but this is the data used to
> decode. This bit has a special coding called the anti-emulation, which
> prevents a start-code from appearing in it. This coding is present in
> both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> call ANNEX-B the bytestream format).
> 
> So, what we notice is that what is currently passed through Cedrus
> driver:
>   [nal_type][slice_header][slice]
> 
> This matches what is being passed through VA-API. We can understand
> that stripping off the slice_header would be hard, since it's size is
> in bits. Instead we pass size and header_bit_size in slice_params.

True, there is that.

> About Rockchip. RK3288 is a Hantro G1 and has a bit called
> start_code_e, when you turn this off, you don't need start code. As a
> side effect, the bitstream becomes identical. We do now know that it
> works with the ffmpeg branch implement for cedrus.

Oh great, that makes life easier in the short term, but I guess the
issue could arise on another decoder sooner or later.

> Now what's special about Hantro G1 (also found on IMX8M) is that it
> take care for us of reading and executing the modification lists found
> in the slice header. Mostly because I very disliked having to pass the
> p/b0/b1 parameters, is that Boris implemented in the driver the
> transformation from the DPB entries into this p/b0/b1 list. These list
> a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> following section is the execution of the modification list. As this
> list is not modified, it only need to be calculated per frame. As a
> result, we don't need these new lists, and we can work with the same
> H264_SLICE format as Cedrus is using.

Yes, but I definitely think it makes more sense to pass the list
modifications rather than reconstructing those in the driver from a
full list. IMO controls should stick to the bitstream as closely as
possible.

> Now, this is just a start. For RK3399, we have a different CODEC
> design. This one does not have the start_code_e bit. What the IP does,
> is that you give it one or more slice per buffer, setup the params,
> start decoding, but the decoder then return the location of the
> following NAL. So basically you could offload the scanning of start
> code to the HW. That being said, with the driver layer in between, that
> would be amazingly inconvenient to use, and with Boyer-more algorithm,
> it is pretty cheap to scan this type of start-code on CPU. But the
> feature that this allows is to operate in frame mode. In this mode, you
> have 1 interrupt per frame.

I'm not sure there is any interest in exposing that from userspace and
my current feeling is that we should just ditch support for per-frame
decoding altogether. I think it mixes decoding with notions that are
higher-level than decoding, but I agree it's a blurry line.

> But it also support slice mode, with an
> interrupt per slice, which is what we decided to use.

Easier for everyone and probably better for latency as well :)

> So in this case, indeed we strictly require on start-code. Though, to
> me this is not a great reason to make a new fourcc, so we will try and
> use (data_offset = 3) in order to make some space for that start code,
> and write it down in the driver. This is to be continued, we will
> report back on this later. This could have some side effect in the
> ability to import buffers. But most userspace don't try to do zero-copy 
> on the encoded size and just copy anyway.
> 
> To my opinion, having a single format is a big deal, since userspace
> will generally be developed for one specific HW and we would endup with
> fragmented support. What we really want to achieve is having a driver
> interface which works across multiple HW, and I think this is quite
> possible.

I agree with that. The more I think about it, the more I believe we
should just pass the whole [nal_header][nal_type][slice_header][slice]
and the parsed list in every scenario.

For H.265, our decoder needs some information from the NAL type too.
We currently extract that in userspace and stick it in the
slice_header, but maybe it would make more sense to have drivers parse
that info from the buffer if they need it. On the other hand, it seems
quite common to pass information from the NAL type, so maybe we should
either make a new control for it or have all the fields in the
slice_header (which would still be wrong in terms of matching the
bitstream description).

> > - Dropping the DPB concept in H.264/H.265
> > 
> > As far as I could understand, the decoded picture buffer (DPB) is a
> > concept that only makes sense relative to a decoder implementation. The
> > spec mentions how to manage it with the Hypothetical reference decoder
> > (Annex C), but that's about it.
> > 
> > What's really in the bitstream is the list of modified short-term and
> > long-term references, which is enough for every decoder.
> > 
> > For this reason, I strongly believe we should stop talking about DPB in
> > the controls and just pass these lists agremented with relevant
> > information for userspace.
> > 
> > I think it should be up to the driver to maintain a DPB and we could
> > have helpers for common cases. For instance, the rockchip decoder needs
> > to keep unused entries around[2] and cedrus has the same requirement
> > for H.264. However for cedrus/H.265, we don't need to do any book-
> > keeping in particular and can manage with the lists from the bitstream
> > directly.
> 
> As discusses today, we still need to pass that list. It's being index
> by the HW to retrieve the extra information we have collected about the
> status of the reference frames. In the case of Hantro, which process
> the modification list from the slice header for us, we also need that
> list to construct the unmodified list.
> 
> So the problem here is just a naming problem. That list is not really a
> DPB. It is just the list of long-term/short-term references with the
> status of these references. So maybe we could just rename as
> references/reference_entry ?

What I'd like to pass is the diff to the references list, as ffmpeg
currently provides for v4l2 request and vaapi (probably vdpau too). No
functional change here, only that we should stop calling it a DPB,
which confuses everyone.

> > - Using flags
> > 
> > The current MPEG-2 controls have lots of u8 values that can be
> > represented as flags. Using flags also helps with padding.
> > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > default for that sounds fine (we definitely do want to keep some room
> > available and I don't think using 32 bits as a default is good enough).
> > 
> > I think H.264/HEVC per-control flags should also be moved to u64.
> 
> Make sense, I guess bits (member : 1) are not allowed in uAPI right ?

Mhh, even if they are, it makes it much harder to verify 32/64 bit
alignment constraints (we're dealing with 64-bit platforms that need to
have 32-bit userspace and compat_ioctl).
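
A tiny example of what that constraint means in practice (hypothetical
structs, just to illustrate):

/* With fixed-size members and explicit padding, the layout is the same
 * for 32-bit and 64-bit userspace, so no compat translation is needed: */
struct example_ctrl_ok {
	__u64	flags;
	__u32	pic_order_cnt_lsb;
	__u32	reserved;	/* explicit padding, keeps the size a
				 * multiple of 8 on both ABIs */
};

/* With bitfields, the compiler chooses the packing and bit order,
 * which is exactly what a uAPI structure cannot rely on: */
struct example_ctrl_problematic {
	unsigned int field_pic_flag : 1;
	unsigned int bottom_field_flag : 1;
	/* ... */
};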

> > - Clear split of controls and terminology
> > 
> > Some codecs have explicit NAL units that are good fits to match as
> > controls: e.g. slice header, pps, sps. I think we should stick to the
> > bitstream element names for those.
> > 
> > For H.264, that would suggest the following changes:
> > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> 
> Oops, I think you meant slice_prams ? decode_params matches the
> information found in SPS/PPS (combined?), while slice_params matches
> the information extracted (and executed in case of l0/l1) from the
> slice headers.

Yes you're right, I mixed them up.

>  That being said, to me this name wasn't confusing, since
> it's not just the slice header, and it's per slice.

Mhh, what exactly remains in there and where does it originate in the
bitstream? Maybe it wouldn't be too bad to have one control per actual
group of bitstream elements.

> > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > where they belong, which seems to be slice_header;
> 
> There reference list is only updated by userspace (through it's DPB)
> base on the result of the last decoding step. I was very confused for a
> moment until I realize that the lists in the slice_header are just a
> list of modification to apply to the reference list in order to produce
> l0 and l1.

Indeed, and I'm suggesting that we pass the modifications only, which
would fit a slice_header control.
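
Concretely, the control could mirror the ref_pic_list_modification()
syntax elements from the slice header more or less verbatim, e.g.
(hypothetical layout):

/* Hypothetical: one entry per ref_pic_list_modification() element of
 * the slice header (H.264 section 7.3.3.1).  A value of 3 for the idc
 * field terminates the list, as in the bitstream. */
struct example_h264_ref_list_modification {
	__u8	modification_of_pic_nums_idc;
	__u8	reserved[3];
	__u32	abs_diff_pic_num_minus1;	/* valid for idc 0 and 1 */
	__u32	long_term_pic_num;		/* valid for idc 2 */
};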

Cheers,

Paul

> > I'm up for preparing and submitting these control changes and updating
> > cedrus if they seem agreeable.
> > 
> > What do you think?
> > 
> > Cheers,
> > 
> > Paul
> > 
> > [0]: https://lkml.org/lkml/2019/3/6/82
> > [1]: https://patchwork.linuxtv.org/patch/55947/
> > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > 



* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 17:42   ` Paul Kocialkowski
@ 2019-05-15 18:54     ` Nicolas Dufresne
  2019-05-15 20:59       ` Paul Kocialkowski
  2019-05-21 10:27     ` Tomasz Figa
  2019-05-21 15:43     ` Thierry Reding
  2 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-15 18:54 UTC (permalink / raw)
  To: Paul Kocialkowski, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman


Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > With the Rockchip stateless VPU driver in the works, we now have a
> > > better idea of what the situation is like on platforms other than
> > > Allwinner. This email shares my conclusions about the situation and how
> > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > 
> > > - Per-slice decoding
> > > 
> > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > to implement the required core bits. When we agree it looks good, we
> > > should lift the restriction that all slices must be concatenated and
> > > have them submitted as individual requests.
> > > 
> > > One question is what to do about other controls. I feel like it would
> > > make sense to always pass all the required controls for decoding the
> > > slice, including the ones that don't change across slices. But there
> > > may be no particular advantage to this and only downsides. Not doing it
> > > and relying on the "control cache" can work, but we need to specify
> > > that only a single stream can be decoded per opened instance of the
> > > v4l2 device. This is the assumption we're going with for handling
> > > multi-slice anyway, so it shouldn't be an issue.
> > 
> > My opinion on this is that the m2m instance is a state, and the driver
> > should be responsible of doing time-division multiplexing across
> > multiple m2m instance jobs. Doing the time-division multiplexing in
> > userspace would require some sort of daemon to work properly across
> > processes. I also think the kernel is better place for doing resource
> > access scheduling in general.
> 
> I agree with that yes. We always have a single m2m context and specific
> controls per opened device so keeping cached values works out well.
> 
> So maybe we shall explicitly require that the request with the first
> slice for a frame also contains the per-frame controls.
> 
> > > - Annex-B formats
> > > 
> > > I don't think we have really reached a conclusion on the pixel formats
> > > we want to expose. The main issue is how to deal with codecs that need
> > > the full slice NALU with start code, where the slice_header is
> > > duplicated in raw bitstream, when others are fine with just the encoded
> > > slice data and the parsed slice header control.
> > > 
> > > My initial thinking was that we'd need 3 formats:
> > > - One that only takes only the slice compressed data (without raw slice
> > > header and start code);
> > > - One that takes both the NALU data (including start code, raw header
> > > and compressed data) and slice header controls;
> > > - One that takes the NALU data but no slice header.
> > > 
> > > But I no longer think the latter really makes sense in the context of
> > > stateless video decoding.
> > > 
> > > A side-note: I think we should definitely have data offsets in every
> > > case, so that implementations can just push the whole NALU regardless
> > > of the format if they're lazy.
> > 
> > I realize that I didn't share our latest research on the subject. So a
> > slice in the original bitstream is formed of the following blocks
> > (simplified):
> > 
> >   [nal_header][nal_type][slice_header][slice]
> 
> Thanks for the details!
> 
> > nal_header:
> > This one is a header used to locate the start and the end of the of a
> > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > would be a leading 0 from the previous NAL padding, but this is also
> > totally valid start code. The second form is the AVC form, notably used
> > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > your buffer aligned to NALs in this case as you cannot scan from random
> > location.
> > 
> > nal_type:
> > It's a bit more then just the type, but it contains at least the
> > information of the nal type. This has different size on H.264 and HEVC
> > but I know it's size is in bytes.
> > 
> > slice_header:
> > This contains per slice parameters, like the modification lists to
> > apply on the references. This one has a size in bits, not in bytes.
> > 
> > slice:
> > I don't really know what is in it exactly, but this is the data used to
> > decode. This bit has a special coding called the anti-emulation, which
> > prevents a start-code from appearing in it. This coding is present in
> > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > call ANNEX-B the bytestream format).
> > 
> > So, what we notice is that what is currently passed through Cedrus
> > driver:
> >   [nal_type][slice_header][slice]
> > 
> > This matches what is being passed through VA-API. We can understand
> > that stripping off the slice_header would be hard, since it's size is
> > in bits. Instead we pass size and header_bit_size in slice_params.
> 
> True, there is that.
> 
> > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > start_code_e, when you turn this off, you don't need start code. As a
> > side effect, the bitstream becomes identical. We do now know that it
> > works with the ffmpeg branch implement for cedrus.
> 
> Oh great, that makes life easier in the short term, but I guess the
> issue could arise on another decoder sooner or later.
> 
> > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > take care for us of reading and executing the modification lists found
> > in the slice header. Mostly because I very disliked having to pass the
> > p/b0/b1 parameters, is that Boris implemented in the driver the
> > transformation from the DPB entries into this p/b0/b1 list. These list
> > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > following section is the execution of the modification list. As this
> > list is not modified, it only need to be calculated per frame. As a
> > result, we don't need these new lists, and we can work with the same
> > H264_SLICE format as Cedrus is using.
> 
> Yes but I definitely think it makes more sense to pass the list
> modifications rather than reconstructing those in the driver from a
> full list. IMO controls should stick to the bitstream as close as
> possible.

For Hantro and RKVDEC, the modification list is parsed by the IP from
the slice header bits. Just to make sure, because I was myself confused
on this before: the slice header does not contain a list of references;
instead it contains a list of modifications to be applied to the
reference list. I need to check again, but to execute these
modifications, you need to filter and sort the references in a specific
order. This should be what is defined in the spec as 8.2.4.1 and
8.2.4.2. Then 8.2.4.3 is the process that creates l0/l1.

The list of references is deduced from the DPB. The DPB, which I think
should be renamed to "references", seems more useful than p/b0/b1,
since this is the data that gives us the ability to implement glue in
the driver to compensate for some HW differences.

In the case of Hantro / RKVDEC, we think it's natural to build the
HW-specific lists (p/b0/b1) from the references rather than adding
HW-specific lists to the decode_params structure. The fact that these
lists are a standard intermediate step of the spec is not that
important.

> 
> > Now, this is just a start. For RK3399, we have a different CODEC
> > design. This one does not have the start_code_e bit. What the IP does,
> > is that you give it one or more slice per buffer, setup the params,
> > start decoding, but the decoder then return the location of the
> > following NAL. So basically you could offload the scanning of start
> > code to the HW. That being said, with the driver layer in between, that
> > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > it is pretty cheap to scan this type of start-code on CPU. But the
> > feature that this allows is to operate in frame mode. In this mode, you
> > have 1 interrupt per frame.
> 
> I'm not sure there is any interest in exposing that from userspace and
> my current feeling is that we should just ditch support for per-frame
> decoding altogether. I think it mixes decoding with notions that are
> higher-level than decoding, but I agree it's a blurry line.

I'm not worried about this either. We can already support that by
copying the bitstream internally to the driver, though zero-copy with
this would require a new format, the one we talked about,
SLICE_ANNEX_B.

> 
> > But it also support slice mode, with an
> > interrupt per slice, which is what we decided to use.
> 
> Easier for everyone and probably better for latency as well :)
> 
> > So in this case, indeed we strictly require on start-code. Though, to
> > me this is not a great reason to make a new fourcc, so we will try and
> > use (data_offset = 3) in order to make some space for that start code,
> > and write it down in the driver. This is to be continued, we will
> > report back on this later. This could have some side effect in the
> > ability to import buffers. But most userspace don't try to do zero-copy 
> > on the encoded size and just copy anyway.
> > 
> > To my opinion, having a single format is a big deal, since userspace
> > will generally be developed for one specific HW and we would endup with
> > fragmented support. What we really want to achieve is having a driver
> > interface which works across multiple HW, and I think this is quite
> > possible.
> 
> I agree with that. The more I think about it, the more I believe we
> should just pass the whole [nal_header][nal_type][slice_header][slice]
> and the parsed list in every scenario.

What I like about the cut at nal_type is that there is only one format.
If we cut at nal_header, then we need to expose 2 formats. It also
makes our API similar to other accelerator APIs, so it's easy to
"convert" existing userspace.

> 
> For H.265, our decoder needs some information from the NAL type too.
> We currently extract that in userspace and stick it to the
> slice_header, but maybe it would make more sense to have drivers parse
> that info from the buffer if they need it. On the other hand, it seems
> quite common to pass information from the NAL type, so maybe we should
> either make a new control for it or have all the fields in the
> slice_header (which would still be wrong in terms of matching bitstream
> description).

Even in userspace, it's common to just parse this in place; it's a
simple mask. But yes, if we don't have it yet, we should expose the NAL
type; it would be cleaner.
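
For reference, the masks in question, assuming 'nal' points at the
first byte after the start code or AVC length prefix:

#include <stdint.h>

/* H.264: nal_unit_type is the low 5 bits of the first NAL byte. */
static inline unsigned int h264_nal_type(const uint8_t *nal)
{
	return nal[0] & 0x1f;
}

/* HEVC: nal_unit_type is bits 6..1 of the first byte of the two-byte
 * NAL unit header. */
static inline unsigned int hevc_nal_type(const uint8_t *nal)
{
	return (nal[0] >> 1) & 0x3f;
}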

> 
> > > - Dropping the DPB concept in H.264/H.265
> > > 
> > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > concept that only makes sense relative to a decoder implementation. The
> > > spec mentions how to manage it with the Hypothetical reference decoder
> > > (Annex C), but that's about it.
> > > 
> > > What's really in the bitstream is the list of modified short-term and
> > > long-term references, which is enough for every decoder.
> > > 
> > > For this reason, I strongly believe we should stop talking about DPB in
> > > the controls and just pass these lists agremented with relevant
> > > information for userspace.
> > > 
> > > I think it should be up to the driver to maintain a DPB and we could
> > > have helpers for common cases. For instance, the rockchip decoder needs
> > > to keep unused entries around[2] and cedrus has the same requirement
> > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > keeping in particular and can manage with the lists from the bitstream
> > > directly.
> > 
> > As discusses today, we still need to pass that list. It's being index
> > by the HW to retrieve the extra information we have collected about the
> > status of the reference frames. In the case of Hantro, which process
> > the modification list from the slice header for us, we also need that
> > list to construct the unmodified list.
> > 
> > So the problem here is just a naming problem. That list is not really a
> > DPB. It is just the list of long-term/short-term references with the
> > status of these references. So maybe we could just rename as
> > references/reference_entry ?
> 
> What I'd like to pass is the diff to the references list, as ffmpeg
> currently provides for v4l2 request and vaapi (probably vdpau too). No
> functional change here, only that we should stop calling it a DPB,
> which confuses everyone.

Yes.

> 
> > > - Using flags
> > > 
> > > The current MPEG-2 controls have lots of u8 values that can be
> > > represented as flags. Using flags also helps with padding.
> > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > default for that sounds fine (we definitely do want to keep some room
> > > available and I don't think using 32 bits as a default is good enough).
> > > 
> > > I think H.264/HEVC per-control flags should also be moved to u64.
> > 
> > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> 
> Mhh, even if they are, it makes it much harder to verify 32/64 bit
> alignment constraints (we're dealing with 64-bit platforms that need to
> have 32-bit userspace and compat_ioctl).

I see, thanks.

> 
> > > - Clear split of controls and terminology
> > > 
> > > Some codecs have explicit NAL units that are good fits to match as
> > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > bitstream element names for those.
> > > 
> > > For H.264, that would suggest the following changes:
> > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > 
> > Oops, I think you meant slice_prams ? decode_params matches the
> > information found in SPS/PPS (combined?), while slice_params matches
> > the information extracted (and executed in case of l0/l1) from the
> > slice headers.
> 
> Yes you're right, I mixed them up.
> 
> >  That being said, to me this name wasn't confusing, since
> > it's not just the slice header, and it's per slice.
> 
> Mhh, what exactly remains in there and where does it originate in the
> bitstream? Maybe it wouldn't be too bad to have one control per actual
> group of bitstream elements.
> 
> > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > where they belong, which seems to be slice_header;
> > 
> > There reference list is only updated by userspace (through it's DPB)
> > base on the result of the last decoding step. I was very confused for a
> > moment until I realize that the lists in the slice_header are just a
> > list of modification to apply to the reference list in order to produce
> > l0 and l1.
> 
> Indeed, and I'm suggesting that we pass the modifications only, which
> would fit a slice_header control.

I think I made my point about why we want the dpb -> references rename.
I'm going to validate with the VA driver now, to see if the references
list there is usable with our code.

> 
> Cheers,
> 
> Paul
> 
> > > I'm up for preparing and submitting these control changes and updating
> > > cedrus if they seem agreeable.
> > > 
> > > What do you think?
> > > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > 



* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 18:54     ` Nicolas Dufresne
@ 2019-05-15 20:59       ` Paul Kocialkowski
  2019-05-16 18:24         ` Nicolas Dufresne
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-15 20:59 UTC (permalink / raw)
  To: Nicolas Dufresne, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

Hi,

Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > better idea of what the situation is like on platforms other than
> > > > Allwinner. This email shares my conclusions about the situation and how
> > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > 
> > > > - Per-slice decoding
> > > > 
> > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > to implement the required core bits. When we agree it looks good, we
> > > > should lift the restriction that all slices must be concatenated and
> > > > have them submitted as individual requests.
> > > > 
> > > > One question is what to do about other controls. I feel like it would
> > > > make sense to always pass all the required controls for decoding the
> > > > slice, including the ones that don't change across slices. But there
> > > > may be no particular advantage to this and only downsides. Not doing it
> > > > and relying on the "control cache" can work, but we need to specify
> > > > that only a single stream can be decoded per opened instance of the
> > > > v4l2 device. This is the assumption we're going with for handling
> > > > multi-slice anyway, so it shouldn't be an issue.
> > > 
> > > My opinion on this is that the m2m instance is a state, and the driver
> > > should be responsible of doing time-division multiplexing across
> > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > userspace would require some sort of daemon to work properly across
> > > processes. I also think the kernel is better place for doing resource
> > > access scheduling in general.
> > 
> > I agree with that yes. We always have a single m2m context and specific
> > controls per opened device so keeping cached values works out well.
> > 
> > So maybe we shall explicitly require that the request with the first
> > slice for a frame also contains the per-frame controls.
> > 
> > > > - Annex-B formats
> > > > 
> > > > I don't think we have really reached a conclusion on the pixel formats
> > > > we want to expose. The main issue is how to deal with codecs that need
> > > > the full slice NALU with start code, where the slice_header is
> > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > slice data and the parsed slice header control.
> > > > 
> > > > My initial thinking was that we'd need 3 formats:
> > > > - One that only takes only the slice compressed data (without raw slice
> > > > header and start code);
> > > > - One that takes both the NALU data (including start code, raw header
> > > > and compressed data) and slice header controls;
> > > > - One that takes the NALU data but no slice header.
> > > > 
> > > > But I no longer think the latter really makes sense in the context of
> > > > stateless video decoding.
> > > > 
> > > > A side-note: I think we should definitely have data offsets in every
> > > > case, so that implementations can just push the whole NALU regardless
> > > > of the format if they're lazy.
> > > 
> > > I realize that I didn't share our latest research on the subject. So a
> > > slice in the original bitstream is formed of the following blocks
> > > (simplified):
> > > 
> > >   [nal_header][nal_type][slice_header][slice]
> > 
> > Thanks for the details!
> > 
> > > nal_header:
> > > This one is a header used to locate the start and the end of the of a
> > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > would be a leading 0 from the previous NAL padding, but this is also
> > > totally valid start code. The second form is the AVC form, notably used
> > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > your buffer aligned to NALs in this case as you cannot scan from random
> > > location.
> > > 
> > > nal_type:
> > > It's a bit more then just the type, but it contains at least the
> > > information of the nal type. This has different size on H.264 and HEVC
> > > but I know it's size is in bytes.
> > > 
> > > slice_header:
> > > This contains per slice parameters, like the modification lists to
> > > apply on the references. This one has a size in bits, not in bytes.
> > > 
> > > slice:
> > > I don't really know what is in it exactly, but this is the data used to
> > > decode. This bit has a special coding called the anti-emulation, which
> > > prevents a start-code from appearing in it. This coding is present in
> > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > call ANNEX-B the bytestream format).
> > > 
> > > So, what we notice is that what is currently passed through Cedrus
> > > driver:
> > >   [nal_type][slice_header][slice]
> > > 
> > > This matches what is being passed through VA-API. We can understand
> > > that stripping off the slice_header would be hard, since it's size is
> > > in bits. Instead we pass size and header_bit_size in slice_params.
> > 
> > True, there is that.
> > 
> > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > start_code_e, when you turn this off, you don't need start code. As a
> > > side effect, the bitstream becomes identical. We do now know that it
> > > works with the ffmpeg branch implement for cedrus.
> > 
> > Oh great, that makes life easier in the short term, but I guess the
> > issue could arise on another decoder sooner or later.
> > 
> > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > take care for us of reading and executing the modification lists found
> > > in the slice header. Mostly because I very disliked having to pass the
> > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > following section is the execution of the modification list. As this
> > > list is not modified, it only need to be calculated per frame. As a
> > > result, we don't need these new lists, and we can work with the same
> > > H264_SLICE format as Cedrus is using.
> > 
> > Yes but I definitely think it makes more sense to pass the list
> > modifications rather than reconstructing those in the driver from a
> > full list. IMO controls should stick to the bitstream as close as
> > possible.
> 
> For Hantro and RKVDEC, the list of modification is parsed by the IP
> from the slice header bits. Just to make sure, because I myself was
> confused on this before, the slice header does not contain a list of
> references, instead it contains a list modification to be applied to
> the reference list. I need to check again, but to execute these
> modification, you need to filter and sort the references in a specific
> order. This should be what is defined in the spec as 8.2.4.1 and
> 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> 
> The list of references is deduced from the DPB. The DPB, which I thinks
> should be rename as "references", seems more useful then p/b0/b1, since
> this is the data that gives use the ability to implementing glue in the
> driver to compensate some HW differences.
> 
> In the case of Hantro / RKVDEC, we think it's natural to build the HW
> specific lists (p/b0/b1) from the references rather then adding HW
> specific list in the decode_params structure. The fact these lists are
> standard intermediate step of the standard is not that important.

Sorry I got confused (once more) about it. Boris just explained the
same thing to me over IRC :) Anyway my point is that we want to pass
what's in ffmpeg's short and long term ref lists, and name them that
instead of dpb.

> > > Now, this is just a start. For RK3399, we have a different CODEC
> > > design. This one does not have the start_code_e bit. What the IP does,
> > > is that you give it one or more slice per buffer, setup the params,
> > > start decoding, but the decoder then return the location of the
> > > following NAL. So basically you could offload the scanning of start
> > > code to the HW. That being said, with the driver layer in between, that
> > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > feature that this allows is to operate in frame mode. In this mode, you
> > > have 1 interrupt per frame.
> > 
> > I'm not sure there is any interest in exposing that from userspace and
> > my current feeling is that we should just ditch support for per-frame
> > decoding altogether. I think it mixes decoding with notions that are
> > higher-level than decoding, but I agree it's a blurry line.
> 
> I'm not worried about this either. We can already support that by
> copying the bitstream internally to the driver, though zero-copy with
> this would require a new format, the one we talked about,
> SLICE_ANNEX_B.

Right, but what I'm thinking about is making that the one and only
format. The rationale is that it's always easier to just prepend a
start code from userspace if needed. And we need a bit offset to the
slice data part anyway, so it doesn't hurt to require a few extra bytes
to have the whole thing work in every situation.

To me the breaking point was about having the slice header both in raw
bitstream and parsed forms. Since we agree that's fine, we might as
well push it to its logical conclusion and include all the bits that
can be useful.
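
For an AVC/ISO-MP4 source, userspace would then simply write the start
code itself when filling the OUTPUT buffer, along these lines (sketch):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy one NAL (without its AVC length prefix) into the OUTPUT
 * buffer, prepending an Annex-B start code.  Returns the number of
 * bytes used. */
static size_t fill_slice_buffer(uint8_t *dst, const uint8_t *nal,
				size_t nal_len)
{
	static const uint8_t start_code[3] = { 0x00, 0x00, 0x01 };

	memcpy(dst, start_code, sizeof(start_code));
	memcpy(dst + sizeof(start_code), nal, nal_len);

	return sizeof(start_code) + nal_len;
}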

> > > But it also support slice mode, with an
> > > interrupt per slice, which is what we decided to use.
> > 
> > Easier for everyone and probably better for latency as well :)
> > 
> > > So in this case, indeed we strictly require on start-code. Though, to
> > > me this is not a great reason to make a new fourcc, so we will try and
> > > use (data_offset = 3) in order to make some space for that start code,
> > > and write it down in the driver. This is to be continued, we will
> > > report back on this later. This could have some side effect in the
> > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > on the encoded size and just copy anyway.
> > > 
> > > To my opinion, having a single format is a big deal, since userspace
> > > will generally be developed for one specific HW and we would endup with
> > > fragmented support. What we really want to achieve is having a driver
> > > interface which works across multiple HW, and I think this is quite
> > > possible.
> > 
> > I agree with that. The more I think about it, the more I believe we
> > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > and the parsed list in every scenario.
> 
> What I like of the cut at nal_type, is that there is only format. If we
> cut at nal_header, then we need to expose 2 formats. And it makes our
> API similar to other accelerator API, so it's easy to "convert"
> existing userspace.

Unless we make that cut the single one and only true cut that shall
supersede all other cuts :)

> > For H.265, our decoder needs some information from the NAL type too.
> > We currently extract that in userspace and stick it to the
> > slice_header, but maybe it would make more sense to have drivers parse
> > that info from the buffer if they need it. On the other hand, it seems
> > quite common to pass information from the NAL type, so maybe we should
> > either make a new control for it or have all the fields in the
> > slice_header (which would still be wrong in terms of matching bitstream
> > description).
> 
> Even in userspace, it's common to just parse this in place, it's a
> simple mask. But yes, if we don't have it yet, we should expose the NAL
> type, it would be cleaner.

Right, works for me.

Cheers,

Paul

> > > > - Dropping the DPB concept in H.264/H.265
> > > > 
> > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > concept that only makes sense relative to a decoder implementation. The
> > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > (Annex C), but that's about it.
> > > > 
> > > > What's really in the bitstream is the list of modified short-term and
> > > > long-term references, which is enough for every decoder.
> > > > 
> > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > the controls and just pass these lists agremented with relevant
> > > > information for userspace.
> > > > 
> > > > I think it should be up to the driver to maintain a DPB and we could
> > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > keeping in particular and can manage with the lists from the bitstream
> > > > directly.
> > > 
> > > As discusses today, we still need to pass that list. It's being index
> > > by the HW to retrieve the extra information we have collected about the
> > > status of the reference frames. In the case of Hantro, which process
> > > the modification list from the slice header for us, we also need that
> > > list to construct the unmodified list.
> > > 
> > > So the problem here is just a naming problem. That list is not really a
> > > DPB. It is just the list of long-term/short-term references with the
> > > status of these references. So maybe we could just rename as
> > > references/reference_entry ?
> > 
> > What I'd like to pass is the diff to the references list, as ffmpeg
> > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > functional change here, only that we should stop calling it a DPB,
> > which confuses everyone.
> 
> Yes.
> 
> > > > - Using flags
> > > > 
> > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > represented as flags. Using flags also helps with padding.
> > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > default for that sounds fine (we definitely do want to keep some room
> > > > available and I don't think using 32 bits as a default is good enough).
> > > > 
> > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > 
> > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > 
> > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > alignment constraints (we're dealing with 64-bit platforms that need to
> > have 32-bit userspace and compat_ioctl).
> 
> I see, thanks.
> 
> > > > - Clear split of controls and terminology
> > > > 
> > > > Some codecs have explicit NAL units that are good fits to match as
> > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > bitstream element names for those.
> > > > 
> > > > For H.264, that would suggest the following changes:
> > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > 
> > > Oops, I think you meant slice_prams ? decode_params matches the
> > > information found in SPS/PPS (combined?), while slice_params matches
> > > the information extracted (and executed in case of l0/l1) from the
> > > slice headers.
> > 
> > Yes you're right, I mixed them up.
> > 
> > >  That being said, to me this name wasn't confusing, since
> > > it's not just the slice header, and it's per slice.
> > 
> > Mhh, what exactly remains in there and where does it originate in the
> > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > group of bitstream elements.
> > 
> > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > where they belong, which seems to be slice_header;
> > > 
> > > There reference list is only updated by userspace (through it's DPB)
> > > base on the result of the last decoding step. I was very confused for a
> > > moment until I realize that the lists in the slice_header are just a
> > > list of modification to apply to the reference list in order to produce
> > > l0 and l1.
> > 
> > Indeed, and I'm suggesting that we pass the modifications only, which
> > would fit a slice_header control.
> 
> I think I made my point why we want the dpb -> references. I'm going to
> validate with the VA driver now, to see if the references list there is
> usable with our code.
> 
> > Cheers,
> > 
> > Paul
> > 
> > > > I'm up for preparing and submitting these control changes and updating
> > > > cedrus if they seem agreeable.
> > > > 
> > > > What do you think?
> > > > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > 



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 20:59       ` Paul Kocialkowski
@ 2019-05-16 18:24         ` Nicolas Dufresne
  2019-05-16 18:45           ` Paul Kocialkowski
  0 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-16 18:24 UTC (permalink / raw)
  To: Paul Kocialkowski, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 19122 bytes --]

Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > better idea of what the situation is like on platforms other than
> > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > 
> > > > > - Per-slice decoding
> > > > > 
> > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > to implement the required core bits. When we agree it looks good, we
> > > > > should lift the restriction that all slices must be concatenated and
> > > > > have them submitted as individual requests.
> > > > > 
> > > > > One question is what to do about other controls. I feel like it would
> > > > > make sense to always pass all the required controls for decoding the
> > > > > slice, including the ones that don't change across slices. But there
> > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > and relying on the "control cache" can work, but we need to specify
> > > > > that only a single stream can be decoded per opened instance of the
> > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > 
> > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > should be responsible of doing time-division multiplexing across
> > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > userspace would require some sort of daemon to work properly across
> > > > processes. I also think the kernel is better place for doing resource
> > > > access scheduling in general.
> > > 
> > > I agree with that yes. We always have a single m2m context and specific
> > > controls per opened device so keeping cached values works out well.
> > > 
> > > So maybe we shall explicitly require that the request with the first
> > > slice for a frame also contains the per-frame controls.
> > > 
> > > > > - Annex-B formats
> > > > > 
> > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > the full slice NALU with start code, where the slice_header is
> > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > slice data and the parsed slice header control.
> > > > > 
> > > > > My initial thinking was that we'd need 3 formats:
> > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > header and start code);
> > > > > - One that takes both the NALU data (including start code, raw header
> > > > > and compressed data) and slice header controls;
> > > > > - One that takes the NALU data but no slice header.
> > > > > 
> > > > > But I no longer think the latter really makes sense in the context of
> > > > > stateless video decoding.
> > > > > 
> > > > > A side-note: I think we should definitely have data offsets in every
> > > > > case, so that implementations can just push the whole NALU regardless
> > > > > of the format if they're lazy.
> > > > 
> > > > I realize that I didn't share our latest research on the subject. So a
> > > > slice in the original bitstream is formed of the following blocks
> > > > (simplified):
> > > > 
> > > >   [nal_header][nal_type][slice_header][slice]
> > > 
> > > Thanks for the details!
> > > 
> > > > nal_header:
> > > > This one is a header used to locate the start and the end of the of a
> > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > totally valid start code. The second form is the AVC form, notably used
> > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > location.
> > > > 
> > > > nal_type:
> > > > It's a bit more then just the type, but it contains at least the
> > > > information of the nal type. This has different size on H.264 and HEVC
> > > > but I know it's size is in bytes.
> > > > 
> > > > slice_header:
> > > > This contains per slice parameters, like the modification lists to
> > > > apply on the references. This one has a size in bits, not in bytes.
> > > > 
> > > > slice:
> > > > I don't really know what is in it exactly, but this is the data used to
> > > > decode. This bit has a special coding called the anti-emulation, which
> > > > prevents a start-code from appearing in it. This coding is present in
> > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > call ANNEX-B the bytestream format).
> > > > 
> > > > So, what we notice is that what is currently passed through Cedrus
> > > > driver:
> > > >   [nal_type][slice_header][slice]
> > > > 
> > > > This matches what is being passed through VA-API. We can understand
> > > > that stripping off the slice_header would be hard, since it's size is
> > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > > 
> > > True, there is that.
> > > 
> > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > side effect, the bitstream becomes identical. We do now know that it
> > > > works with the ffmpeg branch implement for cedrus.
> > > 
> > > Oh great, that makes life easier in the short term, but I guess the
> > > issue could arise on another decoder sooner or later.
> > > 
> > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > take care for us of reading and executing the modification lists found
> > > > in the slice header. Mostly because I very disliked having to pass the
> > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > following section is the execution of the modification list. As this
> > > > list is not modified, it only need to be calculated per frame. As a
> > > > result, we don't need these new lists, and we can work with the same
> > > > H264_SLICE format as Cedrus is using.
> > > 
> > > Yes but I definitely think it makes more sense to pass the list
> > > modifications rather than reconstructing those in the driver from a
> > > full list. IMO controls should stick to the bitstream as close as
> > > possible.
> > 
> > For Hantro and RKVDEC, the list of modification is parsed by the IP
> > from the slice header bits. Just to make sure, because I myself was
> > confused on this before, the slice header does not contain a list of
> > references, instead it contains a list modification to be applied to
> > the reference list. I need to check again, but to execute these
> > modification, you need to filter and sort the references in a specific
> > order. This should be what is defined in the spec as 8.2.4.1 and
> > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > 
> > The list of references is deduced from the DPB. The DPB, which I thinks
> > should be rename as "references", seems more useful then p/b0/b1, since
> > this is the data that gives use the ability to implementing glue in the
> > driver to compensate some HW differences.
> > 
> > In the case of Hantro / RKVDEC, we think it's natural to build the HW
> > specific lists (p/b0/b1) from the references rather then adding HW
> > specific list in the decode_params structure. The fact these lists are
> > standard intermediate step of the standard is not that important.
> 
> Sorry I got confused (once more) about it. Boris just explained the
> same thing to me over IRC :) Anyway my point is that we want to pass
> what's in ffmpeg's short and long term ref lists, and name them that
> instead of dpb.
> 
> > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > is that you give it one or more slice per buffer, setup the params,
> > > > start decoding, but the decoder then return the location of the
> > > > following NAL. So basically you could offload the scanning of start
> > > > code to the HW. That being said, with the driver layer in between, that
> > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > have 1 interrupt per frame.
> > > 
> > > I'm not sure there is any interest in exposing that from userspace and
> > > my current feeling is that we should just ditch support for per-frame
> > > decoding altogether. I think it mixes decoding with notions that are
> > > higher-level than decoding, but I agree it's a blurry line.
> > 
> > I'm not worried about this either. We can already support that by
> > copying the bitstream internally to the driver, though zero-copy with
> > this would require a new format, the one we talked about,
> > SLICE_ANNEX_B.
> 
> Right, but what I'm thinking about is making that the one and only
> format. The rationale is that it's always easier to just append a start
> code from userspace if needed. And we need a bit offset to the slice
> data part anyway, so it doesn't hurt to require a few extra bits to
> have the whole thing that will work in every situation.

What I'd like is to eventually allow zero-copy (aka userptr) into the
driver. If you make the start code mandatory, any decoding from ISOMP4
(.mp4, .mov) will require a full bitstream copy in userspace to add the
start code (unless you hack your allocation in your demuxer, but it's a
bit complicated since this code might come from two libraries). In
ISOMP4, you have an AVC header, which is just the size of the NAL that
follows.
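
For reference, a rough sketch of what walking such a sample looks like,
assuming the common 4-byte length field (the actual field size comes
from the avcC box):

  /* buf/size: one ISOMP4 sample; each NAL is prefixed by its length */
  size_t off = 0;
  while (off + 4 <= size) {
          uint32_t nal_size = ((uint32_t)buf[off] << 24) |
                              ((uint32_t)buf[off + 1] << 16) |
                              ((uint32_t)buf[off + 2] << 8) |
                               (uint32_t)buf[off + 3];
          /* buf + off + 4 is the NAL itself, with no start code in front */
          off += 4 + nal_size;
  }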

On the other hand, the data_offset thing is likely just something for
the RK3399 to handle; it does not affect RK3288, Cedrus or IMX8M.

> 
> To me the breaking point was about having the slice header both in raw
> bitstream and parsed forms. Since we agree that's fine, we might as
> well push it to its logical conclusion and include all the bits that
> can be useful.

To take your words, the bits that contain useful information start
from the NAL type byte, exactly where the data was cut by VA-API and
the current uAPI.
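
(And that first byte is indeed trivial to interpret in place; a quick
sketch, masks as per the specs:)

  /* b0: first byte after the start code / length prefix */
  h264_nal_type = b0 & 0x1f;         /* H.264: 5-bit nal_unit_type */
  hevc_nal_type = (b0 >> 1) & 0x3f;  /* HEVC: 6-bit nal_unit_type  */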

> 
> > > > But it also support slice mode, with an
> > > > interrupt per slice, which is what we decided to use.
> > > 
> > > Easier for everyone and probably better for latency as well :)
> > > 
> > > > So in this case, indeed we strictly require on start-code. Though, to
> > > > me this is not a great reason to make a new fourcc, so we will try and
> > > > use (data_offset = 3) in order to make some space for that start code,
> > > > and write it down in the driver. This is to be continued, we will
> > > > report back on this later. This could have some side effect in the
> > > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > > on the encoded size and just copy anyway.
> > > > 
> > > > To my opinion, having a single format is a big deal, since userspace
> > > > will generally be developed for one specific HW and we would endup with
> > > > fragmented support. What we really want to achieve is having a driver
> > > > interface which works across multiple HW, and I think this is quite
> > > > possible.
> > > 
> > > I agree with that. The more I think about it, the more I believe we
> > > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > > and the parsed list in every scenario.
> > 
> > What I like of the cut at nal_type, is that there is only format. If we
> > cut at nal_header, then we need to expose 2 formats. And it makes our
> > API similar to other accelerator API, so it's easy to "convert"
> > existing userspace.
> 
> Unless we make that cut the single one and only true cut that shall
> supersed all other cuts :)

That's basically what I've been trying to do, kill this _RAW/ANNEX_B
thing and go back to our first idea.

> 
> > > For H.265, our decoder needs some information from the NAL type too.
> > > We currently extract that in userspace and stick it to the
> > > slice_header, but maybe it would make more sense to have drivers parse
> > > that info from the buffer if they need it. On the other hand, it seems
> > > quite common to pass information from the NAL type, so maybe we should
> > > either make a new control for it or have all the fields in the
> > > slice_header (which would still be wrong in terms of matching bitstream
> > > description).
> > 
> > Even in userspace, it's common to just parse this in place, it's a
> > simple mask. But yes, if we don't have it yet, we should expose the NAL
> > type, it would be cleaner.
> 
> Right, works for me.

Ack.

> 
> Cheers,
> 
> Paul
> 
> > > > > - Dropping the DPB concept in H.264/H.265
> > > > > 
> > > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > > concept that only makes sense relative to a decoder implementation. The
> > > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > > (Annex C), but that's about it.
> > > > > 
> > > > > What's really in the bitstream is the list of modified short-term and
> > > > > long-term references, which is enough for every decoder.
> > > > > 
> > > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > > the controls and just pass these lists agremented with relevant
> > > > > information for userspace.
> > > > > 
> > > > > I think it should be up to the driver to maintain a DPB and we could
> > > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > > keeping in particular and can manage with the lists from the bitstream
> > > > > directly.
> > > > 
> > > > As discusses today, we still need to pass that list. It's being index
> > > > by the HW to retrieve the extra information we have collected about the
> > > > status of the reference frames. In the case of Hantro, which process
> > > > the modification list from the slice header for us, we also need that
> > > > list to construct the unmodified list.
> > > > 
> > > > So the problem here is just a naming problem. That list is not really a
> > > > DPB. It is just the list of long-term/short-term references with the
> > > > status of these references. So maybe we could just rename as
> > > > references/reference_entry ?
> > > 
> > > What I'd like to pass is the diff to the references list, as ffmpeg
> > > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > > functional change here, only that we should stop calling it a DPB,
> > > which confuses everyone.
> > 
> > Yes.
> > 
> > > > > - Using flags
> > > > > 
> > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > represented as flags. Using flags also helps with padding.
> > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > 
> > > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > > 
> > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > > 
> > > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > > alignment constraints (we're dealing with 64-bit platforms that need to
> > > have 32-bit userspace and compat_ioctl).
> > 
> > I see, thanks.
> > 
> > > > > - Clear split of controls and terminology
> > > > > 
> > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > bitstream element names for those.
> > > > > 
> > > > > For H.264, that would suggest the following changes:
> > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > 
> > > > Oops, I think you meant slice_prams ? decode_params matches the
> > > > information found in SPS/PPS (combined?), while slice_params matches
> > > > the information extracted (and executed in case of l0/l1) from the
> > > > slice headers.
> > > 
> > > Yes you're right, I mixed them up.
> > > 
> > > >  That being said, to me this name wasn't confusing, since
> > > > it's not just the slice header, and it's per slice.
> > > 
> > > Mhh, what exactly remains in there and where does it originate in the
> > > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > > group of bitstream elements.
> > > 
> > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > where they belong, which seems to be slice_header;
> > > > 
> > > > There reference list is only updated by userspace (through it's DPB)
> > > > base on the result of the last decoding step. I was very confused for a
> > > > moment until I realize that the lists in the slice_header are just a
> > > > list of modification to apply to the reference list in order to produce
> > > > l0 and l1.
> > > 
> > > Indeed, and I'm suggesting that we pass the modifications only, which
> > > would fit a slice_header control.
> > 
> > I think I made my point why we want the dpb -> references. I'm going to
> > validate with the VA driver now, to see if the references list there is
> > usable with our code.
> > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > > I'm up for preparing and submitting these control changes and updating
> > > > > cedrus if they seem agreeable.
> > > > > 
> > > > > What do you think?
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > > Paul
> > > > > 
> > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > > 
> 
> 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-16 18:24         ` Nicolas Dufresne
@ 2019-05-16 18:45           ` Paul Kocialkowski
  2019-05-17 20:43             ` Nicolas Dufresne
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-16 18:45 UTC (permalink / raw)
  To: Nicolas Dufresne, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

Hi,

Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > better idea of what the situation is like on platforms other than
> > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > 
> > > > > > - Per-slice decoding
> > > > > > 
> > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > have them submitted as individual requests.
> > > > > > 
> > > > > > One question is what to do about other controls. I feel like it would
> > > > > > make sense to always pass all the required controls for decoding the
> > > > > > slice, including the ones that don't change across slices. But there
> > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > 
> > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > should be responsible of doing time-division multiplexing across
> > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > userspace would require some sort of daemon to work properly across
> > > > > processes. I also think the kernel is better place for doing resource
> > > > > access scheduling in general.
> > > > 
> > > > I agree with that yes. We always have a single m2m context and specific
> > > > controls per opened device so keeping cached values works out well.
> > > > 
> > > > So maybe we shall explicitly require that the request with the first
> > > > slice for a frame also contains the per-frame controls.
> > > > 
> > > > > > - Annex-B formats
> > > > > > 
> > > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > > the full slice NALU with start code, where the slice_header is
> > > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > > slice data and the parsed slice header control.
> > > > > > 
> > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > > header and start code);
> > > > > > - One that takes both the NALU data (including start code, raw header
> > > > > > and compressed data) and slice header controls;
> > > > > > - One that takes the NALU data but no slice header.
> > > > > > 
> > > > > > But I no longer think the latter really makes sense in the context of
> > > > > > stateless video decoding.
> > > > > > 
> > > > > > A side-note: I think we should definitely have data offsets in every
> > > > > > case, so that implementations can just push the whole NALU regardless
> > > > > > of the format if they're lazy.
> > > > > 
> > > > > I realize that I didn't share our latest research on the subject. So a
> > > > > slice in the original bitstream is formed of the following blocks
> > > > > (simplified):
> > > > > 
> > > > >   [nal_header][nal_type][slice_header][slice]
> > > > 
> > > > Thanks for the details!
> > > > 
> > > > > nal_header:
> > > > > This one is a header used to locate the start and the end of the of a
> > > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > > totally valid start code. The second form is the AVC form, notably used
> > > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > > location.
> > > > > 
> > > > > nal_type:
> > > > > It's a bit more then just the type, but it contains at least the
> > > > > information of the nal type. This has different size on H.264 and HEVC
> > > > > but I know it's size is in bytes.
> > > > > 
> > > > > slice_header:
> > > > > This contains per slice parameters, like the modification lists to
> > > > > apply on the references. This one has a size in bits, not in bytes.
> > > > > 
> > > > > slice:
> > > > > I don't really know what is in it exactly, but this is the data used to
> > > > > decode. This bit has a special coding called the anti-emulation, which
> > > > > prevents a start-code from appearing in it. This coding is present in
> > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > > call ANNEX-B the bytestream format).
> > > > > 
> > > > > So, what we notice is that what is currently passed through Cedrus
> > > > > driver:
> > > > >   [nal_type][slice_header][slice]
> > > > > 
> > > > > This matches what is being passed through VA-API. We can understand
> > > > > that stripping off the slice_header would be hard, since it's size is
> > > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > > > 
> > > > True, there is that.
> > > > 
> > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > > side effect, the bitstream becomes identical. We do now know that it
> > > > > works with the ffmpeg branch implement for cedrus.
> > > > 
> > > > Oh great, that makes life easier in the short term, but I guess the
> > > > issue could arise on another decoder sooner or later.
> > > > 
> > > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > > take care for us of reading and executing the modification lists found
> > > > > in the slice header. Mostly because I very disliked having to pass the
> > > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > > following section is the execution of the modification list. As this
> > > > > list is not modified, it only need to be calculated per frame. As a
> > > > > result, we don't need these new lists, and we can work with the same
> > > > > H264_SLICE format as Cedrus is using.
> > > > 
> > > > Yes but I definitely think it makes more sense to pass the list
> > > > modifications rather than reconstructing those in the driver from a
> > > > full list. IMO controls should stick to the bitstream as close as
> > > > possible.
> > > 
> > > For Hantro and RKVDEC, the list of modification is parsed by the IP
> > > from the slice header bits. Just to make sure, because I myself was
> > > confused on this before, the slice header does not contain a list of
> > > references, instead it contains a list modification to be applied to
> > > the reference list. I need to check again, but to execute these
> > > modification, you need to filter and sort the references in a specific
> > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > 
> > > The list of references is deduced from the DPB. The DPB, which I thinks
> > > should be rename as "references", seems more useful then p/b0/b1, since
> > > this is the data that gives use the ability to implementing glue in the
> > > driver to compensate some HW differences.
> > > 
> > > In the case of Hantro / RKVDEC, we think it's natural to build the HW
> > > specific lists (p/b0/b1) from the references rather then adding HW
> > > specific list in the decode_params structure. The fact these lists are
> > > standard intermediate step of the standard is not that important.
> > 
> > Sorry I got confused (once more) about it. Boris just explained the
> > same thing to me over IRC :) Anyway my point is that we want to pass
> > what's in ffmpeg's short and long term ref lists, and name them that
> > instead of dpb.
> > 
> > > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > > is that you give it one or more slice per buffer, setup the params,
> > > > > start decoding, but the decoder then return the location of the
> > > > > following NAL. So basically you could offload the scanning of start
> > > > > code to the HW. That being said, with the driver layer in between, that
> > > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > > have 1 interrupt per frame.
> > > > 
> > > > I'm not sure there is any interest in exposing that from userspace and
> > > > my current feeling is that we should just ditch support for per-frame
> > > > decoding altogether. I think it mixes decoding with notions that are
> > > > higher-level than decoding, but I agree it's a blurry line.
> > > 
> > > I'm not worried about this either. We can already support that by
> > > copying the bitstream internally to the driver, though zero-copy with
> > > this would require a new format, the one we talked about,
> > > SLICE_ANNEX_B.
> > 
> > Right, but what I'm thinking about is making that the one and only
> > format. The rationale is that it's always easier to just append a start
> > code from userspace if needed. And we need a bit offset to the slice
> > data part anyway, so it doesn't hurt to require a few extra bits to
> > have the whole thing that will work in every situation.
> 
> What I'd like is to eventually allow zero-copy (aka userptr) into the
> driver. If you make the start code mandatory, any decoding from ISOMP4
> (.mp4, .mov) will require a full bitstream copy in userspace to add the
> start code (unless you hack your allocation in your demuxer, but it's a
> bit complicated since this code might come from two libraries). In
> ISOMP4, you have an AVC header, which is just the size of the NAL that
> follows.

Well, I think we have to do a copy from system memory to the buffer
allocated by v4l2 anyway. Our hardware pipelines can reasonably be
expected not to have any MMU unit and not allow sg import anyway.

So with that in mind, asking userspace to add a startcode it already
knows doesn't seem to be asking too much.

> On the other end, the data_offset thing is likely just a thing for the
> RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.

Well, I think it's best to be fool-proof here and just require that
start code. We should also have per-slice bit offsets to the different
parts anyway, so drivers that don't need the start code can just
ignore it.

In extreme cases where there is some interest in doing direct buffer
import without doing a copy in userspace, userspace could trick the
format and avoid a copy by not providing the start-code (assuming it
knows it doesn't need it) and specifying the bit offsets accordingly.
That'd be a hack for better performance, and it feels better to do
things in this order rather than having to hack around in the drivers
that need the start code in every other case.
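
To illustrate with hypothetical field names (not necessarily the final
uAPI, all names below are placeholders, and ignoring emulation
prevention bytes for simplicity), the per-slice offsets could be
filled like this:

  /* Layout: [start code][nal header][slice_header()][slice_data()] */
  slice_params.size = start_code_len + nalu_size;
  slice_params.header_bit_size = header_bits;
  slice_params.data_bit_offset =
          8 * (start_code_len + nal_header_len) + header_bits;

  /* The "no start code" trick just means start_code_len = 0 here. */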

> > To me the breaking point was about having the slice header both in raw
> > bitstream and parsed forms. Since we agree that's fine, we might as
> > well push it to its logical conclusion and include all the bits that
> > can be useful.
> 
> To take your words, the bits that contain useful information starts
> from the NAL type byte, exactly were the data was cut by VA-API and the
> current uAPI.

Agreed, but I think that the advantages of always requiring the start
code outweigh the potential (yet quite unlikely) downsides.

> > > > > But it also support slice mode, with an
> > > > > interrupt per slice, which is what we decided to use.
> > > > 
> > > > Easier for everyone and probably better for latency as well :)
> > > > 
> > > > > So in this case, indeed we strictly require on start-code. Though, to
> > > > > me this is not a great reason to make a new fourcc, so we will try and
> > > > > use (data_offset = 3) in order to make some space for that start code,
> > > > > and write it down in the driver. This is to be continued, we will
> > > > > report back on this later. This could have some side effect in the
> > > > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > > > on the encoded size and just copy anyway.
> > > > > 
> > > > > To my opinion, having a single format is a big deal, since userspace
> > > > > will generally be developed for one specific HW and we would endup with
> > > > > fragmented support. What we really want to achieve is having a driver
> > > > > interface which works across multiple HW, and I think this is quite
> > > > > possible.
> > > > 
> > > > I agree with that. The more I think about it, the more I believe we
> > > > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > > > and the parsed list in every scenario.
> > > 
> > > What I like of the cut at nal_type, is that there is only format. If we
> > > cut at nal_header, then we need to expose 2 formats. And it makes our
> > > API similar to other accelerator API, so it's easy to "convert"
> > > existing userspace.
> > 
> > Unless we make that cut the single one and only true cut that shall
> > supersed all other cuts :)
> 
> That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> thing and go back to our first idea.

Right, in the end I think we should go with:
V4L2_PIX_FMT_MPEG2_SLICE
V4L2_PIX_FMT_H264_SLICE
V4L2_PIX_FMT_HEVC_SLICE

And just require raw bitstream for the slice with emulation-prevention
bits included.
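
On the OUTPUT queue, userspace would then just do something like this
(sketch, error handling omitted, video_fd and coded_width/height
assumed to be set up already):

  struct v4l2_format fmt = { 0 };

  fmt.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
  fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H264_SLICE;
  fmt.fmt.pix_mp.width = coded_width;
  fmt.fmt.pix_mp.height = coded_height;
  fmt.fmt.pix_mp.num_planes = 1;
  ioctl(video_fd, VIDIOC_S_FMT, &fmt);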

Cheers,

Paul

> > > > For H.265, our decoder needs some information from the NAL type too.
> > > > We currently extract that in userspace and stick it to the
> > > > slice_header, but maybe it would make more sense to have drivers parse
> > > > that info from the buffer if they need it. On the other hand, it seems
> > > > quite common to pass information from the NAL type, so maybe we should
> > > > either make a new control for it or have all the fields in the
> > > > slice_header (which would still be wrong in terms of matching bitstream
> > > > description).
> > > 
> > > Even in userspace, it's common to just parse this in place, it's a
> > > simple mask. But yes, if we don't have it yet, we should expose the NAL
> > > type, it would be cleaner.
> > 
> > Right, works for me.
> 
> Ack.
> 
> > Cheers,
> > 
> > Paul
> > 
> > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > 
> > > > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > > > concept that only makes sense relative to a decoder implementation. The
> > > > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > > > (Annex C), but that's about it.
> > > > > > 
> > > > > > What's really in the bitstream is the list of modified short-term and
> > > > > > long-term references, which is enough for every decoder.
> > > > > > 
> > > > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > > > the controls and just pass these lists agremented with relevant
> > > > > > information for userspace.
> > > > > > 
> > > > > > I think it should be up to the driver to maintain a DPB and we could
> > > > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > > > keeping in particular and can manage with the lists from the bitstream
> > > > > > directly.
> > > > > 
> > > > > As discusses today, we still need to pass that list. It's being index
> > > > > by the HW to retrieve the extra information we have collected about the
> > > > > status of the reference frames. In the case of Hantro, which process
> > > > > the modification list from the slice header for us, we also need that
> > > > > list to construct the unmodified list.
> > > > > 
> > > > > So the problem here is just a naming problem. That list is not really a
> > > > > DPB. It is just the list of long-term/short-term references with the
> > > > > status of these references. So maybe we could just rename as
> > > > > references/reference_entry ?
> > > > 
> > > > What I'd like to pass is the diff to the references list, as ffmpeg
> > > > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > > > functional change here, only that we should stop calling it a DPB,
> > > > which confuses everyone.
> > > 
> > > Yes.
> > > 
> > > > > > - Using flags
> > > > > > 
> > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > represented as flags. Using flags also helps with padding.
> > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > 
> > > > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > > > 
> > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > > > 
> > > > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > > > alignment constraints (we're dealing with 64-bit platforms that need to
> > > > have 32-bit userspace and compat_ioctl).
> > > 
> > > I see, thanks.
> > > 
> > > > > > - Clear split of controls and terminology
> > > > > > 
> > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > bitstream element names for those.
> > > > > > 
> > > > > > For H.264, that would suggest the following changes:
> > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > 
> > > > > Oops, I think you meant slice_prams ? decode_params matches the
> > > > > information found in SPS/PPS (combined?), while slice_params matches
> > > > > the information extracted (and executed in case of l0/l1) from the
> > > > > slice headers.
> > > > 
> > > > Yes you're right, I mixed them up.
> > > > 
> > > > >  That being said, to me this name wasn't confusing, since
> > > > > it's not just the slice header, and it's per slice.
> > > > 
> > > > Mhh, what exactly remains in there and where does it originate in the
> > > > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > > > group of bitstream elements.
> > > > 
> > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > where they belong, which seems to be slice_header;
> > > > > 
> > > > > There reference list is only updated by userspace (through it's DPB)
> > > > > base on the result of the last decoding step. I was very confused for a
> > > > > moment until I realize that the lists in the slice_header are just a
> > > > > list of modification to apply to the reference list in order to produce
> > > > > l0 and l1.
> > > > 
> > > > Indeed, and I'm suggesting that we pass the modifications only, which
> > > > would fit a slice_header control.
> > > 
> > > I think I made my point why we want the dpb -> references. I'm going to
> > > validate with the VA driver now, to see if the references list there is
> > > usable with our code.
> > > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > > > I'm up for preparing and submitting these control changes and updating
> > > > > > cedrus if they seem agreeable.
> > > > > > 
> > > > > > What do you think?
> > > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Paul
> > > > > > 
> > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > > > 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-16 18:45           ` Paul Kocialkowski
@ 2019-05-17 20:43             ` Nicolas Dufresne
  2019-05-18  9:50               ` Paul Kocialkowski
  0 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-17 20:43 UTC (permalink / raw)
  To: Paul Kocialkowski, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 25322 bytes --]

Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > > better idea of what the situation is like on platforms other than
> > > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > > 
> > > > > > > - Per-slice decoding
> > > > > > > 
> > > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > > have them submitted as individual requests.
> > > > > > > 
> > > > > > > One question is what to do about other controls. I feel like it would
> > > > > > > make sense to always pass all the required controls for decoding the
> > > > > > > slice, including the ones that don't change across slices. But there
> > > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > 
> > > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > > should be responsible of doing time-division multiplexing across
> > > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > > userspace would require some sort of daemon to work properly across
> > > > > > processes. I also think the kernel is better place for doing resource
> > > > > > access scheduling in general.
> > > > > 
> > > > > I agree with that yes. We always have a single m2m context and specific
> > > > > controls per opened device so keeping cached values works out well.
> > > > > 
> > > > > So maybe we shall explicitly require that the request with the first
> > > > > slice for a frame also contains the per-frame controls.
> > > > > 
> > > > > > > - Annex-B formats
> > > > > > > 
> > > > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > > > the full slice NALU with start code, where the slice_header is
> > > > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > > > slice data and the parsed slice header control.
> > > > > > > 
> > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > > > header and start code);
> > > > > > > - One that takes both the NALU data (including start code, raw header
> > > > > > > and compressed data) and slice header controls;
> > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > 
> > > > > > > But I no longer think the latter really makes sense in the context of
> > > > > > > stateless video decoding.
> > > > > > > 
> > > > > > > A side-note: I think we should definitely have data offsets in every
> > > > > > > case, so that implementations can just push the whole NALU regardless
> > > > > > > of the format if they're lazy.
> > > > > > 
> > > > > > I realize that I didn't share our latest research on the subject. So a
> > > > > > slice in the original bitstream is formed of the following blocks
> > > > > > (simplified):
> > > > > > 
> > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > 
> > > > > Thanks for the details!
> > > > > 
> > > > > > nal_header:
> > > > > > This one is a header used to locate the start and the end of the of a
> > > > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > > > totally valid start code. The second form is the AVC form, notably used
> > > > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > > > location.
> > > > > > 
> > > > > > nal_type:
> > > > > > It's a bit more then just the type, but it contains at least the
> > > > > > information of the nal type. This has different size on H.264 and HEVC
> > > > > > but I know it's size is in bytes.
> > > > > > 
> > > > > > slice_header:
> > > > > > This contains per slice parameters, like the modification lists to
> > > > > > apply on the references. This one has a size in bits, not in bytes.
> > > > > > 
> > > > > > slice:
> > > > > > I don't really know what is in it exactly, but this is the data used to
> > > > > > decode. This bit has a special coding called the anti-emulation, which
> > > > > > prevents a start-code from appearing in it. This coding is present in
> > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > > > call ANNEX-B the bytestream format).
> > > > > > 
> > > > > > So, what we notice is that what is currently passed through Cedrus
> > > > > > driver:
> > > > > >   [nal_type][slice_header][slice]
> > > > > > 
> > > > > > This matches what is being passed through VA-API. We can understand
> > > > > > that stripping off the slice_header would be hard, since it's size is
> > > > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > > > > 
> > > > > True, there is that.
> > > > > 
> > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > > > side effect, the bitstream becomes identical. We do now know that it
> > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > 
> > > > > Oh great, that makes life easier in the short term, but I guess the
> > > > > issue could arise on another decoder sooner or later.
> > > > > 
> > > > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > > > take care for us of reading and executing the modification lists found
> > > > > > in the slice header. Mostly because I very disliked having to pass the
> > > > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > > > following section is the execution of the modification list. As this
> > > > > > list is not modified, it only need to be calculated per frame. As a
> > > > > > result, we don't need these new lists, and we can work with the same
> > > > > > H264_SLICE format as Cedrus is using.
> > > > > 
> > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > modifications rather than reconstructing those in the driver from a
> > > > > full list. IMO controls should stick to the bitstream as close as
> > > > > possible.
> > > > 
> > > > For Hantro and RKVDEC, the list of modification is parsed by the IP
> > > > from the slice header bits. Just to make sure, because I myself was
> > > > confused on this before, the slice header does not contain a list of
> > > > references, instead it contains a list modification to be applied to
> > > > the reference list. I need to check again, but to execute these
> > > > modification, you need to filter and sort the references in a specific
> > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > 
> > > > The list of references is deduced from the DPB. The DPB, which I thinks
> > > > should be rename as "references", seems more useful then p/b0/b1, since
> > > > this is the data that gives use the ability to implementing glue in the
> > > > driver to compensate some HW differences.
> > > > 
> > > > In the case of Hantro / RKVDEC, we think it's natural to build the HW
> > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > specific list in the decode_params structure. The fact these lists are
> > > > standard intermediate step of the standard is not that important.
> > > 
> > > Sorry I got confused (once more) about it. Boris just explained the
> > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > what's in ffmpeg's short and long term ref lists, and name them that
> > > instead of dpb.
> > > 
> > > > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > > > is that you give it one or more slice per buffer, setup the params,
> > > > > > start decoding, but the decoder then return the location of the
> > > > > > following NAL. So basically you could offload the scanning of start
> > > > > > code to the HW. That being said, with the driver layer in between, that
> > > > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > > > have 1 interrupt per frame.
> > > > > 
> > > > > I'm not sure there is any interest in exposing that from userspace and
> > > > > my current feeling is that we should just ditch support for per-frame
> > > > > decoding altogether. I think it mixes decoding with notions that are
> > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > 
> > > > I'm not worried about this either. We can already support that by
> > > > copying the bitstream internally to the driver, though zero-copy with
> > > > this would require a new format, the one we talked about,
> > > > SLICE_ANNEX_B.
> > > 
> > > Right, but what I'm thinking about is making that the one and only
> > > format. The rationale is that it's always easier to just append a start
> > > code from userspace if needed. And we need a bit offset to the slice
> > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > have the whole thing that will work in every situation.
> > 
> > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > driver. If you make the start code mandatory, any decoding from ISOMP4
> > (.mp4, .mov) will require a full bitstream copy in userspace to add the
> > start code (unless you hack your allocation in your demuxer, but it's a
> > bit complicated since this code might come from two libraries). In
> > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > follows.
> 
> Well, I think we have to do a copy from system memory to the buffer
> allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> expected not to have any MMU unit and not allow sg import anyway.

The Rockchip has an MMU. You need at least one copy indeed, e.g. file
to mem, or UDP socket to mem. But right now, let's say with ffmpeg/mpeg-
ts: first you copy the MPEG TS to mem, then to demux you copy that H264
stream to another buffer, then you copy in the parser, removing the
start-code, and finally you copy into the accelerator, adding the start
code back. Even if the driver allowed userptr, it would be unusable.

GStreamer on the other side implements lazy conversion, so it would copy
the MPEG TS to mem, copy to demux, aggregate (with lazy merging) in the
parser (but the stream format is negotiated, so it keeps the start-code).
If you request alignment=au, you have full frames in buffers, so if your
driver could do userptr, you could save that extra copy.

Now, if we demux an MP4 it's the same: the parser will need to do a full
copy instead of lazy aggregation in order to prepend the start code
(since it had an AVC length header). But userptr could save a copy.
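
To make the caps negotiation above concrete, here is a minimal sketch
(in GStreamer C; the helper name is made up and this is not taken from
any existing element) of the parser output caps for the byte-stream /
alignment=au case:

#include <gst/gst.h>

/* Caps h264parse can negotiate downstream: start codes are kept
 * (stream-format=byte-stream) and each buffer carries a full access
 * unit (alignment=au); switching alignment to "nal" yields one buffer
 * per slice NAL instead. */
static GstCaps *annexb_au_caps(void)
{
    return gst_caps_from_string(
        "video/x-h264, stream-format=(string)byte-stream, "
        "alignment=(string)au");
}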

If the driver requires no NAL prefix, then we could just pass a
slightly forward pointer to userptr and avoid the AVC to ANNEX-B
conversion, which is a bit slower (even though it's nothing compared to
the full copies we already do).

That was my argument in favour of no NAL prefix in terms of efficiency,
and it does not prevent adding a control to enable start-codes for the
cases where it makes sense.
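
As a rough illustration of the two options discussed above, here is a
minimal sketch assuming a 4-byte big-endian AVC length field in front
of each NAL; the helper names are made up:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Option 1: the driver takes the NAL without any prefix. Just step
 * over the length field; this is the "slightly forward pointer",
 * userptr-friendly path with no copy and no rewrite. */
static const uint8_t *avc_nal_payload(const uint8_t *buf, size_t len,
                                      size_t *nal_len)
{
    if (len < 4)
        return NULL;
    *nal_len = ((size_t)buf[0] << 24) | ((size_t)buf[1] << 16) |
               ((size_t)buf[2] << 8) | (size_t)buf[3];
    if (*nal_len > len - 4)
        return NULL;
    return buf + 4; /* points at [nal_type][slice_header][slice] */
}

/* Option 2: the driver wants an ANNEX-B start code. If the buffer is
 * writable, the 4-byte length can be overwritten in place with
 * 0x00 0x00 0x00 0x01; otherwise a copy is unavoidable. */
static void avc_to_annexb_inplace(uint8_t *buf)
{
    static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };

    memcpy(buf, start_code, sizeof(start_code));
}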

> 
> So with that in mind, asking userspace to add a startcode it already
> knows doesn't seem to be asking too much.
> 
> > On the other end, the data_offset thing is likely just a thing for the
> > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> 
> Well, I think it's best to be fool-proof here and just require that
> start code. We should also have per-slice bit offsets to the different
> parts anyway, so drivers that don't need it can just ignore it.
> 
> In extreme cases where there is some interest in doing direct buffer
> import without doing a copy in userspace, userspace could trick the
> format and avoid a copy by not providing the start-code (assuming it
> knows it doesn't need it) and specifying the bit offsets accordingly.
> That'd be a hack for better performance, and it feels better to do
> things in this order rather than having to hack around in the drivers
> that need the start code in every other case.

So basically, you and Tomasz are both strongly in favour of adding the
ANNEX-B start-code to the current uAPI. I have dug into the Cedrus
registers, and it seems that it does have start-code scanning support.
I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
like the RK does. That requires the IP to deal with the modification
lists, which are per slice.

My question is: are you willing to adapt the Cedrus driver to support
receiving start-codes? And will this have a performance impact or not?
On the RK side, it's really just about flipping 1 bit.

On the Rockchip side, Tomasz had concerns about CPU wakeups and the
fact that we didn't aim at supporting passing multiple slices at once
to the IP (something the RK supports). It's important to understand
that multi-slice streams are relatively rare and mostly used for
low-latency / video conferencing, so aggregating in that case defeats
the purpose of using slices. So I think the RK feature is not very
important.

Of course, I do believe that long term we will want to expose both
stream formats on the RK (because the HW can do that), so that
userspace can just pick the best one when available. So that boils down
to our first idea: shall we expose _SLICE_A and _SLICE_B, or something
like this? Now that we have progressed on the matter, I'm quite in
favour of having _SLICE in the first place, as the preferred format
that everyone should support, and allowing for variants later. Now, if
we make one mandatory, we could also just have a menu control to allow
other formats.
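
A hedged sketch of what such a menu control could look like from
userspace; the control ID and menu values below are invented for
illustration and do not exist in the uAPI today:

#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Hypothetical control and menu values, for illustration only. */
#define V4L2_CID_H264_SLICE_START_CODE (V4L2_CID_MPEG_BASE + 650)
enum {
    SLICE_START_CODE_NONE    = 0, /* cut at nal_type, VA-API style */
    SLICE_START_CODE_ANNEX_B = 1, /* full NAL with start code */
};

static int select_start_code(int fd, int mode)
{
    struct v4l2_control ctrl = {
        .id    = V4L2_CID_H264_SLICE_START_CODE,
        .value = mode,
    };

    /* A driver that only supports the mandatory variant would simply
     * refuse the other value here. */
    return ioctl(fd, VIDIOC_S_CTRL, &ctrl);
}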

> 
> > > To me the breaking point was about having the slice header both in raw
> > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > well push it to its logical conclusion and include all the bits that
> > > can be useful.
> > 
> > To take your words, the bits that contain useful information starts
> > from the NAL type byte, exactly were the data was cut by VA-API and the
> > current uAPI.
> 
> Agreed, but I think that the advantages of always requiring the start
> code outweigh the potential (yet quite unlikely) downsides.
> 
> > > > > > But it also support slice mode, with an
> > > > > > interrupt per slice, which is what we decided to use.
> > > > > 
> > > > > Easier for everyone and probably better for latency as well :)
> > > > > 
> > > > > > So in this case, indeed we strictly require on start-code. Though, to
> > > > > > me this is not a great reason to make a new fourcc, so we will try and
> > > > > > use (data_offset = 3) in order to make some space for that start code,
> > > > > > and write it down in the driver. This is to be continued, we will
> > > > > > report back on this later. This could have some side effect in the
> > > > > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > > > > on the encoded size and just copy anyway.
> > > > > > 
> > > > > > To my opinion, having a single format is a big deal, since userspace
> > > > > > will generally be developed for one specific HW and we would endup with
> > > > > > fragmented support. What we really want to achieve is having a driver
> > > > > > interface which works across multiple HW, and I think this is quite
> > > > > > possible.
> > > > > 
> > > > > I agree with that. The more I think about it, the more I believe we
> > > > > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > > > > and the parsed list in every scenario.
> > > > 
> > > > What I like of the cut at nal_type, is that there is only format. If we
> > > > cut at nal_header, then we need to expose 2 formats. And it makes our
> > > > API similar to other accelerator API, so it's easy to "convert"
> > > > existing userspace.
> > > 
> > > Unless we make that cut the single one and only true cut that shall
> > > supersed all other cuts :)
> > 
> > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > thing and go back to our first idea.
> 
> Right, in the end I think we should go with:
> V4L2_PIX_FMT_MPEG2_SLICE
> V4L2_PIX_FMT_H264_SLICE
> V4L2_PIX_FMT_HEVC_SLICE
> 
> And just require raw bitstream for the slice with emulation-prevention
> bits included.

That should be the set of formats we start with indeed: the single
format for which software gets written and tested, making sure software
support is not fragmented, while other variants should be something to
opt into.
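
For completeness, selecting one of these formats on the OUTPUT
(bitstream) queue would look roughly like the sketch below; the fourcc
is spelled out by hand since the final header name for
V4L2_PIX_FMT_H264_SLICE may differ:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int set_h264_slice_format(int fd, unsigned int max_slice_size)
{
    struct v4l2_format fmt;

    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
    /* Stand-in for the proposed V4L2_PIX_FMT_H264_SLICE fourcc. */
    fmt.fmt.pix_mp.pixelformat = v4l2_fourcc('S', '2', '6', '4');
    fmt.fmt.pix_mp.num_planes = 1;
    fmt.fmt.pix_mp.plane_fmt[0].sizeimage = max_slice_size;

    return ioctl(fd, VIDIOC_S_FMT, &fmt);
}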

> 
> Cheers,
> 
> Paul
> 
> > > > > For H.265, our decoder needs some information from the NAL type too.
> > > > > We currently extract that in userspace and stick it to the
> > > > > slice_header, but maybe it would make more sense to have drivers parse
> > > > > that info from the buffer if they need it. On the other hand, it seems
> > > > > quite common to pass information from the NAL type, so maybe we should
> > > > > either make a new control for it or have all the fields in the
> > > > > slice_header (which would still be wrong in terms of matching bitstream
> > > > > description).
> > > > 
> > > > Even in userspace, it's common to just parse this in place, it's a
> > > > simple mask. But yes, if we don't have it yet, we should expose the NAL
> > > > type, it would be cleaner.
> > > 
> > > Right, works for me.
> > 
> > Ack.
> > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > 
> > > > > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > > > > concept that only makes sense relative to a decoder implementation. The
> > > > > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > > > > (Annex C), but that's about it.
> > > > > > > 
> > > > > > > What's really in the bitstream is the list of modified short-term and
> > > > > > > long-term references, which is enough for every decoder.
> > > > > > > 
> > > > > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > > > > the controls and just pass these lists agremented with relevant
> > > > > > > information for userspace.
> > > > > > > 
> > > > > > > I think it should be up to the driver to maintain a DPB and we could
> > > > > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > > > > keeping in particular and can manage with the lists from the bitstream
> > > > > > > directly.
> > > > > > 
> > > > > > As discusses today, we still need to pass that list. It's being index
> > > > > > by the HW to retrieve the extra information we have collected about the
> > > > > > status of the reference frames. In the case of Hantro, which process
> > > > > > the modification list from the slice header for us, we also need that
> > > > > > list to construct the unmodified list.
> > > > > > 
> > > > > > So the problem here is just a naming problem. That list is not really a
> > > > > > DPB. It is just the list of long-term/short-term references with the
> > > > > > status of these references. So maybe we could just rename as
> > > > > > references/reference_entry ?
> > > > > 
> > > > > What I'd like to pass is the diff to the references list, as ffmpeg
> > > > > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > > > > functional change here, only that we should stop calling it a DPB,
> > > > > which confuses everyone.
> > > > 
> > > > Yes.
> > > > 
> > > > > > > - Using flags
> > > > > > > 
> > > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > > 
> > > > > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > > > > 
> > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > > > > 
> > > > > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > > > > alignment constraints (we're dealing with 64-bit platforms that need to
> > > > > have 32-bit userspace and compat_ioctl).
> > > > 
> > > > I see, thanks.
> > > > 
> > > > > > > - Clear split of controls and terminology
> > > > > > > 
> > > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > > bitstream element names for those.
> > > > > > > 
> > > > > > > For H.264, that would suggest the following changes:
> > > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > > 
> > > > > > Oops, I think you meant slice_prams ? decode_params matches the
> > > > > > information found in SPS/PPS (combined?), while slice_params matches
> > > > > > the information extracted (and executed in case of l0/l1) from the
> > > > > > slice headers.
> > > > > 
> > > > > Yes you're right, I mixed them up.
> > > > > 
> > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > it's not just the slice header, and it's per slice.
> > > > > 
> > > > > Mhh, what exactly remains in there and where does it originate in the
> > > > > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > > > > group of bitstream elements.
> > > > > 
> > > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > > where they belong, which seems to be slice_header;
> > > > > > 
> > > > > > There reference list is only updated by userspace (through it's DPB)
> > > > > > base on the result of the last decoding step. I was very confused for a
> > > > > > moment until I realize that the lists in the slice_header are just a
> > > > > > list of modification to apply to the reference list in order to produce
> > > > > > l0 and l1.
> > > > > 
> > > > > Indeed, and I'm suggesting that we pass the modifications only, which
> > > > > would fit a slice_header control.
> > > > 
> > > > I think I made my point why we want the dpb -> references. I'm going to
> > > > validate with the VA driver now, to see if the references list there is
> > > > usable with our code.
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > I'm up for preparing and submitting these control changes and updating
> > > > > > > cedrus if they seem agreeable.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > > > > 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-17 20:43             ` Nicolas Dufresne
@ 2019-05-18  9:50               ` Paul Kocialkowski
  2019-05-18 10:04                 ` Jernej Škrabec
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-18  9:50 UTC (permalink / raw)
  To: Nicolas Dufresne, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Thierry Reding, Jernej Skrabec, Ezequiel Garcia,
	Jonas Karlman

Hi,

On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > > > better idea of what the situation is like on platforms other than
> > > > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > > > 
> > > > > > > > - Per-slice decoding
> > > > > > > > 
> > > > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > > > have them submitted as individual requests.
> > > > > > > > 
> > > > > > > > One question is what to do about other controls. I feel like it would
> > > > > > > > make sense to always pass all the required controls for decoding the
> > > > > > > > slice, including the ones that don't change across slices. But there
> > > > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > 
> > > > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > > > should be responsible of doing time-division multiplexing across
> > > > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > > > userspace would require some sort of daemon to work properly across
> > > > > > > processes. I also think the kernel is better place for doing resource
> > > > > > > access scheduling in general.
> > > > > > 
> > > > > > I agree with that yes. We always have a single m2m context and specific
> > > > > > controls per opened device so keeping cached values works out well.
> > > > > > 
> > > > > > So maybe we shall explicitly require that the request with the first
> > > > > > slice for a frame also contains the per-frame controls.
> > > > > > 
> > > > > > > > - Annex-B formats
> > > > > > > > 
> > > > > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > > > > the full slice NALU with start code, where the slice_header is
> > > > > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > > > > slice data and the parsed slice header control.
> > > > > > > > 
> > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > > > > header and start code);
> > > > > > > > - One that takes both the NALU data (including start code, raw header
> > > > > > > > and compressed data) and slice header controls;
> > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > 
> > > > > > > > But I no longer think the latter really makes sense in the context of
> > > > > > > > stateless video decoding.
> > > > > > > > 
> > > > > > > > A side-note: I think we should definitely have data offsets in every
> > > > > > > > case, so that implementations can just push the whole NALU regardless
> > > > > > > > of the format if they're lazy.
> > > > > > > 
> > > > > > > I realize that I didn't share our latest research on the subject. So a
> > > > > > > slice in the original bitstream is formed of the following blocks
> > > > > > > (simplified):
> > > > > > > 
> > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > 
> > > > > > Thanks for the details!
> > > > > > 
> > > > > > > nal_header:
> > > > > > > This one is a header used to locate the start and the end of the of a
> > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > > > > totally valid start code. The second form is the AVC form, notably used
> > > > > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > > > > location.
> > > > > > > 
> > > > > > > nal_type:
> > > > > > > It's a bit more then just the type, but it contains at least the
> > > > > > > information of the nal type. This has different size on H.264 and HEVC
> > > > > > > but I know it's size is in bytes.
> > > > > > > 
> > > > > > > slice_header:
> > > > > > > This contains per slice parameters, like the modification lists to
> > > > > > > apply on the references. This one has a size in bits, not in bytes.
> > > > > > > 
> > > > > > > slice:
> > > > > > > I don't really know what is in it exactly, but this is the data used to
> > > > > > > decode. This bit has a special coding called the anti-emulation, which
> > > > > > > prevents a start-code from appearing in it. This coding is present in
> > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > > > > call ANNEX-B the bytestream format).
> > > > > > > 
> > > > > > > So, what we notice is that what is currently passed through Cedrus
> > > > > > > driver:
> > > > > > >   [nal_type][slice_header][slice]
> > > > > > > 
> > > > > > > This matches what is being passed through VA-API. We can understand
> > > > > > > that stripping off the slice_header would be hard, since it's size is
> > > > > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > > > > > 
> > > > > > True, there is that.
> > > > > > 
> > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > > > > side effect, the bitstream becomes identical. We do now know that it
> > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > 
> > > > > > Oh great, that makes life easier in the short term, but I guess the
> > > > > > issue could arise on another decoder sooner or later.
> > > > > > 
> > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > > > > take care for us of reading and executing the modification lists found
> > > > > > > in the slice header. Mostly because I very disliked having to pass the
> > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > > > > following section is the execution of the modification list. As this
> > > > > > > list is not modified, it only need to be calculated per frame. As a
> > > > > > > result, we don't need these new lists, and we can work with the same
> > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > 
> > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > modifications rather than reconstructing those in the driver from a
> > > > > > full list. IMO controls should stick to the bitstream as close as
> > > > > > possible.
> > > > > 
> > > > > For Hantro and RKVDEC, the list of modification is parsed by the IP
> > > > > from the slice header bits. Just to make sure, because I myself was
> > > > > confused on this before, the slice header does not contain a list of
> > > > > references, instead it contains a list modification to be applied to
> > > > > the reference list. I need to check again, but to execute these
> > > > > modification, you need to filter and sort the references in a specific
> > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > 
> > > > > The list of references is deduced from the DPB. The DPB, which I thinks
> > > > > should be rename as "references", seems more useful then p/b0/b1, since
> > > > > this is the data that gives use the ability to implementing glue in the
> > > > > driver to compensate some HW differences.
> > > > > 
> > > > > In the case of Hantro / RKVDEC, we think it's natural to build the HW
> > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > specific list in the decode_params structure. The fact these lists are
> > > > > standard intermediate step of the standard is not that important.
> > > > 
> > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > instead of dpb.
> > > > 
> > > > > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > > > > is that you give it one or more slice per buffer, setup the params,
> > > > > > > start decoding, but the decoder then return the location of the
> > > > > > > following NAL. So basically you could offload the scanning of start
> > > > > > > code to the HW. That being said, with the driver layer in between, that
> > > > > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > > > > have 1 interrupt per frame.
> > > > > > 
> > > > > > I'm not sure there is any interest in exposing that from userspace and
> > > > > > my current feeling is that we should just ditch support for per-frame
> > > > > > decoding altogether. I think it mixes decoding with notions that are
> > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > 
> > > > > I'm not worried about this either. We can already support that by
> > > > > copying the bitstream internally to the driver, though zero-copy with
> > > > > this would require a new format, the one we talked about,
> > > > > SLICE_ANNEX_B.
> > > > 
> > > > Right, but what I'm thinking about is making that the one and only
> > > > format. The rationale is that it's always easier to just append a start
> > > > code from userspace if needed. And we need a bit offset to the slice
> > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > have the whole thing that will work in every situation.
> > > 
> > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > (.mp4, .mov) will require a full bitstream copy in userspace to add the
> > > start code (unless you hack your allocation in your demuxer, but it's a
> > > bit complicated since this code might come from two libraries). In
> > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > follows.
> > 
> > Well, I think we have to do a copy from system memory to the buffer
> > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > expected not to have any MMU unit and not allow sg import anyway.
> 
> The Rockchip has an MMU. You need at least one copy indeed, 

Is the MMU in use currently? That can make things troublesome if we run
into a case where the VPU has MMU and deals with scatter-gather while
the display part doesn't. As far as I know, there's no way for
userspace to know whether a dma-buf-exported buffer is backed by CMA or
by scatter-gather memory. This feels like a major issue for using dma-
buf, since userspace can't predict whether a buffer exported on one
device can be imported on another when building its pipeline.
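
Lacking such a query, about the best userspace can do is a
try-and-fallback approach along these lines (sketch only, made-up
helper; the import, and thus any CMA vs scatter-gather mismatch,
typically only shows up when the buffer is actually queued):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int try_import_dmabuf(int fd, int dmabuf_fd, unsigned int length)
{
    struct v4l2_requestbuffers req = {
        .count  = 1,
        .type   = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
        .memory = V4L2_MEMORY_DMABUF,
    };
    struct v4l2_plane plane = {
        .bytesused = length,
        .length    = length,
        .m.fd      = dmabuf_fd,
    };
    struct v4l2_buffer buf = {
        .index    = 0,
        .type     = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
        .memory   = V4L2_MEMORY_DMABUF,
        .m.planes = &plane,
        .length   = 1,
    };

    if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0)
        return -errno;
    /* If the exporter's memory doesn't suit this device, this is
     * where it usually fails; the caller then falls back to MMAP
     * buffers plus a copy. */
    if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
        return -errno;

    return 0;
}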

> e.g. file
> to mem, or UDP socket to mem. But right now, let's say with ffmpeg/mpeg-
> ts: first you copy the MPEG TS to mem, then to demux you copy that H264
> stream to another buffer, then you copy in the parser, removing the
> start-code, and finally you copy into the accelerator, adding the start
> code back. Even if the driver allowed userptr, it would be unusable.
> 
> GStreamer on the other side implements lazy conversion, so it would copy
> the MPEG TS to mem, copy to demux, aggregate (with lazy merging) in the
> parser (but the stream format is negotiated, so it keeps the start-code).
> If you request alignment=au, you have full frames in buffers, so if your
> driver could do userptr, you could save that extra copy.
> 
> Now, if we demux an MP4 it's the same: the parser will need to do a full
> copy instead of lazy aggregation in order to prepend the start code
> (since it had an AVC length header). But userptr could save a copy.
> 
> If the driver requires no NAL prefix, then we could just pass a
> slightly forward pointer to userptr and avoid the AVC to ANNEX-B
> conversion, which is a bit slower (even though it's nothing compared to
> the full copies we already do).
> 
> That was my argument in favour of no NAL prefix in terms of efficiency,
> and it does not prevent adding a control to enable start-codes for the
> cases where it makes sense.

I see, so the internal architecture of userspace software may not be a
good fit for adding these bits and it could hurt performance a bit.
That feels like a significant downside.

> > So with that in mind, asking userspace to add a startcode it already
> > knows doesn't seem to be asking too much.
> > 
> > > On the other end, the data_offset thing is likely just a thing for the
> > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > 
> > Well, I think it's best to be fool-proof here and just require that
> > start code. We should also have per-slice bit offsets to the different
> > parts anyway, so drivers that don't need it can just ignore it.
> > 
> > In extreme cases where there is some interest in doing direct buffer
> > import without doing a copy in userspace, userspace could trick the
> > format and avoid a copy by not providing the start-code (assuming it
> > knows it doesn't need it) and specifying the bit offsets accordingly.
> > That'd be a hack for better performance, and it feels better to do
> > things in this order rather than having to hack around in the drivers
> > that need the start code in every other case.
> 
> So basically, you and Tomasz are both strongly in favour of adding the
> ANNEX-B start-code to the current uAPI. I have dug into the Cedrus
> registers, and it seems that it does have start-code scanning support.
> I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> like the RK does. That requires the IP to deal with the modification
> lists, which are per slice.

Actually, the bitstream parser won't reconfigure the pipeline
configuration registers; it's only there so that userspace can avoid
implementing bitstream parsing, but it's a standalone thing.

So if we want to do full-frame decoding we always need to reconfigure
our pipeline (or do it like we do currently and just use one of the
per-slice configurations and hope for the best).

Do we have more information on the RK3399 and what it requires exactly?
(Just to make sure it's not another issue altogether.)

> My question is: are you willing to adapt the Cedrus driver to support
> receiving start-codes? And will this have a performance impact or not?
> On the RK side, it's really just about flipping 1 bit.
> 
> On the Rockchip side, Tomasz had concerns about CPU wakeups and the
> fact that we didn't aim at supporting passing multiple slices at once
> to the IP (something the RK supports). It's important to understand
> that multi-slice streams are relatively rare and mostly used for
> low-latency / video conferencing, so aggregating in that case defeats
> the purpose of using slices. So I think the RK feature is not very
> important.

Agreed, let's aim for low-latency as a standard.

> Of course, I do believe that long term we will want to expose both
> stream formats on the RK (because the HW can do that), so that
> userspace can just pick the best one when available. So that boils down
> to our first idea: shall we expose _SLICE_A and _SLICE_B, or something
> like this? Now that we have progressed on the matter, I'm quite in
> favour of having _SLICE in the first place, as the preferred format
> that everyone should support, and allowing for variants later. Now, if
> we make one mandatory, we could also just have a menu control to allow
> other formats.

That seems fairly reasonable to me, and indeed, having one preferred
format at first seems to be a good move.

> > > > To me the breaking point was about having the slice header both in raw
> > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > well push it to its logical conclusion and include all the bits that
> > > > can be useful.
> > > 
> > > To take your words, the bits that contain useful information starts
> > > from the NAL type byte, exactly were the data was cut by VA-API and the
> > > current uAPI.
> > 
> > Agreed, but I think that the advantages of always requiring the start
> > code outweigh the potential (yet quite unlikely) downsides.
> > 
> > > > > > > But it also support slice mode, with an
> > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > 
> > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > 
> > > > > > > So in this case, indeed we strictly require on start-code. Though, to
> > > > > > > me this is not a great reason to make a new fourcc, so we will try and
> > > > > > > use (data_offset = 3) in order to make some space for that start code,
> > > > > > > and write it down in the driver. This is to be continued, we will
> > > > > > > report back on this later. This could have some side effect in the
> > > > > > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > > > > > on the encoded size and just copy anyway.
> > > > > > > 
> > > > > > > To my opinion, having a single format is a big deal, since userspace
> > > > > > > will generally be developed for one specific HW and we would endup with
> > > > > > > fragmented support. What we really want to achieve is having a driver
> > > > > > > interface which works across multiple HW, and I think this is quite
> > > > > > > possible.
> > > > > > 
> > > > > > I agree with that. The more I think about it, the more I believe we
> > > > > > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > > > > > and the parsed list in every scenario.
> > > > > 
> > > > > What I like of the cut at nal_type, is that there is only format. If we
> > > > > cut at nal_header, then we need to expose 2 formats. And it makes our
> > > > > API similar to other accelerator API, so it's easy to "convert"
> > > > > existing userspace.
> > > > 
> > > > Unless we make that cut the single one and only true cut that shall
> > > > supersed all other cuts :)
> > > 
> > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > thing and go back to our first idea.
> > 
> > Right, in the end I think we should go with:
> > V4L2_PIX_FMT_MPEG2_SLICE
> > V4L2_PIX_FMT_H264_SLICE
> > V4L2_PIX_FMT_HEVC_SLICE
> > 
> > And just require raw bitstream for the slice with emulation-prevention
> > bits included.
> 
> That should be the set of formats we start with indeed: the single
> format for which software gets written and tested, making sure software
> support is not fragmented, while other variants should be something to
> opt into.

Cheers for that!

Paul

> > Cheers,
> > 
> > Paul
> > 
> > > > > > For H.265, our decoder needs some information from the NAL type too.
> > > > > > We currently extract that in userspace and stick it to the
> > > > > > slice_header, but maybe it would make more sense to have drivers parse
> > > > > > that info from the buffer if they need it. On the other hand, it seems
> > > > > > quite common to pass information from the NAL type, so maybe we should
> > > > > > either make a new control for it or have all the fields in the
> > > > > > slice_header (which would still be wrong in terms of matching bitstream
> > > > > > description).
> > > > > 
> > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > simple mask. But yes, if we don't have it yet, we should expose the NAL
> > > > > type, it would be cleaner.
> > > > 
> > > > Right, works for me.
> > > 
> > > Ack.
> > > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > 
> > > > > > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > > > > > concept that only makes sense relative to a decoder implementation. The
> > > > > > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > > > > > (Annex C), but that's about it.
> > > > > > > > 
> > > > > > > > What's really in the bitstream is the list of modified short-term and
> > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > 
> > > > > > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > > > > > the controls and just pass these lists agremented with relevant
> > > > > > > > information for userspace.
> > > > > > > > 
> > > > > > > > I think it should be up to the driver to maintain a DPB and we could
> > > > > > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > > > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > > > > > keeping in particular and can manage with the lists from the bitstream
> > > > > > > > directly.
> > > > > > > 
> > > > > > > As discusses today, we still need to pass that list. It's being index
> > > > > > > by the HW to retrieve the extra information we have collected about the
> > > > > > > status of the reference frames. In the case of Hantro, which process
> > > > > > > the modification list from the slice header for us, we also need that
> > > > > > > list to construct the unmodified list.
> > > > > > > 
> > > > > > > So the problem here is just a naming problem. That list is not really a
> > > > > > > DPB. It is just the list of long-term/short-term references with the
> > > > > > > status of these references. So maybe we could just rename as
> > > > > > > references/reference_entry ?
> > > > > > 
> > > > > > What I'd like to pass is the diff to the references list, as ffmpeg
> > > > > > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > > > > > functional change here, only that we should stop calling it a DPB,
> > > > > > which confuses everyone.
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > > > - Using flags
> > > > > > > > 
> > > > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > > > 
> > > > > > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > > > > > 
> > > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > > > > > 
> > > > > > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > > > > > alignment constraints (we're dealing with 64-bit platforms that need to
> > > > > > have 32-bit userspace and compat_ioctl).
> > > > > 
> > > > > I see, thanks.
> > > > > 
> > > > > > > > - Clear split of controls and terminology
> > > > > > > > 
> > > > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > > > bitstream element names for those.
> > > > > > > > 
> > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > > > 
> > > > > > > Oops, I think you meant slice_prams ? decode_params matches the
> > > > > > > information found in SPS/PPS (combined?), while slice_params matches
> > > > > > > the information extracted (and executed in case of l0/l1) from the
> > > > > > > slice headers.
> > > > > > 
> > > > > > Yes you're right, I mixed them up.
> > > > > > 
> > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > it's not just the slice header, and it's per slice.
> > > > > > 
> > > > > > Mhh, what exactly remains in there and where does it originate in the
> > > > > > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > > > > > group of bitstream elements.
> > > > > > 
> > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > 
> > > > > > > There reference list is only updated by userspace (through it's DPB)
> > > > > > > base on the result of the last decoding step. I was very confused for a
> > > > > > > moment until I realize that the lists in the slice_header are just a
> > > > > > > list of modification to apply to the reference list in order to produce
> > > > > > > l0 and l1.
> > > > > > 
> > > > > > Indeed, and I'm suggesting that we pass the modifications only, which
> > > > > > would fit a slice_header control.
> > > > > 
> > > > > I think I made my point why we want the dpb -> references. I'm going to
> > > > > validate with the VA driver now, to see if the references list there is
> > > > > usable with our code.
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Paul
> > > > > > 
> > > > > > > > I'm up for preparing and submitting these control changes and updating
> > > > > > > > cedrus if they seem agreeable.
> > > > > > > > 
> > > > > > > > What do you think?
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > 
> > > > > > > > Paul
> > > > > > > > 
> > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > > > > > 
-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-18  9:50               ` Paul Kocialkowski
@ 2019-05-18 10:04                 ` Jernej Škrabec
  2019-05-18 10:29                   ` Paul Kocialkowski
  0 siblings, 1 reply; 55+ messages in thread
From: Jernej Škrabec @ 2019-05-18 10:04 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Ezequiel Garcia, Jonas Karlman

Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> Hi,
> 
> On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit 
:
> > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a 
écrit :
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > have a
> > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > than
> > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > situation and how
> > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > accordingly.
> > > > > > > > > 
> > > > > > > > > - Per-slice decoding
> > > > > > > > > 
> > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > patch[1]
> > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > good, we
> > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > concatenated and
> > > > > > > > > have them submitted as individual requests.
> > > > > > > > > 
> > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > it would
> > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > decoding the
> > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > But there
> > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > Not doing it
> > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > specify
> > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > of the
> > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > handling
> > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > 
> > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > the driver
> > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > across
> > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > multiplexing in
> > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > across
> > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > resource
> > > > > > > > access scheduling in general.
> > > > > > > 
> > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > specific
> > > > > > > controls per opened device so keeping cached values works out
> > > > > > > well.
> > > > > > > 
> > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > first
> > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > 
> > > > > > > > > - Annex-B formats
> > > > > > > > > 
> > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > pixel formats
> > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > that need
> > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > is
> > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > the encoded
> > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > 
> > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > (without raw slice
> > > > > > > > > header and start code);
> > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > raw header
> > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > 
> > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > context of
> > > > > > > > > stateless video decoding.
> > > > > > > > > 
> > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > in every
> > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > regardless
> > > > > > > > > of the format if they're lazy.
> > > > > > > > 
> > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > subject. So a
> > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > blocks
> > > > > > > > 
> > > > > > > > (simplified):
> > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > 
> > > > > > > Thanks for the details!
> > > > > > > 
> > > > > > > > nal_header:
> > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > the of a
> > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > sequence
> > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > byte
> > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > is also
> > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > notably used
> > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > must keep
> > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > from random
> > > > > > > > location.
> > > > > > > > 
> > > > > > > > nal_type:
> > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > the
> > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > and HEVC
> > > > > > > > but I know it's size is in bytes.
> > > > > > > > 
> > > > > > > > slice_header:
> > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > lists to
> > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > bytes.
> > > > > > > > 
> > > > > > > > slice:
> > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > data used to
> > > > > > > > decode. This bit has a special coding called the
> > > > > > > > anti-emulation, which
> > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > present in
> > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > manual they
> > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > 
> > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > Cedrus
> > > > > > > > 
> > > > > > > > driver:
> > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > 
> > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > understand
> > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > size is
> > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > slice_params.
> > > > > > > 
> > > > > > > True, there is that.
> > > > > > > 
> > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > code. As a
> > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > that it
> > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > 
> > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > the
> > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > 
> > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > that it
> > > > > > > > take care for us of reading and executing the modification
> > > > > > > > lists found
> > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > pass the
> > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > the
> > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > These list
> > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > the
> > > > > > > > following section is the execution of the modification list.
> > > > > > > > As this
> > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > As a
> > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > the same
> > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > 
> > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > from a
> > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > as
> > > > > > > possible.
> > > > > > 
> > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > IP
> > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > was
> > > > > > confused on this before, the slice header does not contain a list
> > > > > > of
> > > > > > references, instead it contains a list modification to be applied
> > > > > > to
> > > > > > the reference list. I need to check again, but to execute these
> > > > > > modification, you need to filter and sort the references in a
> > > > > > specific
> > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > 
> > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > thinks
> > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > since
> > > > > > this is the data that gives use the ability to implementing glue
> > > > > > in the
> > > > > > driver to compensate some HW differences.
> > > > > > 
> > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > HW
> > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > are
> > > > > > standard intermediate step of the standard is not that important.
> > > > > 
> > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > instead of dpb.
> > > > > 
> > > > > > > > Now, this is just a start. For RK3399, we have a
> > > > > > > > different CODEC design. This one does not have the
> > > > > > > > start_code_e bit. What the IP does is that you give it
> > > > > > > > one or more slices per buffer, set up the params and
> > > > > > > > start decoding, and the decoder then returns the location
> > > > > > > > of the following NAL. So basically you could offload the
> > > > > > > > scanning of start codes to the HW. That being said, with
> > > > > > > > the driver layer in between, that would be amazingly
> > > > > > > > inconvenient to use, and with the Boyer-Moore algorithm,
> > > > > > > > it is pretty cheap to scan this type of start-code on the
> > > > > > > > CPU. But the feature that this allows is to operate in
> > > > > > > > frame mode. In this mode, you have 1 interrupt per frame.
> > > > > > > 
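For context, scanning for these start codes really is a trivial byte
search; a naive version (the Boyer-Moore variant mentioned above only
makes it faster, the result is the same) could look like this:

#include <stddef.h>
#include <stdint.h>

/* Return the offset of the byte following the next 0x000001 start
 * code, or -1 if no start code is found in the buffer. */
static ptrdiff_t next_nal_offset(const uint8_t *data, size_t size)
{
	size_t i;

	for (i = 0; i + 2 < size; i++) {
		if (data[i] == 0x00 && data[i + 1] == 0x00 &&
		    data[i + 2] == 0x01)
			return i + 3;
	}

	return -1;
}
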
> > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > userspace and
> > > > > > > my current feeling is that we should just ditch support for
> > > > > > > per-frame
> > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > are
> > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > 
> > > > > > I'm not worried about this either. We can already support that by
> > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > with
> > > > > > this would require a new format, the one we talked about,
> > > > > > SLICE_ANNEX_B.
> > > > > 
> > > > > Right, but what I'm thinking about is making that the one and only
> > > > > format. The rationale is that it's always easier to just append a
> > > > > start
> > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > have the whole thing that will work in every situation.
> > > > 
> > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > the
> > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > a
> > > > bit complicated since this code might come from two libraries). In
> > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > follows.
> > > 
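To illustrate the copy being discussed: with a mandatory start code, an
ISOMP4 demuxer (or the parser below it) has to rewrite every AVC length
prefix into a start code before submission. A rough sketch, assuming
4-byte length prefixes (lengthSizeMinusOne == 3):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rewrite [4-byte NAL size][NAL]... into [start code][NAL]...
 * Returns the number of bytes written to dst, or 0 on error. */
static size_t avc_to_annex_b(const uint8_t *src, size_t src_size,
			     uint8_t *dst, size_t dst_size)
{
	static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };
	size_t in = 0, out = 0;

	while (in + 4 <= src_size) {
		uint32_t nal_size = ((uint32_t)src[in] << 24) |
				    ((uint32_t)src[in + 1] << 16) |
				    ((uint32_t)src[in + 2] << 8) |
				    (uint32_t)src[in + 3];
		in += 4;

		if (nal_size > src_size - in ||
		    out + sizeof(start_code) + nal_size > dst_size)
			return 0;	/* malformed input or dst too small */

		memcpy(dst + out, start_code, sizeof(start_code));
		out += sizeof(start_code);
		memcpy(dst + out, src + in, nal_size);
		out += nal_size;
		in += nal_size;
	}

	return out;
}
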
> > > Well, I think we have to do a copy from system memory to the buffer
> > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > expected not to have any MMU unit and not allow sg import anyway.
> > 
> > The Rockchip has an MMU. You need at least one copy indeed,
> 
> Is the MMU in use currently? That can make things troublesome if we run
> into a case where the VPU has MMU and deals with scatter-gather while
> the display part doesn't. As far as I know, there's no way for
> userspace to know whether a dma-buf-exported buffer is backed by CMA or
> by scatter-gather memory. This feels like a major issue for using dma-
> buf, since userspace can't predict whether a buffer exported on one
> device can be imported on another when building its pipeline.

FYI, Allwinner H6 also has an IOMMU, it's just that there is no mainline
driver for it yet. It is supported for display, both VPUs and some other
devices. I think no sane SoC designer would leave out one unit or another
without IOMMU support, that just calls for trouble, as you pointed out.

Best regards,
Jernej

> 
> > e.g. file to mem, or udpsocket to mem. But right now, let's say with
> > ffmpeg/mpeg-ts: first you need to copy the MPEG-TS to mem, then to
> > demux you copy that H264 stream to another buffer, you then copy in
> > the parser, removing the start-code, and finally copy into the
> > accelerator, adding the start code. If the driver were to allow
> > userptr, it would be unusable.
> > 
> > GStreamer on the other hand implements lazy conversion, so it would
> > copy the mpegts to mem, copy to demux, aggregate (with lazy merging)
> > in the parser (but the stream format is negotiated, so it keeps the
> > start-code). If you request alignment=au, you have full frames in
> > buffers, so if your driver could do userptr, you can save that extra
> > copy.
> > 
> > Now, if we demux an MP4 it's the same, the parser will need to do a
> > full copy instead of lazy aggregation in order to prepend the start
> > code (since it had an AVC header). But userptr could save a copy.
> > 
> > If the driver requires no NAL prefix, then we could just pass a
> > slightly forwarded pointer to userptr and avoid the AVC to ANNEX-B
> > conversion, which is a bit slower (even though it's nothing compared
> > to the full copies we already do).
> > 
> > That was my argument in favour of no NAL prefix in terms of
> > efficiency, and it does not prevent adding a control to enable
> > start-codes for cases where it makes sense.
> 
> I see, so the internal architecture of userspace software may not be a
> good fit for adding these bits, and it could hurt performance a bit.
> That feels like a significant downside.
> 
> > > So with that in mind, asking userspace to add a startcode it already
> > > knows doesn't seem to be asking too much.
> > > 
> > > > On the other end, the data_offset thing is likely just a thing for the
> > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > 
> > > Well, I think it's best to be fool-proof here and just require that
> > > start code. We should also have per-slice bit offsets to the different
> > > parts anyway, so drivers that don't need it can just ignore it.
> > > 
> > > In extreme cases where there is some interest in doing direct buffer
> > > import without doing a copy in userspace, userspace could trick the
> > > format and avoid a copy by not providing the start-code (assuming it
> > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > That'd be a hack for better performance, and it feels better to do
> > > things in this order rather than having to hack around in the drivers
> > > that need the start code in every other case.
> > 
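For what it's worth, the per-slice offsets being discussed could be as
simple as a few extra fields next to the slice size, which a driver
either honours or ignores. Purely illustrative names below, not an
actual uAPI proposal:

#include <linux/types.h>

struct example_h264_slice_offsets {
	__u32 size;               /* bytes used in the OUTPUT buffer */
	__u32 start_code_size;    /* 0 if userspace did not include one */
	__u32 header_bit_offset;  /* first bit of the slice_header */
	__u32 data_bit_offset;    /* first bit of the slice data */
};
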
> > So basically, you and Tomasz are both strongly in favour of adding
> > the ANNEX-B start-code to the current uAPI. I have dug into the
> > Cedrus registers, and it seems that it does have start-code scanning
> > support. I'm not sure it can do "full-frame" decoding, 1 interrupt
> > per frame like the RK does. That requires the IP to deal with the
> > modification lists, which are per slice.
> 
> Actually the bitstream parser won't reconfigure the pipeline
> configuration registers, it's only around for userspace to avoid
> implementing bitstream parsing, but it's a standalone thing.
> 
> So if we want to do full-frame decoding we always need to reconfigure
> our pipeline (or do it like we do currently and just use one of the
> per-slice configuration and hope for the best).
> 
> Do we have more information on the RK3399 and what it requires exactly?
> (Just to make sure it's not another issue altogether.)
> 
> > My question is, are you willing to adapt the Cedrus driver to
> > support receiving start-codes? And will this have a performance
> > impact or not? On the RK side, it's really just about flipping 1 bit.
> > 
> > On the Rockchip side, Tomasz had concerns about CPU wakeups and the
> > fact that we didn't aim at supporting passing multiple slices at
> > once to the IP (something RK supports). It's important to understand
> > that multi-slice streams are relatively rare and mostly used for
> > low-latency / video conferencing. So aggregating in these cases
> > defeats the purpose of using slices. So I think the RK feature is
> > not very important.
> 
> Agreed, let's aim for low-latency as a standard.
> 
> > Of course, I do believe that long term we will want to expose both
> > stream formats on RK (because the HW can do that), so then userspace
> > can just pick the best when available. So that boils down to our
> > first idea: shall we expose _SLICE_A and _SLICE_B or something like
> > this? Now that we have progressed on the matter, I'm quite in favour
> > of having _SLICE in the first place, with the preferred format that
> > everyone should support, and allow for variants later. Now, if we
> > make one mandatory, we could also just have a menu control to allow
> > other formats.
> 
> That seems fairly reasonable to me, and indeed, having one preferred
> format at first seems to be a good move.
> 
> > > > > To me the breaking point was about having the slice header both in
> > > > > raw
> > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > well push it to its logical conclusion and include all the bits that
> > > > > can be useful.
> > > > 
> > > > To take your words, the bits that contain useful information
> > > > start from the NAL type byte, exactly where the data was cut by
> > > > VA-API and the current uAPI.
> > > 
> > > Agreed, but I think that the advantages of always requiring the start
> > > code outweigh the potential (yet quite unlikely) downsides.
> > > 
> > > > > > > > But it also support slice mode, with an
> > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > 
> > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > 
> > > > > > > > So in this case, indeed we strictly require the
> > > > > > > > start-code. Though, to me this is not a great reason to
> > > > > > > > make a new fourcc, so we will try and use
> > > > > > > > (data_offset = 3) in order to make some space for that
> > > > > > > > start code, and write it down in the driver. This is to
> > > > > > > > be continued, we will report back on this later. This
> > > > > > > > could have some side effect on the ability to import
> > > > > > > > buffers. But most userspace doesn't try to do zero-copy
> > > > > > > > on the encoded side and just copies anyway.
> > > > > > > > 
> > > > > > > > In my opinion, having a single format is a big deal,
> > > > > > > > since userspace will generally be developed for one
> > > > > > > > specific HW and we would end up with fragmented support.
> > > > > > > > What we really want to achieve is having a driver
> > > > > > > > interface which works across multiple HW, and I think
> > > > > > > > this is quite possible.
> > > > > > > 
> > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > we
> > > > > > > should just pass the whole
> > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > and the parsed list in every scenario.
> > > > > > 
> > > > > > What I like about the cut at nal_type is that there is only
> > > > > > one format. If we cut at nal_header, then we need to expose
> > > > > > 2 formats. And it makes our API similar to other accelerator
> > > > > > APIs, so it's easy to "convert" existing userspace.
> > > > > 
> > > > > Unless we make that cut the single one and only true cut that
> > > > > shall supersede all other cuts :)
> > > > 
> > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > thing and go back to our first idea.
> > > 
> > > Right, in the end I think we should go with:
> > > V4L2_PIX_FMT_MPEG2_SLICE
> > > V4L2_PIX_FMT_H264_SLICE
> > > V4L2_PIX_FMT_HEVC_SLICE
> > > 
> > > And just require raw bitstream for the slice with emulation-prevention
> > > bits included.
> > 
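Assuming these fourccs end up defined in videodev2.h (the exact names
are still what this thread is settling), selecting one from userspace
would be the usual S_FMT dance on the OUTPUT (bitstream) queue; a
sketch with error handling trimmed:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static int set_slice_format(int fd, unsigned int width,
			    unsigned int height, unsigned int sizeimage)
{
	struct v4l2_format fmt;

	memset(&fmt, 0, sizeof(fmt));
	fmt.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
	fmt.fmt.pix_mp.width = width;
	fmt.fmt.pix_mp.height = height;
	fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H264_SLICE;
	fmt.fmt.pix_mp.num_planes = 1;
	fmt.fmt.pix_mp.plane_fmt[0].sizeimage = sizeimage;

	if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
		return -1;

	/* The driver may adjust or refuse the format; check the result. */
	return fmt.fmt.pix_mp.pixelformat == V4L2_PIX_FMT_H264_SLICE ? 0 : -1;
}
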
> > That should be the set of formats we start with indeed. The single
> > format for which software gets written and tested, making sure
> > software support is not fragmented; other variants should be
> > something to opt into.
> 
> Cheers for that!
> 
> Paul
> 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > too.
> > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > parse
> > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > seems
> > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > should
> > > > > > > either make a new control for it or have all the fields in the
> > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > bitstream
> > > > > > > description).
> > > > > > 
> > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > the NAL
> > > > > > type, it would be cleaner.
> > > > > 
> > > > > Right, works for me.
> > > > 
> > > > Ack.
> > > > 
> > > > > Cheers,
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > 
> > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > (DPB) is a
> > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > implementation. The
> > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > reference decoder
> > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > 
> > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > short-term and
> > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > 
> > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > about DPB in
> > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > relevant
> > > > > > > > > information for userspace.
> > > > > > > > > 
> > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > we could
> > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > decoder needs
> > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > requirement
> > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > book-
> > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > bitstream
> > > > > > > > > directly.
> > > > > > > > 
> > > > > > > > As discussed today, we still need to pass that list.
> > > > > > > > It's being indexed by the HW to retrieve the extra
> > > > > > > > information we have collected about the status of the
> > > > > > > > reference frames. In the case of Hantro, which processes
> > > > > > > > the modification list from the slice header for us, we
> > > > > > > > also need that list to construct the unmodified list.
> > > > > > > > 
> > > > > > > > So the problem here is just a naming problem. That list
> > > > > > > > is not really a DPB. It is just the list of
> > > > > > > > long-term/short-term references with the status of these
> > > > > > > > references. So maybe we could just rename it as
> > > > > > > > references/reference_entry?
> > > > > > > 
> > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > ffmpeg
> > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > too). No
> > > > > > > functional change here, only that we should stop calling it a
> > > > > > > DPB,
> > > > > > > which confuses everyone.
> > > > > > 
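For illustration, passing "the diff" could roughly mean mirroring the
ref_pic_list_modification() syntax elements one-to-one. The element
names below are the bitstream ones; the struct itself is hypothetical,
not a concrete uAPI proposal:

#include <linux/types.h>

struct example_h264_ref_list_mod {
	__u8  modification_of_pic_nums_idc;  /* 0..3, as in the bitstream */
	__u8  padding[3];
	__u32 abs_diff_pic_num_minus1;       /* valid when idc is 0 or 1 */
	__u32 long_term_pic_num;             /* valid when idc is 2 */
};

struct example_h264_slice_ref_mods {
	__u32 num_ref_idx_l0_active_minus1;
	__u32 num_ref_idx_l1_active_minus1;
	struct example_h264_ref_list_mod modifications_l0[32];
	struct example_h264_ref_list_mod modifications_l1[32];
};
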
> > > > > > Yes.
> > > > > > 
> > > > > > > > > - Using flags
> > > > > > > > > 
> > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > be
> > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > u64 by
> > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > some room
> > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > good enough).
> > > > > > > > > 
> > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > u64.
> > > > > > > > 
> > > > > > > > Makes sense. I guess bits (member : 1) are not allowed
> > > > > > > > in uAPI, right?
> > > > > > > 
> > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > bit
> > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > need to
> > > > > > > have 32-bit userspace and compat_ioctl).
> > > > > > 
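Concretely, the u64-flags idea (as opposed to one u8 per boolean or C
bitfields) keeps the struct layout identical for 32-bit and 64-bit
userspace; a small sketch with made-up names:

#include <linux/types.h>

#define EXAMPLE_MPEG2_PIC_FLAG_TOP_FIELD_FIRST	(1ULL << 0)
#define EXAMPLE_MPEG2_PIC_FLAG_PROGRESSIVE	(1ULL << 1)
#define EXAMPLE_MPEG2_PIC_FLAG_CONCEALMENT_MV	(1ULL << 2)

struct example_mpeg2_picture_params {
	__u64 flags;                 /* EXAMPLE_MPEG2_PIC_FLAG_* */
	__u32 picture_coding_type;
	__u32 reserved;              /* keep the 64-bit alignment explicit */
};
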
> > > > > > I see, thanks.
> > > > > > 
> > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > 
> > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > match as
> > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > stick to the
> > > > > > > > > bitstream element names for those.
> > > > > > > > > 
> > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > 
> > > > > > > > Oops, I think you meant slice_params? decode_params
> > > > > > > > matches the information found in SPS/PPS (combined?),
> > > > > > > > while slice_params matches the information extracted
> > > > > > > > (and executed in the case of l0/l1) from the slice
> > > > > > > > headers.
> > > > > > > 
> > > > > > > Yes you're right, I mixed them up.
> > > > > > > 
> > > > > > > > That being said, to me this name wasn't confusing,
> > > > > > > > since it's not just the slice header, and it's per
> > > > > > > > slice.
> > > > > > > 
> > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > in the
> > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > actual
> > > > > > > group of bitstream elements.
> > > > > > > 
> > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > reference lists
> > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > 
> > > > > > > > The reference list is only updated by userspace
> > > > > > > > (through its DPB) based on the result of the last
> > > > > > > > decoding step. I was very confused for a moment until I
> > > > > > > > realized that the lists in the slice_header are just a
> > > > > > > > list of modifications to apply to the reference list in
> > > > > > > > order to produce l0 and l1.
> > > > > > > 
> > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > which
> > > > > > > would fit a slice_header control.
> > > > > > 
> > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > going to
> > > > > > validate with the VA driver now, to see if the references list
> > > > > > there is
> > > > > > usable with our code.
> > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > and updating
> > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > 
> > > > > > > > > What do you think?
> > > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > 
> > > > > > > > > Paul
> > > > > > > > > 
> > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > [2]:
> > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378






* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-18 10:04                 ` Jernej Škrabec
@ 2019-05-18 10:29                   ` Paul Kocialkowski
  2019-05-18 14:09                     ` Nicolas Dufresne
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-18 10:29 UTC (permalink / raw)
  To: Jernej Škrabec
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Ezequiel Garcia, Jonas Karlman

Hi,

Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > Hi,
> > 
> > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit 
> :
> > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a 
> écrit :
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > have a
> > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > than
> > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > situation and how
> > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > accordingly.
> > > > > > > > > > 
> > > > > > > > > > - Per-slice decoding
> > > > > > > > > > 
> > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > patch[1]
> > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > good, we
> > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > concatenated and
> > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > 
> > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > it would
> > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > decoding the
> > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > But there
> > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > Not doing it
> > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > specify
> > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > of the
> > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > handling
> > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > 
> > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > the driver
> > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > across
> > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > multiplexing in
> > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > across
> > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > resource
> > > > > > > > > access scheduling in general.
> > > > > > > > 
> > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > specific
> > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > well.
> > > > > > > > 
> > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > first
> > > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > > 
> > > > > > > > > > - Annex-B formats
> > > > > > > > > > 
> > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > pixel formats
> > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > that need
> > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > is
> > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > the encoded
> > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > 
> > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > (without raw slice
> > > > > > > > > > header and start code);
> > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > raw header
> > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > 
> > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > context of
> > > > > > > > > > stateless video decoding.
> > > > > > > > > > 
> > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > in every
> > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > regardless
> > > > > > > > > > of the format if they're lazy.
> > > > > > > > > 
> > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > subject. So a
> > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > blocks
> > > > > > > > > 
> > > > > > > > > (simplified):
> > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > 
> > > > > > > > Thanks for the details!
> > > > > > > > 
> > > > > > > > > nal_header:
> > > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > > the of a
> > > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > > sequence
> > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > byte
> > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > is also
> > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > notably used
> > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > must keep
> > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > from random
> > > > > > > > > location.
> > > > > > > > > 
> > > > > > > > > nal_type:
> > > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > > the
> > > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > > and HEVC
> > > > > > > > > but I know it's size is in bytes.
> > > > > > > > > 
> > > > > > > > > slice_header:
> > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > lists to
> > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > bytes.
> > > > > > > > > 
> > > > > > > > > slice:
> > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > data used to
> > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > anti-emulation, which
> > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > present in
> > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > manual they
> > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > 
> > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > Cedrus
> > > > > > > > > 
> > > > > > > > > driver:
> > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > 
> > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > understand
> > > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > > size is
> > > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > > slice_params.
> > > > > > > > 
> > > > > > > > True, there is that.
> > > > > > > > 
> > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > code. As a
> > > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > > that it
> > > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > > 
> > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > the
> > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > 
> > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > that it
> > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > lists found
> > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > pass the
> > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > the
> > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > These list
> > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > the
> > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > As this
> > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > As a
> > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > the same
> > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > 
> > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > from a
> > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > as
> > > > > > > > possible.
> > > > > > > 
> > > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > > IP
> > > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > > was
> > > > > > > confused on this before, the slice header does not contain a list
> > > > > > > of
> > > > > > > references, instead it contains a list modification to be applied
> > > > > > > to
> > > > > > > the reference list. I need to check again, but to execute these
> > > > > > > modification, you need to filter and sort the references in a
> > > > > > > specific
> > > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > > 
> > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > thinks
> > > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > > since
> > > > > > > this is the data that gives use the ability to implementing glue
> > > > > > > in the
> > > > > > > driver to compensate some HW differences.
> > > > > > > 
> > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > HW
> > > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > > are
> > > > > > > standard intermediate step of the standard is not that important.
> > > > > > 
> > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > instead of dpb.
> > > > > > 
> > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > CODEC
> > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > IP does,
> > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > params,
> > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > the
> > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > start
> > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > between, that
> > > > > > > > > would be amazingly inconvenient to use, and with Boyer-more
> > > > > > > > > algorithm,
> > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > the
> > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > mode, you
> > > > > > > > > have 1 interrupt per frame.
> > > > > > > > 
> > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > userspace and
> > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > per-frame
> > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > are
> > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > 
> > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > with
> > > > > > > this would require a new format, the one we talked about,
> > > > > > > SLICE_ANNEX_B.
> > > > > > 
> > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > format. The rationale is that it's always easier to just append a
> > > > > > start
> > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > have the whole thing that will work in every situation.
> > > > > 
> > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > the
> > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > a
> > > > > bit complicated since this code might come from two libraries). In
> > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > follows.
> > > > 
> > > > Well, I think we have to do a copy from system memory to the buffer
> > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > expected not to have any MMU unit and not allow sg import anyway.
> > > 
> > > The Rockchip has an mmu. You need one copy at least indeed,
> > 
> > Is the MMU in use currently? That can make things troublesome if we run
> > into a case where the VPU has MMU and deals with scatter-gather while
> > the display part doesn't. As far as I know, there's no way for
> > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > by scatter-gather memory. This feels like a major issue for using dma-
> > buf, since userspace can't predict whether a buffer exported on one
> > device can be imported on another when building its pipeline.
> 
> FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver 
> for it yet. It is supported for display, both VPUs and some other devices. I 
> think no sane SoC designer would left out one or another unit without IOMMU 
> support, that just calls for troubles, as you pointed out.

Right right, I've been following that from a distance :)

Indeed I think it's realistic to expect that for now, but it may not
play out so well in the long term. For instance, maybe connecting a USB
display would require CMA when the rest of the system can do with sg.

I think it would really be useful for userspace to have a way to test
whether a buffer can be imported from one device to another. It feels
better than indicating where the memory lives, since there are
countless cases where additional restrictions apply too.
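
In the meantime, the closest approximation with the current API is
probably to just attempt the import on the target device and see
whether it fails; a rough sketch (failures may only surface at QBUF or
even STREAMON, which is exactly why a dedicated query would be nicer):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Try importing a dma-buf fd on a multi-planar queue of the target
 * device; returns 1 if QBUF accepts it. Only an approximation, since
 * some drivers will only map the dma-buf later on. */
static int can_import_dmabuf(int fd, __u32 buf_type, int dmabuf_fd)
{
	struct v4l2_requestbuffers reqbufs;
	struct v4l2_buffer buf;
	struct v4l2_plane plane;

	memset(&reqbufs, 0, sizeof(reqbufs));
	reqbufs.count = 1;
	reqbufs.type = buf_type;
	reqbufs.memory = V4L2_MEMORY_DMABUF;
	if (ioctl(fd, VIDIOC_REQBUFS, &reqbufs) < 0)
		return 0;

	memset(&buf, 0, sizeof(buf));
	memset(&plane, 0, sizeof(plane));
	buf.type = buf_type;
	buf.memory = V4L2_MEMORY_DMABUF;
	buf.index = 0;
	buf.m.planes = &plane;
	buf.length = 1;			/* single-plane format assumed */
	plane.m.fd = dmabuf_fd;

	return ioctl(fd, VIDIOC_QBUF, &buf) == 0;
}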

Cheers,

Paul

> Best regards,
> Jernej
> 
> > > e.g. file
> > > to mem, or udpsocket to mem. But right now, let's say with ffmpeg/mpeg-
> > > ts, first you need to copy the MPEG TS to mem, then to demux you copy
> > > that H264 stream to another buffer, you then copy in the parser,
> > > removing the start-code and finally copy in the accelerator, adding the
> > > start code. If the driver would allow userptr, it would be unusable.
> > > 
> > > GStreamer on the other side implement lazy conversion, so it would copy
> > > the mpegts to mem, copy to demux, aggregate (with lazy merging) in the
> > > parser (but stream format is negotiation, so it keeps the start-code).
> > > If you request alignment=au, you have full frame of buffers, so if your
> > > driver could do userptr, you can same that extra copy.
> > > 
> > > Now, if we demux an MP4 it's the same, the parser will need do a full
> > > copy instead of lazy aggregation in order to prepend the start code
> > > (since it had an AVC header). But userptr could save a copy.
> > > 
> > > If the driver requires no nal prefix, then we could just pass a
> > > slightly forward point to userptr and avoid ACV to ANNEX-B conversion,
> > > which is a bit slower (even know it's nothing compare to the full
> > > copies we already do.
> > > 
> > > That was my argument in favour for no NAL prefix in term of efficiency,
> > > and it does not prevent adding a control to enable start-code for cases
> > > it make sense.
> > 
> > I see, so the internal arcitecture of userspace software may not be a
> > good fit for adding these bits and it could hurt performance a bit.
> > That feels like a significant downside.
> > 
> > > > So with that in mind, asking userspace to add a startcode it already
> > > > knows doesn't seem to be asking too much.
> > > > 
> > > > > On the other end, the data_offset thing is likely just a thing for the
> > > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > > 
> > > > Well, I think it's best to be fool-proof here and just require that
> > > > start code. We should also have per-slice bit offsets to the different
> > > > parts anyway, so drivers that don't need it can just ignore it.
> > > > 
> > > > In extreme cases where there is some interest in doing direct buffer
> > > > import without doing a copy in userspace, userspace could trick the
> > > > format and avoid a copy by not providing the start-code (assuming it
> > > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > > That'd be a hack for better performance, and it feels better to do
> > > > things in this order rather than having to hack around in the drivers
> > > > that need the start code in every other case.
> > > 
> > > So basically, you and Tomas are both strongly in favour of adding
> > > ANNEX-B start-code to the current uAPI. I have digged into Cedrus
> > > registers, and it seems that it does have start-code scanning support.
> > > I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> > > like the RK do. That requires the IP to deal with the modifications
> > > lists, which are per slices.
> > 
> > Actually the bitstream parser won't reconfigure the pipeline
> > configuration registers, it's only around for userspace to avoid
> > implementing bitstream parsing, but it's a standalone thing.
> > 
> > So if we want to do full-frame decoding we always need to reconfigure
> > our pipeline (or do it like we do currently and just use one of the
> > per-slice configuration and hope for the best).
> > 
> > Do we have more information on the RK3399 and what it requires exactly?
> > (Just to make sure it's not another issue altogether.)
> > 
> > > My question is, are you willing to adapt the Cedrus driver to support
> > > receiving start-code ? And will this have a performance impact or not ?
> > > On RK side, it's really just about flipping 1 bit.
> > > 
> > > On the Rockchip side, Tomas had concern about CPU wakeup and the fact
> > > that we didn't aim at supporting passing multiple slices at once to the
> > > IP (something RK supports). It's important to understand that multi-
> > > slice streams are relatively rare and mostly used for low-latency /
> > > video conferencing. So aggregating in these case defeats the purpose of
> > > using slices. So I think RK feature is not very important.
> > 
> > Agreed, let's aim for low-latency as a standard.
> > 
> > > Of course, I do believe that long term we will want to expose bot
> > > stream formats on RK (because the HW can do that), so then userspace
> > > can just pick the best when available. So that boils down to our first
> > > idea, shall we expose _SLICE_A and _SLICE_B or something like this ?
> > > Now that we have progressed on the matter, I'm quite in favour of
> > > having _SLICE in the first place, with the preferred format that
> > > everyone should support, and allow for variants later. Now, if we make
> > > one mandatory, we could also just have a menu control to allow other
> > > formats.
> > 
> > That seems fairly reasonable to me, and indeed, having one preferred
> > format at first seems to be a good move.
> > 
> > > > > > To me the breaking point was about having the slice header both in
> > > > > > raw
> > > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > > well push it to its logical conclusion and include all the bits that
> > > > > > can be useful.
> > > > > 
> > > > > To take your words, the bits that contain useful information starts
> > > > > from the NAL type byte, exactly were the data was cut by VA-API and
> > > > > the
> > > > > current uAPI.
> > > > 
> > > > Agreed, but I think that the advantages of always requiring the start
> > > > code outweigh the potential (yet quite unlikely) downsides.
> > > > 
> > > > > > > > > But it also support slice mode, with an
> > > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > > 
> > > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > > 
> > > > > > > > > So in this case, indeed we strictly require on start-code.
> > > > > > > > > Though, to
> > > > > > > > > me this is not a great reason to make a new fourcc, so we will
> > > > > > > > > try and
> > > > > > > > > use (data_offset = 3) in order to make some space for that
> > > > > > > > > start code,
> > > > > > > > > and write it down in the driver. This is to be continued, we
> > > > > > > > > will
> > > > > > > > > report back on this later. This could have some side effect in
> > > > > > > > > the
> > > > > > > > > ability to import buffers. But most userspace don't try to do
> > > > > > > > > zero-copy
> > > > > > > > > on the encoded size and just copy anyway.
> > > > > > > > > 
> > > > > > > > > To my opinion, having a single format is a big deal, since
> > > > > > > > > userspace
> > > > > > > > > will generally be developed for one specific HW and we would
> > > > > > > > > endup with
> > > > > > > > > fragmented support. What we really want to achieve is having a
> > > > > > > > > driver
> > > > > > > > > interface which works across multiple HW, and I think this is
> > > > > > > > > quite
> > > > > > > > > possible.
> > > > > > > > 
> > > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > > we
> > > > > > > > should just pass the whole
> > > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > > and the parsed list in every scenario.
> > > > > > > 
> > > > > > > What I like of the cut at nal_type, is that there is only format.
> > > > > > > If we
> > > > > > > cut at nal_header, then we need to expose 2 formats. And it makes
> > > > > > > our
> > > > > > > API similar to other accelerator API, so it's easy to "convert"
> > > > > > > existing userspace.
> > > > > > 
> > > > > > Unless we make that cut the single one and only true cut that shall
> > > > > > supersed all other cuts :)
> > > > > 
> > > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > > thing and go back to our first idea.
> > > > 
> > > > Right, in the end I think we should go with:
> > > > V4L2_PIX_FMT_MPEG2_SLICE
> > > > V4L2_PIX_FMT_H264_SLICE
> > > > V4L2_PIX_FMT_HEVC_SLICE
> > > > 
> > > > And just require raw bitstream for the slice with emulation-prevention
> > > > bits included.
> > > 
> > > That's should be the set of format we start with indeed. The single
> > > format for which software gets written and tested, making sure software
> > > support is not fragmented, and other variants should be something to
> > > opt-in.
> > 
> > Cheers for that!
> > 
> > Paul
> > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > > too.
> > > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > > parse
> > > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > > seems
> > > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > > should
> > > > > > > > either make a new control for it or have all the fields in the
> > > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > > bitstream
> > > > > > > > description).
> > > > > > > 
> > > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > > the NAL
> > > > > > > type, it would be cleaner.
> > > > > > 
> > > > > > Right, works for me.
> > > > > 
> > > > > Ack.
> > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Paul
> > > > > > 
> > > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > > 
> > > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > > (DPB) is a
> > > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > > implementation. The
> > > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > > reference decoder
> > > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > > 
> > > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > > short-term and
> > > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > > 
> > > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > > about DPB in
> > > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > > relevant
> > > > > > > > > > information for userspace.
> > > > > > > > > > 
> > > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > > we could
> > > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > > decoder needs
> > > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > > requirement
> > > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > > book-
> > > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > > bitstream
> > > > > > > > > > directly.
> > > > > > > > > 
> > > > > > > > > As discusses today, we still need to pass that list. It's
> > > > > > > > > being index
> > > > > > > > > by the HW to retrieve the extra information we have collected
> > > > > > > > > about the
> > > > > > > > > status of the reference frames. In the case of Hantro, which
> > > > > > > > > process
> > > > > > > > > the modification list from the slice header for us, we also
> > > > > > > > > need that
> > > > > > > > > list to construct the unmodified list.
> > > > > > > > > 
> > > > > > > > > So the problem here is just a naming problem. That list is not
> > > > > > > > > really a
> > > > > > > > > DPB. It is just the list of long-term/short-term references
> > > > > > > > > with the
> > > > > > > > > status of these references. So maybe we could just rename as
> > > > > > > > > references/reference_entry ?
> > > > > > > > 
> > > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > > ffmpeg
> > > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > > too). No
> > > > > > > > functional change here, only that we should stop calling it a
> > > > > > > > DPB,
> > > > > > > > which confuses everyone.
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > > > - Using flags
> > > > > > > > > > 
> > > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > > be
> > > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > > u64 by
> > > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > > some room
> > > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > > good enough).
> > > > > > > > > > 
> > > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > > u64.
> > > > > > > > > 
> > > > > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI
> > > > > > > > > right ?
> > > > > > > > 
> > > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > > bit
> > > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > > need to
> > > > > > > > have 32-bit userspace and compat_ioctl).
> > > > > > > 
> > > > > > > I see, thanks.
> > > > > > > 
> > > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > > 
> > > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > > match as
> > > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > > stick to the
> > > > > > > > > > bitstream element names for those.
> > > > > > > > > > 
> > > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > > 
> > > > > > > > > Oops, I think you meant slice_prams ? decode_params matches
> > > > > > > > > the
> > > > > > > > > information found in SPS/PPS (combined?), while slice_params
> > > > > > > > > matches
> > > > > > > > > the information extracted (and executed in case of l0/l1) from
> > > > > > > > > the
> > > > > > > > > slice headers.
> > > > > > > > 
> > > > > > > > Yes you're right, I mixed them up.
> > > > > > > > 
> > > > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > > > 
> > > > > > > > > it's not just the slice header, and it's per slice.
> > > > > > > > 
> > > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > > in the
> > > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > > actual
> > > > > > > > group of bitstream elements.
> > > > > > > > 
> > > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > > reference lists
> > > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > > 
> > > > > > > > > There reference list is only updated by userspace (through
> > > > > > > > > it's DPB)
> > > > > > > > > base on the result of the last decoding step. I was very
> > > > > > > > > confused for a
> > > > > > > > > moment until I realize that the lists in the slice_header are
> > > > > > > > > just a
> > > > > > > > > list of modification to apply to the reference list in order
> > > > > > > > > to produce
> > > > > > > > > l0 and l1.
> > > > > > > > 
> > > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > > which
> > > > > > > > would fit a slice_header control.
> > > > > > > 
> > > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > > going to
> > > > > > > validate with the VA driver now, to see if the references list
> > > > > > > there is
> > > > > > > usable with our code.
> > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > 
> > > > > > > > Paul
> > > > > > > > 
> > > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > > and updating
> > > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > > 
> > > > > > > > > > What do you think?
> > > > > > > > > > 
> > > > > > > > > > Cheers,
> > > > > > > > > > 
> > > > > > > > > > Paul
> > > > > > > > > > 
> > > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > > [2]:
> > > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> 
> 
> 



* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-18 10:29                   ` Paul Kocialkowski
@ 2019-05-18 14:09                     ` Nicolas Dufresne
  2019-05-22  6:48                       ` Tomasz Figa
  0 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-18 14:09 UTC (permalink / raw)
  To: Paul Kocialkowski, Jernej Škrabec
  Cc: Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Ezequiel Garcia, Jonas Karlman

Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> > Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > > Hi,
> > > 
> > > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit 
> > :
> > > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a 
> > écrit :
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > > have a
> > > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > > than
> > > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > > situation and how
> > > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > > accordingly.
> > > > > > > > > > > 
> > > > > > > > > > > - Per-slice decoding
> > > > > > > > > > > 
> > > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > > patch[1]
> > > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > > good, we
> > > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > > concatenated and
> > > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > > 
> > > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > > it would
> > > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > > decoding the
> > > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > > But there
> > > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > > Not doing it
> > > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > > specify
> > > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > > of the
> > > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > > handling
> > > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > > 
> > > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > > the driver
> > > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > > across
> > > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > > multiplexing in
> > > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > > across
> > > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > > resource
> > > > > > > > > > access scheduling in general.
> > > > > > > > > 
> > > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > > specific
> > > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > > well.
> > > > > > > > > 
> > > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > > first
> > > > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > > > 
> > > > > > > > > > > - Annex-B formats
> > > > > > > > > > > 
> > > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > > pixel formats
> > > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > > that need
> > > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > > is
> > > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > > the encoded
> > > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > > 
> > > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > > (without raw slice
> > > > > > > > > > > header and start code);
> > > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > > raw header
> > > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > > 
> > > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > > context of
> > > > > > > > > > > stateless video decoding.
> > > > > > > > > > > 
> > > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > > in every
> > > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > > regardless
> > > > > > > > > > > of the format if they're lazy.
> > > > > > > > > > 
> > > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > > subject. So a
> > > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > > blocks
> > > > > > > > > > 
> > > > > > > > > > (simplified):
> > > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > > 
> > > > > > > > > Thanks for the details!
> > > > > > > > > 
> > > > > > > > > > nal_header:
> > > > > > > > > > This one is a header used to locate the start and the end of a
> > > > > > > > > > NAL. There are two standard forms, the ANNEX B / start code, a
> > > > > > > > > > sequence
> > > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > > byte
> > > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > > is also
> > > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > > notably used
> > > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > > must keep
> > > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > > from random
> > > > > > > > > > location.
> > > > > > > > > > 
> > > > > > > > > > nal_type:
> > > > > > > > > > It's a bit more than just the type, but it contains at least
> > > > > > > > > > the information of the nal type. This has a different size on
> > > > > > > > > > H.264 and HEVC, but I know its size is in bytes.
> > > > > > > > > > 
> > > > > > > > > > slice_header:
> > > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > > lists to
> > > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > > bytes.
> > > > > > > > > > 
> > > > > > > > > > slice:
> > > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > > data used to
> > > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > > anti-emulation, which
> > > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > > present in
> > > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > > manual they
> > > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > > 
> > > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > > Cedrus
> > > > > > > > > > 
> > > > > > > > > > driver:
> > > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > > 
> > > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > > understand
> > > > > > > > > > that stripping off the slice_header would be hard, since its
> > > > > > > > > > size is in bits. Instead we pass size and header_bit_size in
> > > > > > > > > > slice_params.
> > > > > > > > > 
> > > > > > > > > True, there is that.
> > > > > > > > > 
> > > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > > code. As a
> > > > > > > > > > side effect, the bitstream becomes identical. We do now know that
> > > > > > > > > > it works with the ffmpeg branch implemented for cedrus.
> > > > > > > > > 
> > > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > > the
> > > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > > 
> > > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > > that it
> > > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > > lists found
> > > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > > pass the
> > > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > > the
> > > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > > These list
> > > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > > the
> > > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > > As this
> > > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > > As a
> > > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > > the same
> > > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > > 
> > > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > > from a
> > > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > > as
> > > > > > > > > possible.
> > > > > > > > 
> > > > > > > > For Hantro and RKVDEC, the list of modifications is parsed by the
> > > > > > > > IP from the slice header bits. Just to make sure, because I myself
> > > > > > > > was confused on this before, the slice header does not contain a
> > > > > > > > list of references, instead it contains a list of modifications to
> > > > > > > > be applied to the reference list. I need to check again, but to
> > > > > > > > execute these modifications, you need to filter and sort the
> > > > > > > > references in a specific order. This should be what is defined in
> > > > > > > > the spec as 8.2.4.1 and 8.2.4.2. Then 8.2.4.3 is the process that
> > > > > > > > creates the l0/l1.
> > > > > > > > 
> > > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > > think should be renamed as "references", seems more useful than
> > > > > > > > p/b0/b1, since this is the data that gives us the ability to
> > > > > > > > implement glue in the driver to compensate for some HW differences.
> > > > > > > > 
> > > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > > HW-specific lists (p/b0/b1) from the references rather than adding
> > > > > > > > HW-specific lists in the decode_params structure. The fact that
> > > > > > > > these lists are a standard intermediate step of the spec is not
> > > > > > > > that important.
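
As a rough illustration of the 8.2.4.2 initialization step mentioned
above, here is a minimal sketch for P slices only (short-term references
sorted by descending frame number, then long-term references by
ascending index), ignoring FrameNum wrapping and field handling; the
struct and function names are purely illustrative, not part of any uAPI:

#include <stdlib.h>

/* Illustrative reference descriptor, not an actual control layout. */
struct ref_pic {
    int num;            /* frame_num, or long_term_frame_idx for long-term */
    int is_long_term;
};

static int cmp_p_list0(const void *a, const void *b)
{
    const struct ref_pic *ra = a, *rb = b;

    /* Short-term references come first... */
    if (ra->is_long_term != rb->is_long_term)
        return ra->is_long_term - rb->is_long_term;
    /* ...sorted by descending frame number... */
    if (!ra->is_long_term)
        return rb->num - ra->num;
    /* ...then long-term references by ascending index. */
    return ra->num - rb->num;
}

/* Build the unmodified l0 list for a P slice (8.2.4.2, simplified). */
static void init_p_ref_list0(struct ref_pic *refs, size_t count)
{
    qsort(refs, count, sizeof(*refs), cmp_p_list0);
}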
> > > > > > > 
> > > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > > instead of dpb.
> > > > > > > 
> > > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > > CODEC
> > > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > > IP does,
> > > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > > params,
> > > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > > the
> > > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > > start
> > > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > > between, that
> > > > > > > > > > would be amazingly inconvenient to use, and with the Boyer-Moore
> > > > > > > > > > algorithm,
> > > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > > the
> > > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > > mode, you
> > > > > > > > > > have 1 interrupt per frame.
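
As an illustration of how cheap that CPU-side scan is, a naive (not even
Boyer-Moore) ANNEX-B start-code scanner is only a few lines; this is
just a sketch, not anything from the uAPI:

#include <stddef.h>

/*
 * Return the offset of the next 0x00 0x00 0x01 start code in buf,
 * or -1 if none is found.
 */
static long find_start_code(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 2 < len; i++) {
        if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
            return (long)i;
    }
    return -1;
}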
> > > > > > > > > 
> > > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > > userspace and
> > > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > > per-frame
> > > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > > are
> > > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > > 
> > > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > > with
> > > > > > > > this would require a new format, the one we talked about,
> > > > > > > > SLICE_ANNEX_B.
> > > > > > > 
> > > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > > format. The rationale is that it's always easier to just append a
> > > > > > > start
> > > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > > have the whole thing that will work in every situation.
> > > > > > 
> > > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > > the
> > > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > > a
> > > > > > bit complicated since this code might come from two libraries). In
> > > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > > follows.
> > > > > 
> > > > > Well, I think we have to do a copy from system memory to the buffer
> > > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > > expected not to have any MMU unit and not allow sg import anyway.
> > > > 
> > > > The Rockchip has an mmu. You need one copy at least indeed,
> > > 
> > > Is the MMU in use currently? That can make things troublesome if we run
> > > into a case where the VPU has MMU and deals with scatter-gather while
> > > the display part doesn't. As far as I know, there's no way for
> > > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > > by scatter-gather memory. This feels like a major issue for using dma-
> > > buf, since userspace can't predict whether a buffer exported on one
> > > device can be imported on another when building its pipeline.
> > 
> > FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver 
> > for it yet. It is supported for display, both VPUs and some other devices. I 
> > think no sane SoC designer would leave out one unit or another without IOMMU
> > support, that just calls for trouble, as you pointed out.
> 
> Right right, I've been following that from a distance :)
> 
> Indeed I think it's realistic to expect that for now, but it may not
> play out so well in the long term. For instance, maybe connecting a USB
> display would require CMA when the rest of the system can do with sg.
> 
> I think it would really be useful for userspace to have a way to test
> whether a buffer can be imported from one device to another. It feels
> better than indicating where the memory lives, since there are
> countless cases where additional restrictions apply too.

I don't know about the IOMMU integration on Rockchip, but I did notice
it in the register documentation. In general, the most significant gain
from having an IOMMU for CODECs is that it makes start-up (and re-init)
time much shorter, and also much more predictable. I do believe that
the Venus driver (Qualcomm) is one with solid support for this, and
it's noticeably snappier than the others.

We also faced an interesting issue recently on i.MX6 (there is just no
MMU there). We were playing a stream from the camera, and the framerate
would drop drastically as soon as you plugged in a USB camera (and it
would stay low for quite a while). We found out that Etnaviv does a CMA
allocation per frame; hopefully this won't happen with V4L2 queues. But
on this platform, starting a new stream while plugging in a USB key
could take several seconds.

About the RK3399, work will continue in the next couple of weeks, and
when this is done, we should have a much wider view of this subject.
Hopefully what we learned about H.264 will be useful for HEVC and
eventually AV1, which uses similar stream format methods in terms of
bitstream. AV1 is by far the most complicated CODEC I have read about.

> 
> Cheers,
> 
> Paul
> 
> > Best regards,
> > Jernej
> > 
> > > > e.g. file
> > > > to mem, or UDP socket to mem. But right now, let's say with ffmpeg/mpeg-
> > > > ts: first you need to copy the MPEG TS to mem, then to demux you copy
> > > > that H264 stream to another buffer, you then copy in the parser,
> > > > removing the start-code, and finally copy into the accelerator, adding
> > > > the start code. Even if the driver allowed userptr, it would be unusable.
> > > > 
> > > > GStreamer on the other hand implements lazy conversion, so it would copy
> > > > the mpegts to mem, copy to demux, aggregate (with lazy merging) in the
> > > > parser (but the stream format is negotiated, so it keeps the start-code).
> > > > If you request alignment=au, you have full frames in buffers, so if your
> > > > driver could do userptr, you can save that extra copy.
> > > > 
> > > > Now, if we demux an MP4 it's the same, the parser will need to do a full
> > > > copy instead of lazy aggregation in order to prepend the start code
> > > > (since it had an AVC header). But userptr could save a copy.
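
To make the AVC-to-ANNEX-B conversion being discussed concrete, here is
a minimal sketch, assuming the common 4-byte big-endian NAL length
prefix (the length size actually comes from the avcC box and can be 1,
2 or 4 bytes); the function name is illustrative:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Replace the 4-byte length prefix of one AVC-style NAL with a 4-byte
 * ANNEX-B start code, in place (both are 4 bytes, so no data has to
 * move). Returns the NAL payload size, or 0 if the buffer is too short.
 */
static size_t avc_nal_to_annexb(uint8_t *buf, size_t len)
{
    static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };
    size_t nal_size;

    if (len < 4)
        return 0;

    nal_size = ((size_t)buf[0] << 24) | ((size_t)buf[1] << 16) |
               ((size_t)buf[2] << 8) | buf[3];
    if (nal_size > len - 4)
        return 0;

    memcpy(buf, start_code, sizeof(start_code));
    return nal_size;
}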
> > > > 
> > > > If the driver requires no NAL prefix, then we could just pass a
> > > > slightly forwarded pointer to userptr and avoid the AVC to ANNEX-B
> > > > conversion, which is a bit slower (even though it's nothing compared
> > > > to the full copies we already do).
> > > > 
> > > > That was my argument in favour of no NAL prefix in terms of efficiency,
> > > > and it does not prevent adding a control to enable start-codes for the
> > > > cases where it makes sense.
> > > 
> > > I see, so the internal architecture of userspace software may not be a
> > > good fit for adding these bits and it could hurt performance a bit.
> > > That feels like a significant downside.
> > > 
> > > > > So with that in mind, asking userspace to add a startcode it already
> > > > > knows doesn't seem to be asking too much.
> > > > > 
> > > > > > On the other end, the data_offset thing is likely just a thing for the
> > > > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > > > 
> > > > > Well, I think it's best to be fool-proof here and just require that
> > > > > start code. We should also have per-slice bit offsets to the different
> > > > > parts anyway, so drivers that don't need it can just ignore it.
> > > > > 
> > > > > In extreme cases where there is some interest in doing direct buffer
> > > > > import without doing a copy in userspace, userspace could trick the
> > > > > format and avoid a copy by not providing the start-code (assuming it
> > > > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > > > That'd be a hack for better performance, and it feels better to do
> > > > > things in this order rather than having to hack around in the drivers
> > > > > that need the start code in every other case.
> > > > 
> > > > So basically, you and Tomasz are both strongly in favour of adding the
> > > > ANNEX-B start-code to the current uAPI. I have dug into the Cedrus
> > > > registers, and it seems that it does have start-code scanning support.
> > > > I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> > > > like the RK does. That requires the IP to deal with the modification
> > > > lists, which are per slice.
> > > 
> > > Actually the bitstream parser won't reconfigure the pipeline
> > > configuration registers, it's only around for userspace to avoid
> > > implementing bitstream parsing, but it's a standalone thing.
> > > 
> > > So if we want to do full-frame decoding we always need to reconfigure
> > > our pipeline (or do it like we do currently and just use one of the
> > > per-slice configuration and hope for the best).
> > > 
> > > Do we have more information on the RK3399 and what it requires exactly?
> > > (Just to make sure it's not another issue altogether.)
> > > 
> > > > My question is, are you willing to adapt the Cedrus driver to support
> > > > receiving start-code ? And will this have a performance impact or not ?
> > > > On RK side, it's really just about flipping 1 bit.
> > > > 
> > > > On the Rockchip side, Tomasz had concerns about CPU wakeups and the fact
> > > > that we didn't aim at supporting passing multiple slices at once to the
> > > > IP (something RK supports). It's important to understand that multi-slice
> > > > streams are relatively rare and mostly used for low-latency /
> > > > video conferencing. So aggregating in this case defeats the purpose of
> > > > using slices. So I think the RK feature is not very important.
> > > 
> > > Agreed, let's aim for low-latency as a standard.
> > > 
> > > > Of course, I do believe that long term we will want to expose both
> > > > stream formats on RK (because the HW can do that), so then userspace
> > > > can just pick the best when available. So that boils down to our first
> > > > idea: shall we expose _SLICE_A and _SLICE_B or something like this?
> > > > Now that we have progressed on the matter, I'm quite in favour of
> > > > having _SLICE in the first place, with the preferred format that
> > > > everyone should support, and allow for variants later. Now, if we make
> > > > one mandatory, we could also just have a menu control to allow other
> > > > formats.
> > > 
> > > That seems fairly reasonable to me, and indeed, having one preferred
> > > format at first seems to be a good move.
> > > 
> > > > > > > To me the breaking point was about having the slice header both in
> > > > > > > raw
> > > > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > > > well push it to its logical conclusion and include all the bits that
> > > > > > > can be useful.
> > > > > > 
> > > > > > To take your words, the bits that contain useful information start
> > > > > > from the NAL type byte, exactly where the data was cut by VA-API and
> > > > > > the current uAPI.
> > > > > 
> > > > > Agreed, but I think that the advantages of always requiring the start
> > > > > code outweigh the potential (yet quite unlikely) downsides.
> > > > > 
> > > > > > > > > > But it also support slice mode, with an
> > > > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > > > 
> > > > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > > > 
> > > > > > > > > > So in this case, indeed we strictly require on start-code.
> > > > > > > > > > Though, to
> > > > > > > > > > me this is not a great reason to make a new fourcc, so we will
> > > > > > > > > > try and
> > > > > > > > > > use (data_offset = 3) in order to make some space for that
> > > > > > > > > > start code,
> > > > > > > > > > and write it down in the driver. This is to be continued, we
> > > > > > > > > > will
> > > > > > > > > > report back on this later. This could have some side effect in
> > > > > > > > > > the
> > > > > > > > > > ability to import buffers. But most userspace don't try to do
> > > > > > > > > > zero-copy
> > > > > > > > > > on the encoded size and just copy anyway.
> > > > > > > > > > 
> > > > > > > > > > To my opinion, having a single format is a big deal, since
> > > > > > > > > > userspace
> > > > > > > > > > will generally be developed for one specific HW and we would
> > > > > > > > > > endup with
> > > > > > > > > > fragmented support. What we really want to achieve is having a
> > > > > > > > > > driver
> > > > > > > > > > interface which works across multiple HW, and I think this is
> > > > > > > > > > quite
> > > > > > > > > > possible.
> > > > > > > > > 
> > > > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > > > we
> > > > > > > > > should just pass the whole
> > > > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > > > and the parsed list in every scenario.
> > > > > > > > 
> > > > > > > > What I like about the cut at nal_type is that there is only one
> > > > > > > > format. If we cut at nal_header, then we need to expose 2 formats.
> > > > > > > > And it makes our API similar to other accelerator APIs, so it's
> > > > > > > > easy to "convert" existing userspace.
> > > > > > > 
> > > > > > > Unless we make that cut the single one and only true cut that shall
> > > > > > > supersede all other cuts :)
> > > > > > 
> > > > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > > > thing and go back to our first idea.
> > > > > 
> > > > > Right, in the end I think we should go with:
> > > > > V4L2_PIX_FMT_MPEG2_SLICE
> > > > > V4L2_PIX_FMT_H264_SLICE
> > > > > V4L2_PIX_FMT_HEVC_SLICE
> > > > > 
> > > > > And just require raw bitstream for the slice with emulation-prevention
> > > > > bits included.
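
From the userspace side, picking one of these coded formats would then
be a simple VIDIOC_ENUM_FMT loop on the OUTPUT queue; a minimal sketch,
assuming the *_SLICE pixel formats proposed here are available in the
uAPI headers:

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Return the first parsed-slice coded format exposed by the decoder,
 * or -ENOENT. Error handling is kept minimal on purpose. */
static int pick_slice_format(int fd, __u32 *fourcc)
{
    struct v4l2_fmtdesc desc;
    __u32 i;

    for (i = 0; ; i++) {
        memset(&desc, 0, sizeof(desc));
        desc.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
        desc.index = i;
        if (ioctl(fd, VIDIOC_ENUM_FMT, &desc) < 0)
            return -ENOENT;
        if (desc.pixelformat == V4L2_PIX_FMT_MPEG2_SLICE ||
            desc.pixelformat == V4L2_PIX_FMT_H264_SLICE) {
            *fourcc = desc.pixelformat;
            return 0;
        }
    }
}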
> > > > 
> > > > That should be the set of formats we start with indeed. The single
> > > > format for which software gets written and tested, making sure software
> > > > support is not fragmented, and other variants should be something to
> > > > opt into.
> > > 
> > > Cheers for that!
> > > 
> > > Paul
> > > 
> > > > > Cheers,
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > > > too.
> > > > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > > > parse
> > > > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > > > seems
> > > > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > > > should
> > > > > > > > > either make a new control for it or have all the fields in the
> > > > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > > > bitstream
> > > > > > > > > description).
> > > > > > > > 
> > > > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > > > the NAL
> > > > > > > > type, it would be cleaner.
> > > > > > > 
> > > > > > > Right, works for me.
> > > > > > 
> > > > > > Ack.
> > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > > > 
> > > > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > > > (DPB) is a
> > > > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > > > implementation. The
> > > > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > > > reference decoder
> > > > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > > > 
> > > > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > > > short-term and
> > > > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > > > 
> > > > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > > > about DPB in
> > > > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > > > relevant
> > > > > > > > > > > information for userspace.
> > > > > > > > > > > 
> > > > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > > > we could
> > > > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > > > decoder needs
> > > > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > > > requirement
> > > > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > > > book-
> > > > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > > > bitstream
> > > > > > > > > > > directly.
> > > > > > > > > > 
> > > > > > > > > > As discusses today, we still need to pass that list. It's
> > > > > > > > > > being index
> > > > > > > > > > by the HW to retrieve the extra information we have collected
> > > > > > > > > > about the
> > > > > > > > > > status of the reference frames. In the case of Hantro, which
> > > > > > > > > > process
> > > > > > > > > > the modification list from the slice header for us, we also
> > > > > > > > > > need that
> > > > > > > > > > list to construct the unmodified list.
> > > > > > > > > > 
> > > > > > > > > > So the problem here is just a naming problem. That list is not
> > > > > > > > > > really a
> > > > > > > > > > DPB. It is just the list of long-term/short-term references
> > > > > > > > > > with the
> > > > > > > > > > status of these references. So maybe we could just rename as
> > > > > > > > > > references/reference_entry ?
> > > > > > > > > 
> > > > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > > > ffmpeg
> > > > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > > > too). No
> > > > > > > > > functional change here, only that we should stop calling it a
> > > > > > > > > DPB,
> > > > > > > > > which confuses everyone.
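
A purely hypothetical sketch of what one entry of such a renamed
"references" array could carry (field names are illustrative and not a
uAPI proposal, just to make the renaming idea concrete):

#include <linux/types.h>

/*
 * One short-term or long-term reference, indexed by the hardware; the
 * driver can build whatever HW-specific list (p/b0/b1, ...) it needs
 * from these entries.
 */
struct example_h264_reference {
    __u64 timestamp;            /* matches the capture buffer of the frame */
    __u16 frame_num;
    __s32 top_field_order_cnt;
    __s32 bottom_field_order_cnt;
    __u64 flags;                /* long-term, used-for-reference, ... */
};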
> > > > > > > > 
> > > > > > > > Yes.
> > > > > > > > 
> > > > > > > > > > > - Using flags
> > > > > > > > > > > 
> > > > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > > > be
> > > > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > > > u64 by
> > > > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > > > some room
> > > > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > > > good enough).
> > > > > > > > > > > 
> > > > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > > > u64.
> > > > > > > > > > 
> > > > > > > > > > Makes sense, I guess bits (member : 1) are not allowed in uAPI,
> > > > > > > > > > right?
> > > > > > > > > 
> > > > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > > > bit
> > > > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > > > need to
> > > > > > > > > have 32-bit userspace and compat_ioctl).
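
As a small illustration of the difference (sketch only, using made-up
MPEG-2-style names, not the actual control layout):

#include <linux/types.h>

/* Bitfield version: compact, but 32/64-bit layout and compat_ioctl
 * checking are harder to reason about, as noted above. */
struct example_flags_bitfields {
    __u32 top_field_first : 1;
    __u32 frame_pred_frame_dct : 1;
    __u32 concealment_motion_vectors : 1;
    __u32 reserved : 29;
};

/* Flags-word version: a single __u64 with explicit bit definitions,
 * which keeps the layout identical for 32-bit and 64-bit userspace
 * and leaves plenty of room for future flags. */
#define EXAMPLE_FLAG_TOP_FIELD_FIRST            (1ULL << 0)
#define EXAMPLE_FLAG_FRAME_PRED_FRAME_DCT       (1ULL << 1)
#define EXAMPLE_FLAG_CONCEALMENT_MOTION_VECTORS (1ULL << 2)

struct example_flags_word {
    __u64 flags;
};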
> > > > > > > > 
> > > > > > > > I see, thanks.
> > > > > > > > 
> > > > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > > > 
> > > > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > > > match as
> > > > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > > > stick to the
> > > > > > > > > > > bitstream element names for those.
> > > > > > > > > > > 
> > > > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > > > 
> > > > > > > > > > Oops, I think you meant slice_params? decode_params matches
> > > > > > > > > > the
> > > > > > > > > > information found in SPS/PPS (combined?), while slice_params
> > > > > > > > > > matches
> > > > > > > > > > the information extracted (and executed in case of l0/l1) from
> > > > > > > > > > the
> > > > > > > > > > slice headers.
> > > > > > > > > 
> > > > > > > > > Yes you're right, I mixed them up.
> > > > > > > > > 
> > > > > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > > > > 
> > > > > > > > > > it's not just the slice header, and it's per slice.
> > > > > > > > > 
> > > > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > > > in the
> > > > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > > > actual
> > > > > > > > > group of bitstream elements.
> > > > > > > > > 
> > > > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > > > reference lists
> > > > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > > > 
> > > > > > > > > > There reference list is only updated by userspace (through
> > > > > > > > > > it's DPB)
> > > > > > > > > > base on the result of the last decoding step. I was very
> > > > > > > > > > confused for a
> > > > > > > > > > moment until I realize that the lists in the slice_header are
> > > > > > > > > > just a
> > > > > > > > > > list of modification to apply to the reference list in order
> > > > > > > > > > to produce
> > > > > > > > > > l0 and l1.
> > > > > > > > > 
> > > > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > > > which
> > > > > > > > > would fit a slice_header control.
> > > > > > > > 
> > > > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > > > going to
> > > > > > > > validate with the VA driver now, to see if the references list
> > > > > > > > there is
> > > > > > > > usable with our code.
> > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > 
> > > > > > > > > Paul
> > > > > > > > > 
> > > > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > > > and updating
> > > > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > > > 
> > > > > > > > > > > What do you think?
> > > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > 
> > > > > > > > > > > Paul
> > > > > > > > > > > 
> > > > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > > > [2]:
> > > > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > 
> > 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 17:42   ` Paul Kocialkowski
  2019-05-15 18:54     ` Nicolas Dufresne
@ 2019-05-21 10:27     ` Tomasz Figa
  2019-05-21 11:44       ` Paul Kocialkowski
  2019-05-21 15:43     ` Thierry Reding
  2 siblings, 1 reply; 55+ messages in thread
From: Tomasz Figa @ 2019-05-21 10:27 UTC (permalink / raw)
  To: Paul Kocialkowski, Nicolas Dufresne
  Cc: Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Thierry Reding, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
<paul.kocialkowski@bootlin.com> wrote:
>
> Hi,
>
> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > >
> > > With the Rockchip stateless VPU driver in the works, we now have a
> > > better idea of what the situation is like on platforms other than
> > > Allwinner. This email shares my conclusions about the situation and how
> > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > >
> > > - Per-slice decoding
> > >
> > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > to implement the required core bits. When we agree it looks good, we
> > > should lift the restriction that all slices must be concatenated and
> > > have them submitted as individual requests.
> > >
> > > One question is what to do about other controls. I feel like it would
> > > make sense to always pass all the required controls for decoding the
> > > slice, including the ones that don't change across slices. But there
> > > may be no particular advantage to this and only downsides. Not doing it
> > > and relying on the "control cache" can work, but we need to specify
> > > that only a single stream can be decoded per opened instance of the
> > > v4l2 device. This is the assumption we're going with for handling
> > > multi-slice anyway, so it shouldn't be an issue.
> >
> > My opinion on this is that the m2m instance is a state, and the driver
> > should be responsible of doing time-division multiplexing across
> > multiple m2m instance jobs. Doing the time-division multiplexing in
> > userspace would require some sort of daemon to work properly across
> > processes. I also think the kernel is better place for doing resource
> > access scheduling in general.
>
> I agree with that yes. We always have a single m2m context and specific
> controls per opened device so keeping cached values works out well.
>
> So maybe we shall explicitly require that the request with the first
> slice for a frame also contains the per-frame controls.
>

Agreed.

One more argument not to allow such multiplexing is that despite the
API being called "stateless", there is actually some state saved
between frames, e.g. the Rockchip decoder writes some intermediate
data to some local buffers which need to be given to the decoder to
decode the next frame. Actually, on Rockchip there is even a
requirement to keep the reference list entries in the same order
between frames.
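
To make that ordering requirement concrete, a hypothetical
userspace-side sketch: entries keep their slot until evicted and new
references only fill freed slots, so the order seen by the decoder
stays stable across frames (names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define MAX_REFS 16

struct ref_slot {
    bool     in_use;
    uint64_t timestamp;     /* identifies the capture buffer */
};

/*
 * Add a new reference without disturbing the order of the existing
 * ones: reuse the first free slot instead of compacting the array.
 * Returns the slot index, or -1 if the list is full.
 */
static int ref_list_add(struct ref_slot refs[MAX_REFS], uint64_t ts)
{
    int i;

    for (i = 0; i < MAX_REFS; i++) {
        if (!refs[i].in_use) {
            refs[i].in_use = true;
            refs[i].timestamp = ts;
            return i;
        }
    }
    return -1;
}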

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 10:27     ` Tomasz Figa
@ 2019-05-21 11:44       ` Paul Kocialkowski
  2019-05-21 15:09         ` Thierry Reding
  2019-05-22  6:01         ` Tomasz Figa
  0 siblings, 2 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-21 11:44 UTC (permalink / raw)
  To: Tomasz Figa, Nicolas Dufresne
  Cc: Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Thierry Reding, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

Hi,

On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> <paul.kocialkowski@bootlin.com> wrote:
> > Hi,
> > 
> > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > better idea of what the situation is like on platforms other than
> > > > Allwinner. This email shares my conclusions about the situation and how
> > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > 
> > > > - Per-slice decoding
> > > > 
> > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > to implement the required core bits. When we agree it looks good, we
> > > > should lift the restriction that all slices must be concatenated and
> > > > have them submitted as individual requests.
> > > > 
> > > > One question is what to do about other controls. I feel like it would
> > > > make sense to always pass all the required controls for decoding the
> > > > slice, including the ones that don't change across slices. But there
> > > > may be no particular advantage to this and only downsides. Not doing it
> > > > and relying on the "control cache" can work, but we need to specify
> > > > that only a single stream can be decoded per opened instance of the
> > > > v4l2 device. This is the assumption we're going with for handling
> > > > multi-slice anyway, so it shouldn't be an issue.
> > > 
> > > My opinion on this is that the m2m instance is a state, and the driver
> > > should be responsible of doing time-division multiplexing across
> > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > userspace would require some sort of daemon to work properly across
> > > processes. I also think the kernel is better place for doing resource
> > > access scheduling in general.
> > 
> > I agree with that yes. We always have a single m2m context and specific
> > controls per opened device so keeping cached values works out well.
> > 
> > So maybe we shall explicitly require that the request with the first
> > slice for a frame also contains the per-frame controls.
> > 
> 
> Agreed.
> 
> One more argument not to allow such multiplexing is that despite the
> API being called "stateless", there is actually some state saved
> between frames, e.g. the Rockchip decoder writes some intermediate
> data to some local buffers which need to be given to the decoder to
> decode the next frame. Actually, on Rockchip there is even a
> requirement to keep the reference list entries in the same order
> between frames.

Well, what I'm suggesting is to have one stream per m2m context, but it
should certainly be possible to have multiple m2m contexts (multiple
userspace open calls) that decode different streams concurrently.

Is that really going to be a problem for Rockchip? If so, then the
driver should probably enforce allowing a single userspace open and m2m
context at a time.
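
If a driver did need to enforce that, a minimal sketch of an open()
handler refusing concurrent users could look like this (illustrative
only, not taken from an existing driver):

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/fs.h>

static atomic_t example_busy = ATOMIC_INIT(0);

/* Refuse a second concurrent open of the video node. */
static int example_open(struct file *file)
{
    if (atomic_cmpxchg(&example_busy, 0, 1))
        return -EBUSY;
    /* ... usual v4l2_fh and m2m context setup would go here ... */
    return 0;
}

static int example_release(struct file *file)
{
    /* ... context teardown ... */
    atomic_set(&example_busy, 0);
    return 0;
}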

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 11:44       ` Paul Kocialkowski
@ 2019-05-21 15:09         ` Thierry Reding
  2019-05-21 16:07           ` Nicolas Dufresne
  2019-05-22  6:01         ` Tomasz Figa
  1 sibling, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-21 15:09 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Tomasz Figa, Nicolas Dufresne, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 4027 bytes --]

On Tue, May 21, 2019 at 01:44:50PM +0200, Paul Kocialkowski wrote:
> Hi,
> 
> On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> > <paul.kocialkowski@bootlin.com> wrote:
> > > Hi,
> > > 
> > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > better idea of what the situation is like on platforms other than
> > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > 
> > > > > - Per-slice decoding
> > > > > 
> > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > to implement the required core bits. When we agree it looks good, we
> > > > > should lift the restriction that all slices must be concatenated and
> > > > > have them submitted as individual requests.
> > > > > 
> > > > > One question is what to do about other controls. I feel like it would
> > > > > make sense to always pass all the required controls for decoding the
> > > > > slice, including the ones that don't change across slices. But there
> > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > and relying on the "control cache" can work, but we need to specify
> > > > > that only a single stream can be decoded per opened instance of the
> > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > 
> > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > should be responsible of doing time-division multiplexing across
> > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > userspace would require some sort of daemon to work properly across
> > > > processes. I also think the kernel is better place for doing resource
> > > > access scheduling in general.
> > > 
> > > I agree with that yes. We always have a single m2m context and specific
> > > controls per opened device so keeping cached values works out well.
> > > 
> > > So maybe we shall explicitly require that the request with the first
> > > slice for a frame also contains the per-frame controls.
> > > 
> > 
> > Agreed.
> > 
> > One more argument not to allow such multiplexing is that despite the
> > API being called "stateless", there is actually some state saved
> > between frames, e.g. the Rockchip decoder writes some intermediate
> > data to some local buffers which need to be given to the decoder to
> > decode the next frame. Actually, on Rockchip there is even a
> > requirement to keep the reference list entries in the same order
> > between frames.
> 
> Well, what I'm suggesting is to have one stream per m2m context, but it
> should certainly be possible to have multiple m2m contexts (multiple
> userspace open calls) that decode different streams concurrently.
> 
> Is that really going to be a problem for Rockchip? If so, then the
> driver should probably enforce allowing a single userspace open and m2m
> context at a time.

If you have hardware storing data necessary to the decoding process in
buffers local to the decoder you'd have to have some sort of context
switch operation that backs up the data in those buffers before you
switch to a different context and restore those buffers when you switch
back. We have similar hardware on Tegra, though I'm not exactly familiar
with the details of what is saved and how essential it is. My
understanding is that those internal buffers can be copied to external
RAM or vice versa, but I suspect that this isn't going to be very
efficient. It may very well be that restricting to a single userspace
open is the most sensible option.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 17:42   ` Paul Kocialkowski
  2019-05-15 18:54     ` Nicolas Dufresne
  2019-05-21 10:27     ` Tomasz Figa
@ 2019-05-21 15:43     ` Thierry Reding
  2019-05-21 16:23       ` Nicolas Dufresne
  2 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-21 15:43 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 15824 bytes --]

On Wed, May 15, 2019 at 07:42:50PM +0200, Paul Kocialkowski wrote:
> Hi,
> 
> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > With the Rockchip stateless VPU driver in the works, we now have a
> > > better idea of what the situation is like on platforms other than
> > > Allwinner. This email shares my conclusions about the situation and how
> > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > 
> > > - Per-slice decoding
> > > 
> > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > to implement the required core bits. When we agree it looks good, we
> > > should lift the restriction that all slices must be concatenated and
> > > have them submitted as individual requests.
> > > 
> > > One question is what to do about other controls. I feel like it would
> > > make sense to always pass all the required controls for decoding the
> > > slice, including the ones that don't change across slices. But there
> > > may be no particular advantage to this and only downsides. Not doing it
> > > and relying on the "control cache" can work, but we need to specify
> > > that only a single stream can be decoded per opened instance of the
> > > v4l2 device. This is the assumption we're going with for handling
> > > multi-slice anyway, so it shouldn't be an issue.
> > 
> > My opinion on this is that the m2m instance is a state, and the driver
> > should be responsible of doing time-division multiplexing across
> > multiple m2m instance jobs. Doing the time-division multiplexing in
> > userspace would require some sort of daemon to work properly across
> > processes. I also think the kernel is better place for doing resource
> > access scheduling in general.
> 
> I agree with that yes. We always have a single m2m context and specific
> controls per opened device so keeping cached values works out well.
> 
> So maybe we shall explicitly require that the request with the first
> slice for a frame also contains the per-frame controls.
> 
> > > - Annex-B formats
> > > 
> > > I don't think we have really reached a conclusion on the pixel formats
> > > we want to expose. The main issue is how to deal with codecs that need
> > > the full slice NALU with start code, where the slice_header is
> > > duplicated in raw bitstream, when others are fine with just the encoded
> > > slice data and the parsed slice header control.
> > > 
> > > My initial thinking was that we'd need 3 formats:
> > > - One that only takes only the slice compressed data (without raw slice
> > > header and start code);
> > > - One that takes both the NALU data (including start code, raw header
> > > and compressed data) and slice header controls;
> > > - One that takes the NALU data but no slice header.
> > > 
> > > But I no longer think the latter really makes sense in the context of
> > > stateless video decoding.
> > > 
> > > A side-note: I think we should definitely have data offsets in every
> > > case, so that implementations can just push the whole NALU regardless
> > > of the format if they're lazy.
> > 
> > I realize that I didn't share our latest research on the subject. So a
> > slice in the original bitstream is formed of the following blocks
> > (simplified):
> > 
> >   [nal_header][nal_type][slice_header][slice]
> 
> Thanks for the details!
> 
> > nal_header:
> > This one is a header used to locate the start and the end of a
> > NAL. There are two standard forms, the ANNEX B / start code, a sequence
> > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > would be a leading 0 from the previous NAL padding, but this is also
> > totally valid start code. The second form is the AVC form, notably used
> > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > your buffer aligned to NALs in this case as you cannot scan from random
> > location.
> > 
> > nal_type:
> > It's a bit more than just the type, but it contains at least the
> > information of the nal type. This has a different size on H.264 and HEVC,
> > but I know its size is in bytes.
> > 
> > slice_header:
> > This contains per slice parameters, like the modification lists to
> > apply on the references. This one has a size in bits, not in bytes.
> > 
> > slice:
> > I don't really know what is in it exactly, but this is the data used to
> > decode. This bit has a special coding called the anti-emulation, which
> > prevents a start-code from appearing in it. This coding is present in
> > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > call ANNEX-B the bytestream format).
> > 
> > So, what we notice is that what is currently passed through Cedrus
> > driver:
> >   [nal_type][slice_header][slice]
> > 
> > This matches what is being passed through VA-API. We can understand
> > that stripping off the slice_header would be hard, since its size is
> > in bits. Instead we pass size and header_bit_size in slice_params.
> 
> True, there is that.
> 
> > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > start_code_e, when you turn this off, you don't need start code. As a
> > side effect, the bitstream becomes identical. We do now know that it
> > works with the ffmpeg branch implemented for cedrus.
> 
> Oh great, that makes life easier in the short term, but I guess the
> issue could arise on another decoder sooner or later.
> 
> > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > take care for us of reading and executing the modification lists found
> > in the slice header. Mostly because I very disliked having to pass the
> > p/b0/b1 parameters, is that Boris implemented in the driver the
> > transformation from the DPB entries into this p/b0/b1 list. These list
> > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > following section is the execution of the modification list. As this
> > list is not modified, it only need to be calculated per frame. As a
> > result, we don't need these new lists, and we can work with the same
> > H264_SLICE format as Cedrus is using.
> 
> Yes but I definitely think it makes more sense to pass the list
> modifications rather than reconstructing those in the driver from a
> full list. IMO controls should stick to the bitstream as close as
> possible.
> 
> > Now, this is just a start. For RK3399, we have a different CODEC
> > design. This one does not have the start_code_e bit. What the IP does,
> > is that you give it one or more slice per buffer, setup the params,
> > start decoding, but the decoder then return the location of the
> > following NAL. So basically you could offload the scanning of start
> > code to the HW. That being said, with the driver layer in between, that
> > would be amazingly inconvenient to use, and with the Boyer-Moore algorithm,
> > it is pretty cheap to scan this type of start-code on CPU. But the
> > feature that this allows is to operate in frame mode. In this mode, you
> > have 1 interrupt per frame.
> 
> I'm not sure there is any interest in exposing that from userspace and
> my current feeling is that we should just ditch support for per-frame
> decoding altogether. I think it mixes decoding with notions that are
> higher-level than decoding, but I agree it's a blurry line.

I'm not sure ditching support for per-frame decoding would be a wise
decision. What if some device comes around that only supports frame
decoding and can't handle individual slices?

We have such a situation on Tegra, for example. I think the hardware can
technically decode individual slices, but it can also be set up to do a
lot more and operate in basically a per-frame mode where you just pass
it a buffer containing the complete bitstream for one frame and it'll
just raise an interrupt when it's done decoding.

Per-frame mode is what's currently implemented in the staging driver and
as far as I can tell it's also what's implemented in the downstream
driver, which uses a completely different architecture (it uploads a
firmware that processes a command stream). I have seen registers that
seem to be related to a slice-decoding mode, but honestly I have no idea
how to program them to achieve that.

Now the VDE IP that I'm dealing with is pretty old, but from what I know
of newer IP, they follow a similar command stream architecture as the
downstream VDE driver, so I'm not sure those support per-slice decoding
either. They typically have a firmware that processes command streams
and userspace typically just passes a single bitstream buffer along with
reference frames and gets back the decoded frame. I'd have to
investigate further to understand if slice-level decoding is supported
on the newer hardware.

I'm not familiar with any other decoders, but per-frame decoding doesn't
strike me as a very exotic idea. Excluding such decoders from the ABI
sounds a bit premature.

> > But it also support slice mode, with an
> > interrupt per slice, which is what we decided to use.
> 
> Easier for everyone and probably better for latency as well :)

I'm not sure I understand what's easier about slice-level decoding or
how this would improve latency. If anything, getting fewer interrupts
is good, isn't it?

If we can offload more to hardware, certainly that's something we want
to take advantage of, no?

Thierry

> > So in this case, indeed we strictly require on start-code. Though, to
> > me this is not a great reason to make a new fourcc, so we will try and
> > use (data_offset = 3) in order to make some space for that start code,
> > and write it down in the driver. This is to be continued, we will
> > report back on this later. This could have some side effect in the
> > ability to import buffers. But most userspace don't try to do zero-copy 
> > on the encoded size and just copy anyway.
> > 
> > To my opinion, having a single format is a big deal, since userspace
> > will generally be developed for one specific HW and we would endup with
> > fragmented support. What we really want to achieve is having a driver
> > interface which works across multiple HW, and I think this is quite
> > possible.
> 
> I agree with that. The more I think about it, the more I believe we
> should just pass the whole [nal_header][nal_type][slice_header][slice]
> and the parsed list in every scenario.
> 
> For H.265, our decoder needs some information from the NAL type too.
> We currently extract that in userspace and stick it to the
> slice_header, but maybe it would make more sense to have drivers parse
> that info from the buffer if they need it. On the other hand, it seems
> quite common to pass information from the NAL type, so maybe we should
> either make a new control for it or have all the fields in the
> slice_header (which would still be wrong in terms of matching bitstream
> description).
> 
> > > - Dropping the DPB concept in H.264/H.265
> > > 
> > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > concept that only makes sense relative to a decoder implementation. The
> > > spec mentions how to manage it with the Hypothetical reference decoder
> > > (Annex C), but that's about it.
> > > 
> > > What's really in the bitstream is the list of modified short-term and
> > > long-term references, which is enough for every decoder.
> > > 
> > > For this reason, I strongly believe we should stop talking about DPB in
> > > the controls and just pass these lists agremented with relevant
> > > information for userspace.
> > > 
> > > I think it should be up to the driver to maintain a DPB and we could
> > > have helpers for common cases. For instance, the rockchip decoder needs
> > > to keep unused entries around[2] and cedrus has the same requirement
> > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > keeping in particular and can manage with the lists from the bitstream
> > > directly.
> > 
> > As discusses today, we still need to pass that list. It's being index
> > by the HW to retrieve the extra information we have collected about the
> > status of the reference frames. In the case of Hantro, which process
> > the modification list from the slice header for us, we also need that
> > list to construct the unmodified list.
> > 
> > So the problem here is just a naming problem. That list is not really a
> > DPB. It is just the list of long-term/short-term references with the
> > status of these references. So maybe we could just rename as
> > references/reference_entry ?
> 
> What I'd like to pass is the diff to the references list, as ffmpeg
> currently provides for v4l2 request and vaapi (probably vdpau too). No
> functional change here, only that we should stop calling it a DPB,
> which confuses everyone.
> 
> > > - Using flags
> > > 
> > > The current MPEG-2 controls have lots of u8 values that can be
> > > represented as flags. Using flags also helps with padding.
> > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > default for that sounds fine (we definitely do want to keep some room
> > > available and I don't think using 32 bits as a default is good enough).
> > > 
> > > I think H.264/HEVC per-control flags should also be moved to u64.
> > 
> > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> 
> Mhh, even if they are, it makes it much harder to verify 32/64 bit
> alignment constraints (we're dealing with 64-bit platforms that need to
> have 32-bit userspace and compat_ioctl).
> 
> > > - Clear split of controls and terminology
> > > 
> > > Some codecs have explicit NAL units that are good fits to match as
> > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > bitstream element names for those.
> > > 
> > > For H.264, that would suggest the following changes:
> > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > 
> > Oops, I think you meant slice_prams ? decode_params matches the
> > information found in SPS/PPS (combined?), while slice_params matches
> > the information extracted (and executed in case of l0/l1) from the
> > slice headers.
> 
> Yes you're right, I mixed them up.
> 
> >  That being said, to me this name wasn't confusing, since
> > it's not just the slice header, and it's per slice.
> 
> Mhh, what exactly remains in there and where does it originate in the
> bitstream? Maybe it wouldn't be too bad to have one control per actual
> group of bitstream elements.
> 
> > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > where they belong, which seems to be slice_header;
> > 
> > There reference list is only updated by userspace (through it's DPB)
> > base on the result of the last decoding step. I was very confused for a
> > moment until I realize that the lists in the slice_header are just a
> > list of modification to apply to the reference list in order to produce
> > l0 and l1.
> 
> Indeed, and I'm suggesting that we pass the modifications only, which
> would fit a slice_header control.
> 
> Cheers,
> 
> Paul
> 
> > > I'm up for preparing and submitting these control changes and updating
> > > cedrus if they seem agreeable.
> > > 
> > > What do you think?
> > > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > 
> 


* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 15:09         ` Thierry Reding
@ 2019-05-21 16:07           ` Nicolas Dufresne
  2019-05-22  8:08             ` Thierry Reding
  0 siblings, 1 reply; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-21 16:07 UTC (permalink / raw)
  To: Thierry Reding, Paul Kocialkowski
  Cc: Tomasz Figa, Linux Media Mailing List, Hans Verkuil,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman


Le mardi 21 mai 2019 à 17:09 +0200, Thierry Reding a écrit :
> On Tue, May 21, 2019 at 01:44:50PM +0200, Paul Kocialkowski wrote:
> > Hi,
> > 
> > On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> > > <paul.kocialkowski@bootlin.com> wrote:
> > > > Hi,
> > > > 
> > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > better idea of what the situation is like on platforms other than
> > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > 
> > > > > > - Per-slice decoding
> > > > > > 
> > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > have them submitted as individual requests.
> > > > > > 
> > > > > > One question is what to do about other controls. I feel like it would
> > > > > > make sense to always pass all the required controls for decoding the
> > > > > > slice, including the ones that don't change across slices. But there
> > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > 
> > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > should be responsible of doing time-division multiplexing across
> > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > userspace would require some sort of daemon to work properly across
> > > > > processes. I also think the kernel is better place for doing resource
> > > > > access scheduling in general.
> > > > 
> > > > I agree with that yes. We always have a single m2m context and specific
> > > > controls per opened device so keeping cached values works out well.
> > > > 
> > > > So maybe we shall explicitly require that the request with the first
> > > > slice for a frame also contains the per-frame controls.
> > > > 
> > > 
> > > Agreed.
> > > 
> > > One more argument not to allow such multiplexing is that despite the
> > > API being called "stateless", there is actually some state saved
> > > between frames, e.g. the Rockchip decoder writes some intermediate
> > > data to some local buffers which need to be given to the decoder to
> > > decode the next frame. Actually, on Rockchip there is even a
> > > requirement to keep the reference list entries in the same order
> > > between frames.
> > 
> > Well, what I'm suggesting is to have one stream per m2m context, but it
> > should certainly be possible to have multiple m2m contexts (multiple
> > userspace open calls) that decode different streams concurrently.
> > 
> > Is that really going to be a problem for Rockchip? If so, then the
> > driver should probably enforce allowing a single userspace open and m2m
> > context at a time.
> 
> If you have hardware storing data necessary to the decoding process in
> buffers local to the decoder you'd have to have some sort of context
> switch operation that backs up the data in those buffers before you
> switch to a different context and restore those buffers when you switch
> back. We have similar hardware on Tegra, though I'm not exactly familiar
> with the details of what is saved and how essential it is. My
> understanding is that those internal buffers can be copied to external
> RAM or vice versa, but I suspect that this isn't going to be very
> efficient. It may very well be that restricting to a single userspace
> open is the most sensible option.

That would be by far the worst option for a browser use case, where an
ad might have stolen the single instance you have available in HW. It's
normal that context switching will have some impact on performance, but
in general, most of the time, the other instances will be left idle by
userspace. If there are no context switches, there should be no (or
very little) overhead. Of course, it shouldn't be a hard requirement
for getting a driver into the kernel, I'm not saying that.

p.s. For the IMX8M/Hantro G1, it is specifically stated that the
single-core decoder can handle up to 8 1080p60 streams at the same
time. But there are some buffers being written back by the IP for every
slice (at the end of the decoded reference frames).

> 
> Thierry


* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 15:43     ` Thierry Reding
@ 2019-05-21 16:23       ` Nicolas Dufresne
  2019-05-22  6:39         ` Tomasz Figa
  2019-05-22 10:08         ` Thierry Reding
  0 siblings, 2 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-21 16:23 UTC (permalink / raw)
  To: Thierry Reding, Paul Kocialkowski
  Cc: Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman


Le mardi 21 mai 2019 à 17:43 +0200, Thierry Reding a écrit :
> On Wed, May 15, 2019 at 07:42:50PM +0200, Paul Kocialkowski wrote:
> > Hi,
> > 
> > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > better idea of what the situation is like on platforms other than
> > > > Allwinner. This email shares my conclusions about the situation and how
> > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > 
> > > > - Per-slice decoding
> > > > 
> > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > to implement the required core bits. When we agree it looks good, we
> > > > should lift the restriction that all slices must be concatenated and
> > > > have them submitted as individual requests.
> > > > 
> > > > One question is what to do about other controls. I feel like it would
> > > > make sense to always pass all the required controls for decoding the
> > > > slice, including the ones that don't change across slices. But there
> > > > may be no particular advantage to this and only downsides. Not doing it
> > > > and relying on the "control cache" can work, but we need to specify
> > > > that only a single stream can be decoded per opened instance of the
> > > > v4l2 device. This is the assumption we're going with for handling
> > > > multi-slice anyway, so it shouldn't be an issue.
> > > 
> > > My opinion on this is that the m2m instance is a state, and the driver
> > > should be responsible of doing time-division multiplexing across
> > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > userspace would require some sort of daemon to work properly across
> > > processes. I also think the kernel is better place for doing resource
> > > access scheduling in general.
> > 
> > I agree with that yes. We always have a single m2m context and specific
> > controls per opened device so keeping cached values works out well.
> > 
> > So maybe we shall explicitly require that the request with the first
> > slice for a frame also contains the per-frame controls.
> > 
> > > > - Annex-B formats
> > > > 
> > > > I don't think we have really reached a conclusion on the pixel formats
> > > > we want to expose. The main issue is how to deal with codecs that need
> > > > the full slice NALU with start code, where the slice_header is
> > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > slice data and the parsed slice header control.
> > > > 
> > > > My initial thinking was that we'd need 3 formats:
> > > > - One that only takes only the slice compressed data (without raw slice
> > > > header and start code);
> > > > - One that takes both the NALU data (including start code, raw header
> > > > and compressed data) and slice header controls;
> > > > - One that takes the NALU data but no slice header.
> > > > 
> > > > But I no longer think the latter really makes sense in the context of
> > > > stateless video decoding.
> > > > 
> > > > A side-note: I think we should definitely have data offsets in every
> > > > case, so that implementations can just push the whole NALU regardless
> > > > of the format if they're lazy.
> > > 
> > > I realize that I didn't share our latest research on the subject. So a
> > > slice in the original bitstream is formed of the following blocks
> > > (simplified):
> > > 
> > >   [nal_header][nal_type][slice_header][slice]
> > 
> > Thanks for the details!
> > 
> > > nal_header:
> > > This one is a header used to locate the start and the end of the of a
> > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > would be a leading 0 from the previous NAL padding, but this is also
> > > totally valid start code. The second form is the AVC form, notably used
> > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > your buffer aligned to NALs in this case as you cannot scan from random
> > > location.
> > > 
> > > nal_type:
> > > It's a bit more then just the type, but it contains at least the
> > > information of the nal type. This has different size on H.264 and HEVC
> > > but I know it's size is in bytes.
> > > 
> > > slice_header:
> > > This contains per slice parameters, like the modification lists to
> > > apply on the references. This one has a size in bits, not in bytes.
> > > 
> > > slice:
> > > I don't really know what is in it exactly, but this is the data used to
> > > decode. This bit has a special coding called the anti-emulation, which
> > > prevents a start-code from appearing in it. This coding is present in
> > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > call ANNEX-B the bytestream format).
> > > 
> > > So, what we notice is that what is currently passed through Cedrus
> > > driver:
> > >   [nal_type][slice_header][slice]
> > > 
> > > This matches what is being passed through VA-API. We can understand
> > > that stripping off the slice_header would be hard, since it's size is
> > > in bits. Instead we pass size and header_bit_size in slice_params.
> > 
> > True, there is that.
> > 
> > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > start_code_e, when you turn this off, you don't need start code. As a
> > > side effect, the bitstream becomes identical. We do now know that it
> > > works with the ffmpeg branch implement for cedrus.
> > 
> > Oh great, that makes life easier in the short term, but I guess the
> > issue could arise on another decoder sooner or later.
> > 
> > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > take care for us of reading and executing the modification lists found
> > > in the slice header. Mostly because I very disliked having to pass the
> > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > following section is the execution of the modification list. As this
> > > list is not modified, it only need to be calculated per frame. As a
> > > result, we don't need these new lists, and we can work with the same
> > > H264_SLICE format as Cedrus is using.
> > 
> > Yes but I definitely think it makes more sense to pass the list
> > modifications rather than reconstructing those in the driver from a
> > full list. IMO controls should stick to the bitstream as close as
> > possible.
> > 
> > > Now, this is just a start. For RK3399, we have a different CODEC
> > > design. This one does not have the start_code_e bit. What the IP does,
> > > is that you give it one or more slice per buffer, setup the params,
> > > start decoding, but the decoder then return the location of the
> > > following NAL. So basically you could offload the scanning of start
> > > code to the HW. That being said, with the driver layer in between, that
> > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > feature that this allows is to operate in frame mode. In this mode, you
> > > have 1 interrupt per frame.
> > 
> > I'm not sure there is any interest in exposing that from userspace and
> > my current feeling is that we should just ditch support for per-frame
> > decoding altogether. I think it mixes decoding with notions that are
> > higher-level than decoding, but I agree it's a blurry line.
> 
> I'm not sure ditching support for per-frame decoding would be a wise
> decision. What if some device comes around that only supports frame
> decoding and can't handle individual slices?
> 
> We have such a situation on Tegra, for example. I think the hardware can
> technically decode individual slices, but it can also be set up to do a
> lot more and operate in basically a per-frame mode where you just pass
> it a buffer containing the complete bitstream for one frame and it'll
> just raise an interrupt when it's done decoding.
> 
> Per-frame mode is what's currently implemented in the staging driver and
> as far as I can tell it's also what's implemented in the downstream
> driver, which uses a completely different architecture (it uploads a
> firmware that processes a command stream). I have seen registers that
> seem to be related to a slice-decoding mode, but honestly I have no idea
> how to program them to achieve that.
> 
> Now the VDE IP that I'm dealing with is pretty old, but from what I know
> of newer IP, they follow a similar command stream architecture as the
> downstream VDE driver, so I'm not sure those support per-slice decoding
> either. They typically have a firmware that processes command streams
> and userspace typically just passes a single bitstream buffer along with
> reference frames and gets back the decoded frame. I'd have to
> investigate further to understand if slice-level decoding is supported
> on the newer hardware.
> 
> I'm not familiar with any other decoders, but per-frame decoding doesn't
> strike me as a very exotic idea. Excluding such decoders from the ABI
> sounds a bit premature.

It would be premature to state that we are excluding anything. We are
just trying to find one format to get things upstream, and to make sure
we have a plan for how to extend it. Trying to support everything on
the first try is not going to work so well.

What would be interesting to know is how your IP achieves multi-slice
decoding per frame. That's what we are studying on the RK/Hantro chip.
Typical questions are:

  1. Do all slices have to be contiguous in memory?
  2. If 1., do you place a start code or AVC header, or pass a separate index to let the HW locate the start of each NAL? (A small CPU scanning sketch follows below.)
  3. Does the HW support a single interrupt per frame? (RK3288, as an example, does not, but RK3399 does.)

And other things like this. The more data we have, the better the
initial interface will be.
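
Regarding the start-code part of question 2: whichever layout the
hardware ends up wanting, userspace first has to locate the NAL
boundaries in an Annex-B stream, which is just a scan for the
00 00 01 sequence on the CPU. A naive sketch, for illustration only
(Boyer-Moore or similar would be faster, as mentioned above):

#include <stddef.h>
#include <stdint.h>

/* Return the offset of the next 00 00 01 Annex-B start code at or after
 * 'pos', or 'len' if none is found. Naive linear scan, kept simple on
 * purpose. */
static size_t next_start_code(const uint8_t *buf, size_t len, size_t pos)
{
        while (pos + 3 <= len) {
                if (buf[pos] == 0x00 && buf[pos + 1] == 0x00 &&
                    buf[pos + 2] == 0x01)
                        return pos;
                pos++;
        }
        return len;
}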

> 
> > > But it also support slice mode, with an
> > > interrupt per slice, which is what we decided to use.
> > 
> > Easier for everyone and probably better for latency as well :)
> 
> I'm not sure I understand what's easier about slice-level decoding or
> how this would improve latency. If anything getting less interrupts is
> good, isn't it?
> 
> If we can offload more to hardware, certainly that's something we want
> to take advantage of, no?

In H.264, pretty much all streams have a single slice per frame. That's
because it gives the highest quality. But in live streaming, like for
WebRTC, it's getting more common to actually encode with multiple
slices (a slice is a group of macroblocks, usually in raster order).
Usually it's a very small number of slices, 4, 8, something in this
range.

When a slice is encoded, the encoder will send it out before it starts
the following one; this allows network transfer to happen in parallel
with decoding.

On the receiver, as soon as a slice is available, the decoder can be
started immediately, which allows the reception of buffers and the
decoding of the slices to happen in parallel. You end up with a lot
less delay between the reception of the last slice and having a full
frame ready.

So that's how slices are used to reduce latency. Now, if you are
decoding from a container like ISOMP4, you'll have full frames, so it
makes sense to queue all these frames and let the decoder bundle them
if possible, if the HW allows enabling a mode where you get a single
IRQ per frame. Though, it's pretty rare that you'll find such a file
with slices. What we'd like to figure out is how these cases are
handled. There is nothing that prevents it right now in the uAPI, but
you'd have to copy the input into another buffer, adding the
separators if needed.

What we are trying to achieve in this thread is to find a compromise
that makes the uAPI sane, but also makes decoding efficient, at least
on all the HW we know of.
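
To make that per-slice flow a bit more concrete, here is a rough,
untested sketch of what submitting one slice as one request could look
like from userspace (error handling omitted; the control ID, payload
layout and buffer setup are placeholders, since those are exactly the
details being discussed in this thread):

#include <poll.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Submit a single slice as one request: the per-slice controls plus the
 * OUTPUT buffer that already contains this slice's bitstream data. */
static void decode_slice(int media_fd, int video_fd, unsigned int out_index,
                         void *slice_params, __u32 params_size,
                         __u32 slice_size)
{
        struct v4l2_ext_control ctrl = {
                .id = V4L2_CID_MPEG_VIDEO_H264_SLICE_PARAMS, /* placeholder */
                .size = params_size,
                .ptr = slice_params,
        };
        struct v4l2_ext_controls ctrls = {
                .which = V4L2_CTRL_WHICH_REQUEST_VAL,
                .count = 1,
                .controls = &ctrl,
        };
        struct v4l2_plane plane = {
                .bytesused = slice_size,
        };
        struct v4l2_buffer buf = {
                .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
                .memory = V4L2_MEMORY_MMAP,
                .index = out_index,
                .length = 1,
                .m.planes = &plane,
                .flags = V4L2_BUF_FLAG_REQUEST_FD,
        };
        struct pollfd pfd = { .events = POLLPRI };
        int req_fd;

        /* One request per slice. */
        ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd);

        /* Attach the per-slice controls to the request. */
        ctrls.request_fd = req_fd;
        ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls);

        /* Queue the bitstream buffer against the same request. */
        buf.request_fd = req_fd;
        ioctl(video_fd, VIDIOC_QBUF, &buf);

        /* Queue the request: decoding of this slice can start right away,
         * without waiting for the rest of the frame. */
        ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);

        /* Wait for the request (i.e. this slice) to complete. */
        pfd.fd = req_fd;
        poll(&pfd, 1, -1);

        close(req_fd);
}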

> 
> Thierry
> 
> > > So in this case, indeed we strictly require on start-code. Though, to
> > > me this is not a great reason to make a new fourcc, so we will try and
> > > use (data_offset = 3) in order to make some space for that start code,
> > > and write it down in the driver. This is to be continued, we will
> > > report back on this later. This could have some side effect in the
> > > ability to import buffers. But most userspace don't try to do zero-copy 
> > > on the encoded size and just copy anyway.
> > > 
> > > To my opinion, having a single format is a big deal, since userspace
> > > will generally be developed for one specific HW and we would endup with
> > > fragmented support. What we really want to achieve is having a driver
> > > interface which works across multiple HW, and I think this is quite
> > > possible.
> > 
> > I agree with that. The more I think about it, the more I believe we
> > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > and the parsed list in every scenario.
> > 
> > For H.265, our decoder needs some information from the NAL type too.
> > We currently extract that in userspace and stick it to the
> > slice_header, but maybe it would make more sense to have drivers parse
> > that info from the buffer if they need it. On the other hand, it seems
> > quite common to pass information from the NAL type, so maybe we should
> > either make a new control for it or have all the fields in the
> > slice_header (which would still be wrong in terms of matching bitstream
> > description).
> > 
> > > > - Dropping the DPB concept in H.264/H.265
> > > > 
> > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > concept that only makes sense relative to a decoder implementation. The
> > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > (Annex C), but that's about it.
> > > > 
> > > > What's really in the bitstream is the list of modified short-term and
> > > > long-term references, which is enough for every decoder.
> > > > 
> > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > the controls and just pass these lists agremented with relevant
> > > > information for userspace.
> > > > 
> > > > I think it should be up to the driver to maintain a DPB and we could
> > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > keeping in particular and can manage with the lists from the bitstream
> > > > directly.
> > > 
> > > As discusses today, we still need to pass that list. It's being index
> > > by the HW to retrieve the extra information we have collected about the
> > > status of the reference frames. In the case of Hantro, which process
> > > the modification list from the slice header for us, we also need that
> > > list to construct the unmodified list.
> > > 
> > > So the problem here is just a naming problem. That list is not really a
> > > DPB. It is just the list of long-term/short-term references with the
> > > status of these references. So maybe we could just rename as
> > > references/reference_entry ?
> > 
> > What I'd like to pass is the diff to the references list, as ffmpeg
> > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > functional change here, only that we should stop calling it a DPB,
> > which confuses everyone.
> > 
> > > > - Using flags
> > > > 
> > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > represented as flags. Using flags also helps with padding.
> > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > default for that sounds fine (we definitely do want to keep some room
> > > > available and I don't think using 32 bits as a default is good enough).
> > > > 
> > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > 
> > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > 
> > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > alignment constraints (we're dealing with 64-bit platforms that need to
> > have 32-bit userspace and compat_ioctl).
> > 
> > > > - Clear split of controls and terminology
> > > > 
> > > > Some codecs have explicit NAL units that are good fits to match as
> > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > bitstream element names for those.
> > > > 
> > > > For H.264, that would suggest the following changes:
> > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > 
> > > Oops, I think you meant slice_prams ? decode_params matches the
> > > information found in SPS/PPS (combined?), while slice_params matches
> > > the information extracted (and executed in case of l0/l1) from the
> > > slice headers.
> > 
> > Yes you're right, I mixed them up.
> > 
> > >  That being said, to me this name wasn't confusing, since
> > > it's not just the slice header, and it's per slice.
> > 
> > Mhh, what exactly remains in there and where does it originate in the
> > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > group of bitstream elements.
> > 
> > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > where they belong, which seems to be slice_header;
> > > 
> > > There reference list is only updated by userspace (through it's DPB)
> > > base on the result of the last decoding step. I was very confused for a
> > > moment until I realize that the lists in the slice_header are just a
> > > list of modification to apply to the reference list in order to produce
> > > l0 and l1.
> > 
> > Indeed, and I'm suggesting that we pass the modifications only, which
> > would fit a slice_header control.
> > 
> > Cheers,
> > 
> > Paul
> > 
> > > > I'm up for preparing and submitting these control changes and updating
> > > > cedrus if they seem agreeable.
> > > > 
> > > > What do you think?
> > > > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > 


* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 11:44       ` Paul Kocialkowski
  2019-05-21 15:09         ` Thierry Reding
@ 2019-05-22  6:01         ` Tomasz Figa
  2019-05-22 18:15           ` Nicolas Dufresne
  1 sibling, 1 reply; 55+ messages in thread
From: Tomasz Figa @ 2019-05-22  6:01 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Tue, May 21, 2019 at 8:45 PM Paul Kocialkowski
<paul.kocialkowski@bootlin.com> wrote:
>
> Hi,
>
> On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> > <paul.kocialkowski@bootlin.com> wrote:
> > > Hi,
> > >
> > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > >
> > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > better idea of what the situation is like on platforms other than
> > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > >
> > > > > - Per-slice decoding
> > > > >
> > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > to implement the required core bits. When we agree it looks good, we
> > > > > should lift the restriction that all slices must be concatenated and
> > > > > have them submitted as individual requests.
> > > > >
> > > > > One question is what to do about other controls. I feel like it would
> > > > > make sense to always pass all the required controls for decoding the
> > > > > slice, including the ones that don't change across slices. But there
> > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > and relying on the "control cache" can work, but we need to specify
> > > > > that only a single stream can be decoded per opened instance of the
> > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > multi-slice anyway, so it shouldn't be an issue.
> > > >
> > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > should be responsible of doing time-division multiplexing across
> > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > userspace would require some sort of daemon to work properly across
> > > > processes. I also think the kernel is better place for doing resource
> > > > access scheduling in general.
> > >
> > > I agree with that yes. We always have a single m2m context and specific
> > > controls per opened device so keeping cached values works out well.
> > >
> > > So maybe we shall explicitly require that the request with the first
> > > slice for a frame also contains the per-frame controls.
> > >
> >
> > Agreed.
> >
> > One more argument not to allow such multiplexing is that despite the

^^ Here I meant the "userspace multiplexing".

> > API being called "stateless", there is actually some state saved
> > between frames, e.g. the Rockchip decoder writes some intermediate
> > data to some local buffers which need to be given to the decoder to
> > decode the next frame. Actually, on Rockchip there is even a
> > requirement to keep the reference list entries in the same order
> > between frames.
>
> Well, what I'm suggesting is to have one stream per m2m context, but it
> should certainly be possible to have multiple m2m contexts (multiple
> userspace open calls) that decode different streams concurrently.
>
> Is that really going to be a problem for Rockchip? If so, then the
> driver should probably enforce allowing a single userspace open and m2m
> context at a time.

No, that's not what I meant. Obviously the driver can switch between
different sets of private buffers when scheduling different contexts,
as long as the userspace doesn't attempt to do any multiplexing
itself.
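
Purely to illustrate that point, here is a minimal sketch of what such
per-context private buffers could look like in a decoder driver
(hypothetical "mydec" driver; the structures, register names and buffer
size are all made up):

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/sizes.h>

#define MYDEC_AUX_SIZE          SZ_64K  /* assumption, depends on the IP */
#define MYDEC_REG_AUX_BASE      0x40    /* made-up register offsets */
#define MYDEC_REG_CMD           0x00
#define MYDEC_CMD_RUN           BIT(0)

struct mydec_dev {
        struct device *dev;
        void __iomem *base;
};

/* Each m2m context owns the auxiliary buffer the IP writes its
 * intermediate data back to, so it is never shared across streams. */
struct mydec_ctx {
        struct mydec_dev *dev;
        void *aux_cpu;
        dma_addr_t aux_dma;
};

static int mydec_ctx_init(struct mydec_dev *dev, struct mydec_ctx *ctx)
{
        ctx->dev = dev;
        ctx->aux_cpu = dma_alloc_coherent(dev->dev, MYDEC_AUX_SIZE,
                                          &ctx->aux_dma, GFP_KERNEL);
        return ctx->aux_cpu ? 0 : -ENOMEM;
}

/* .device_run of v4l2_m2m_ops: the "context switch" is implicit, we
 * simply program the buffers of whichever context got scheduled. */
static void mydec_device_run(void *priv)
{
        struct mydec_ctx *ctx = priv;
        struct mydec_dev *dev = ctx->dev;

        writel(lower_32_bits(ctx->aux_dma), dev->base + MYDEC_REG_AUX_BASE);
        /* ... bitstream, reference and capture buffers programmed here ... */
        writel(MYDEC_CMD_RUN, dev->base + MYDEC_REG_CMD);
}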

Best regards,
Tomasz


* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 16:23       ` Nicolas Dufresne
@ 2019-05-22  6:39         ` Tomasz Figa
  2019-05-22  7:29           ` Boris Brezillon
  2019-05-22 10:08         ` Thierry Reding
  1 sibling, 1 reply; 55+ messages in thread
From: Tomasz Figa @ 2019-05-22  6:39 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Thierry Reding, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Wed, May 22, 2019 at 1:23 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le mardi 21 mai 2019 à 17:43 +0200, Thierry Reding a écrit :
> > On Wed, May 15, 2019 at 07:42:50PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > >
> > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > >
> > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > better idea of what the situation is like on platforms other than
> > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > >
> > > > > - Per-slice decoding
> > > > >
> > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > to implement the required core bits. When we agree it looks good, we
> > > > > should lift the restriction that all slices must be concatenated and
> > > > > have them submitted as individual requests.
> > > > >
> > > > > One question is what to do about other controls. I feel like it would
> > > > > make sense to always pass all the required controls for decoding the
> > > > > slice, including the ones that don't change across slices. But there
> > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > and relying on the "control cache" can work, but we need to specify
> > > > > that only a single stream can be decoded per opened instance of the
> > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > multi-slice anyway, so it shouldn't be an issue.
> > > >
> > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > should be responsible of doing time-division multiplexing across
> > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > userspace would require some sort of daemon to work properly across
> > > > processes. I also think the kernel is better place for doing resource
> > > > access scheduling in general.
> > >
> > > I agree with that yes. We always have a single m2m context and specific
> > > controls per opened device so keeping cached values works out well.
> > >
> > > So maybe we shall explicitly require that the request with the first
> > > slice for a frame also contains the per-frame controls.
> > >
> > > > > - Annex-B formats
> > > > >
> > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > the full slice NALU with start code, where the slice_header is
> > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > slice data and the parsed slice header control.
> > > > >
> > > > > My initial thinking was that we'd need 3 formats:
> > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > header and start code);
> > > > > - One that takes both the NALU data (including start code, raw header
> > > > > and compressed data) and slice header controls;
> > > > > - One that takes the NALU data but no slice header.
> > > > >
> > > > > But I no longer think the latter really makes sense in the context of
> > > > > stateless video decoding.
> > > > >
> > > > > A side-note: I think we should definitely have data offsets in every
> > > > > case, so that implementations can just push the whole NALU regardless
> > > > > of the format if they're lazy.
> > > >
> > > > I realize that I didn't share our latest research on the subject. So a
> > > > slice in the original bitstream is formed of the following blocks
> > > > (simplified):
> > > >
> > > >   [nal_header][nal_type][slice_header][slice]
> > >
> > > Thanks for the details!
> > >
> > > > nal_header:
> > > > This one is a header used to locate the start and the end of the of a
> > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > totally valid start code. The second form is the AVC form, notably used
> > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > location.
> > > >
> > > > nal_type:
> > > > It's a bit more then just the type, but it contains at least the
> > > > information of the nal type. This has different size on H.264 and HEVC
> > > > but I know it's size is in bytes.
> > > >
> > > > slice_header:
> > > > This contains per slice parameters, like the modification lists to
> > > > apply on the references. This one has a size in bits, not in bytes.
> > > >
> > > > slice:
> > > > I don't really know what is in it exactly, but this is the data used to
> > > > decode. This bit has a special coding called the anti-emulation, which
> > > > prevents a start-code from appearing in it. This coding is present in
> > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > call ANNEX-B the bytestream format).
> > > >
> > > > So, what we notice is that what is currently passed through Cedrus
> > > > driver:
> > > >   [nal_type][slice_header][slice]
> > > >
> > > > This matches what is being passed through VA-API. We can understand
> > > > that stripping off the slice_header would be hard, since it's size is
> > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > >
> > > True, there is that.
> > >
> > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > side effect, the bitstream becomes identical. We do now know that it
> > > > works with the ffmpeg branch implement for cedrus.
> > >
> > > Oh great, that makes life easier in the short term, but I guess the
> > > issue could arise on another decoder sooner or later.
> > >
> > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > take care for us of reading and executing the modification lists found
> > > > in the slice header. Mostly because I very disliked having to pass the
> > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > following section is the execution of the modification list. As this
> > > > list is not modified, it only need to be calculated per frame. As a
> > > > result, we don't need these new lists, and we can work with the same
> > > > H264_SLICE format as Cedrus is using.
> > >
> > > Yes but I definitely think it makes more sense to pass the list
> > > modifications rather than reconstructing those in the driver from a
> > > full list. IMO controls should stick to the bitstream as close as
> > > possible.
> > >
> > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > is that you give it one or more slice per buffer, setup the params,
> > > > start decoding, but the decoder then return the location of the
> > > > following NAL. So basically you could offload the scanning of start
> > > > code to the HW. That being said, with the driver layer in between, that
> > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > have 1 interrupt per frame.
> > >
> > > I'm not sure there is any interest in exposing that from userspace and
> > > my current feeling is that we should just ditch support for per-frame
> > > decoding altogether. I think it mixes decoding with notions that are
> > > higher-level than decoding, but I agree it's a blurry line.
> >
> > I'm not sure ditching support for per-frame decoding would be a wise
> > decision. What if some device comes around that only supports frame
> > decoding and can't handle individual slices?
> >
> > We have such a situation on Tegra, for example. I think the hardware can
> > technically decode individual slices, but it can also be set up to do a
> > lot more and operate in basically a per-frame mode where you just pass
> > it a buffer containing the complete bitstream for one frame and it'll
> > just raise an interrupt when it's done decoding.
> >
> > Per-frame mode is what's currently implemented in the staging driver and
> > as far as I can tell it's also what's implemented in the downstream
> > driver, which uses a completely different architecture (it uploads a
> > firmware that processes a command stream). I have seen registers that
> > seem to be related to a slice-decoding mode, but honestly I have no idea
> > how to program them to achieve that.
> >
> > Now the VDE IP that I'm dealing with is pretty old, but from what I know
> > of newer IP, they follow a similar command stream architecture as the
> > downstream VDE driver, so I'm not sure those support per-slice decoding
> > either. They typically have a firmware that processes command streams
> > and userspace typically just passes a single bitstream buffer along with
> > reference frames and gets back the decoded frame. I'd have to
> > investigate further to understand if slice-level decoding is supported
> > on the newer hardware.
> >
> > I'm not familiar with any other decoders, but per-frame decoding doesn't
> > strike me as a very exotic idea. Excluding such decoders from the ABI
> > sounds a bit premature.
>
> It would be premature to state that we are excluding. We are just
> trying to find one format to get things upstream, and make sure we have
> a plan how to extend it. Trying to support everything on the first try
> is not going to work so well.
>
> What is interesting to provide is how does you IP achieve multi-slice
> decoding per frame. That's what we are studying on the RK/Hantro chip.
> Typical questions are:
>
>   1. Do all slices have to be contiguous in memory
>   2. If 1., do you place start-code, AVC header or pass a seperate index to let the HW locate the start of each NAL ?
>   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)

AFAICT, the bit about RK3288 isn't true. At least in our downstream
driver that was created mostly by RK themselves, we've been assuming
that the interrupt is for the complete frame, without any problems.

Best regards,
Tomasz

>
> And other things like this. The more data we have, the better the
> initial interface will be.
>
> >
> > > > But it also support slice mode, with an
> > > > interrupt per slice, which is what we decided to use.
> > >
> > > Easier for everyone and probably better for latency as well :)
> >
> > I'm not sure I understand what's easier about slice-level decoding or
> > how this would improve latency. If anything getting less interrupts is
> > good, isn't it?
> >
> > If we can offload more to hardware, certainly that's something we want
> > to take advantage of, no?
>
> In H.264, pretty much all stream have single slice per frame. That's
> because it gives the highest quality. But in live streaming, like for
> webrtc, it's getting more common to actually encode with multiple
> slices (it's group of macroblocks usually in raster order). Usually
> it's a very small amount of slices, 4, 8, something in this range.
>
> When a slice is encoded, the encoder will let it go before it starts
> the following, this allow network transfer to happen in parallel of
> decoding.
>
> On the receiver, as soon as a slice is available, the decoder will be
> started immediately, which allow the receiving of buffer and the
> decoding of the slices to happen in parallel. You end up with a lot
> less delay between the reception of the last slice and having a full
> frame ready.
>
> So that's how slices are used to reduce latency. Now, if you are
> decoding from a container like ISOMP4, you'll have full frame, so it
> make sense to queue all these frame, and le the decoder bundle that if
> possible, if the HW allow to enable mode where you have single IRQ per
> frame. Though, it's pretty rare that you'll find such a file with
> slices. What we'd like to resolve is how these are resolved. There is
> nothing that prevents it right now in the uAPI, but you'd have to copy
> the input into another buffer, adding the separators if needed.
>
> What we are trying to achieve in this thread is to find a compromise
> that makes uAPI sane, but also makes decoding efficient on all the HW
> we know at least.
>
> >
> > Thierry
> >
> > > > So in this case, indeed we strictly require on start-code. Though, to
> > > > me this is not a great reason to make a new fourcc, so we will try and
> > > > use (data_offset = 3) in order to make some space for that start code,
> > > > and write it down in the driver. This is to be continued, we will
> > > > report back on this later. This could have some side effect in the
> > > > ability to import buffers. But most userspace don't try to do zero-copy
> > > > on the encoded size and just copy anyway.
> > > >
> > > > To my opinion, having a single format is a big deal, since userspace
> > > > will generally be developed for one specific HW and we would endup with
> > > > fragmented support. What we really want to achieve is having a driver
> > > > interface which works across multiple HW, and I think this is quite
> > > > possible.
> > >
> > > I agree with that. The more I think about it, the more I believe we
> > > should just pass the whole [nal_header][nal_type][slice_header][slice]
> > > and the parsed list in every scenario.
> > >
> > > For H.265, our decoder needs some information from the NAL type too.
> > > We currently extract that in userspace and stick it to the
> > > slice_header, but maybe it would make more sense to have drivers parse
> > > that info from the buffer if they need it. On the other hand, it seems
> > > quite common to pass information from the NAL type, so maybe we should
> > > either make a new control for it or have all the fields in the
> > > slice_header (which would still be wrong in terms of matching bitstream
> > > description).
> > >
> > > > > - Dropping the DPB concept in H.264/H.265
> > > > >
> > > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > > concept that only makes sense relative to a decoder implementation. The
> > > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > > (Annex C), but that's about it.
> > > > >
> > > > > What's really in the bitstream is the list of modified short-term and
> > > > > long-term references, which is enough for every decoder.
> > > > >
> > > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > > the controls and just pass these lists agremented with relevant
> > > > > information for userspace.
> > > > >
> > > > > I think it should be up to the driver to maintain a DPB and we could
> > > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > > keeping in particular and can manage with the lists from the bitstream
> > > > > directly.
> > > >
> > > > As discusses today, we still need to pass that list. It's being index
> > > > by the HW to retrieve the extra information we have collected about the
> > > > status of the reference frames. In the case of Hantro, which process
> > > > the modification list from the slice header for us, we also need that
> > > > list to construct the unmodified list.
> > > >
> > > > So the problem here is just a naming problem. That list is not really a
> > > > DPB. It is just the list of long-term/short-term references with the
> > > > status of these references. So maybe we could just rename as
> > > > references/reference_entry ?
> > >
> > > What I'd like to pass is the diff to the references list, as ffmpeg
> > > currently provides for v4l2 request and vaapi (probably vdpau too). No
> > > functional change here, only that we should stop calling it a DPB,
> > > which confuses everyone.
> > >
> > > > > - Using flags
> > > > >
> > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > represented as flags. Using flags also helps with padding.
> > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > available and I don't think using 32 bits as a default is good enough).
> > > > >
> > > > > I think H.264/HEVC per-control flags should also be moved to u64.
> > > >
> > > > Make sense, I guess bits (member : 1) are not allowed in uAPI right ?
> > >
> > > Mhh, even if they are, it makes it much harder to verify 32/64 bit
> > > alignment constraints (we're dealing with 64-bit platforms that need to
> > > have 32-bit userspace and compat_ioctl).
> > >
> > > > > - Clear split of controls and terminology
> > > > >
> > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > bitstream element names for those.
> > > > >
> > > > > For H.264, that would suggest the following changes:
> > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > >
> > > > Oops, I think you meant slice_prams ? decode_params matches the
> > > > information found in SPS/PPS (combined?), while slice_params matches
> > > > the information extracted (and executed in case of l0/l1) from the
> > > > slice headers.
> > >
> > > Yes you're right, I mixed them up.
> > >
> > > >  That being said, to me this name wasn't confusing, since
> > > > it's not just the slice header, and it's per slice.
> > >
> > > Mhh, what exactly remains in there and where does it originate in the
> > > bitstream? Maybe it wouldn't be too bad to have one control per actual
> > > group of bitstream elements.
> > >
> > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > where they belong, which seems to be slice_header;
> > > >
> > > > There reference list is only updated by userspace (through it's DPB)
> > > > base on the result of the last decoding step. I was very confused for a
> > > > moment until I realize that the lists in the slice_header are just a
> > > > list of modification to apply to the reference list in order to produce
> > > > l0 and l1.
> > >
> > > Indeed, and I'm suggesting that we pass the modifications only, which
> > > would fit a slice_header control.
> > >
> > > Cheers,
> > >
> > > Paul
> > >
> > > > > I'm up for preparing and submitting these control changes and updating
> > > > > cedrus if they seem agreeable.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Paul
> > > > >
> > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > [2]: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > > > >


* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-18 14:09                     ` Nicolas Dufresne
@ 2019-05-22  6:48                       ` Tomasz Figa
  2019-05-22  8:26                         ` Paul Kocialkowski
  0 siblings, 1 reply; 55+ messages in thread
From: Tomasz Figa @ 2019-05-22  6:48 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, Jernej Škrabec, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Ezequiel Garcia, Jonas Karlman

On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> > Hi,
> >
> > Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> > > Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > > > Hi,
> > > >
> > > > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > >
> > > > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> > > :
> > > > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> > > écrit :
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > > > have a
> > > > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > > > than
> > > > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > > > situation and how
> > > > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > > > accordingly.
> > > > > > > > > > > >
> > > > > > > > > > > > - Per-slice decoding
> > > > > > > > > > > >
> > > > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > > > patch[1]
> > > > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > > > good, we
> > > > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > > > concatenated and
> > > > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > > >
> > > > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > > > it would
> > > > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > > > decoding the
> > > > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > > > But there
> > > > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > > > Not doing it
> > > > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > > > specify
> > > > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > > > of the
> > > > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > > > handling
> > > > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > > >
> > > > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > > > the driver
> > > > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > > > across
> > > > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > > > multiplexing in
> > > > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > > > across
> > > > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > > > resource
> > > > > > > > > > > access scheduling in general.
> > > > > > > > > >
> > > > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > > > specific
> > > > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > > > well.
> > > > > > > > > >
> > > > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > > > first
> > > > > > > > > > slice for a frame also contains the per-frame controls.
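
To make that concrete, here is a minimal userspace sketch of queueing the
first slice of a frame together with the per-frame controls through the
media request API (error handling omitted; the contents of the control
array are placeholders for whatever the final uAPI defines):

#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Queue the first slice of a frame together with its per-frame
 * controls in a single request. */
static void queue_first_slice(int media_fd, int video_fd,
                              struct v4l2_buffer *slice_buf,
                              struct v4l2_ext_control *frame_ctrls,
                              unsigned int nctrls)
{
        int req_fd;

        ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd);

        struct v4l2_ext_controls ext = {
                .which = V4L2_CTRL_WHICH_REQUEST_VAL,
                .request_fd = req_fd,
                .count = nctrls,
                .controls = frame_ctrls,
        };
        ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext);

        slice_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
        slice_buf->request_fd = req_fd;
        ioctl(video_fd, VIDIOC_QBUF, slice_buf);

        ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}
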
> > > > > > > > > >
> > > > > > > > > > > > - Annex-B formats
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > > > pixel formats
> > > > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > > > that need
> > > > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > > > is
> > > > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > > > the encoded
> > > > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > > >
> > > > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > > > (without raw slice
> > > > > > > > > > > > header and start code);
> > > > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > > > raw header
> > > > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > > >
> > > > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > > > context of
> > > > > > > > > > > > stateless video decoding.
> > > > > > > > > > > >
> > > > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > > > in every
> > > > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > > > regardless
> > > > > > > > > > > > of the format if they're lazy.
> > > > > > > > > > >
> > > > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > > > subject. So a
> > > > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > > > blocks
> > > > > > > > > > >
> > > > > > > > > > > (simplified):
> > > > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > > >
> > > > > > > > > > Thanks for the details!
> > > > > > > > > >
> > > > > > > > > > > nal_header:
> > > > > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > > > > the of a
> > > > > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > > > > sequence
> > > > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > > > byte
> > > > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > > > is also
> > > > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > > > notably used
> > > > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > > > must keep
> > > > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > > > from random
> > > > > > > > > > > location.
> > > > > > > > > > >
> > > > > > > > > > > nal_type:
> > > > > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > > > > the
> > > > > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > > > > and HEVC
> > > > > > > > > > > but I know it's size is in bytes.
> > > > > > > > > > >
> > > > > > > > > > > slice_header:
> > > > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > > > lists to
> > > > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > > > bytes.
> > > > > > > > > > >
> > > > > > > > > > > slice:
> > > > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > > > data used to
> > > > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > > > anti-emulation, which
> > > > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > > > present in
> > > > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > > > manual they
> > > > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > > >
> > > > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > > > Cedrus
> > > > > > > > > > >
> > > > > > > > > > > driver:
> > > > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > > >
> > > > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > > > understand
> > > > > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > > > > size is
> > > > > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > > > > slice_params.
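
As an illustration of the framings being discussed and how explicit
offsets could tie them together (field and helper names below are made
up, not existing uAPI):

/*
 * Framing variants for one slice in the OUTPUT buffer:
 *
 *   Annex-B: [00 00 01][nal hdr][slice_header (header_bit_size bits)][slice data]
 *   AVC/MP4: [NAL size][nal hdr][slice_header][slice data]
 *   current:           [nal hdr][slice_header][slice data]
 *
 * A driver given explicit sizes can locate the slice data in any of
 * them. Hypothetical helper, assuming header_bit_size counts only the
 * bits of the slice_header itself:
 */
static unsigned int slice_data_bit_offset(unsigned int prefix_bytes,
                                          unsigned int nal_hdr_bytes,
                                          unsigned int header_bit_size)
{
        /* prefix_bytes: 0 (no prefix), 3/4 (Annex-B) or the AVC size
         * field; nal_hdr_bytes: 1 for H.264, 2 for HEVC. */
        return (prefix_bytes + nal_hdr_bytes) * 8 + header_bit_size;
}
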
> > > > > > > > > >
> > > > > > > > > > True, there is that.
> > > > > > > > > >
> > > > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > > > code. As a
> > > > > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > > > > that it
> > > > > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > > > >
> > > > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > > > the
> > > > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > > >
> > > > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > > > that it
> > > > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > > > lists found
> > > > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > > > pass the
> > > > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > > > the
> > > > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > > > These list
> > > > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > > > the
> > > > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > > > As this
> > > > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > > > As a
> > > > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > > > the same
> > > > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > > >
> > > > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > > > from a
> > > > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > > > as
> > > > > > > > > > possible.
> > > > > > > > >
> > > > > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > > > > IP
> > > > > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > > > > was
> > > > > > > > > confused on this before, the slice header does not contain a list
> > > > > > > > > of
> > > > > > > > > references, instead it contains a list modification to be applied
> > > > > > > > > to
> > > > > > > > > the reference list. I need to check again, but to execute these
> > > > > > > > > modification, you need to filter and sort the references in a
> > > > > > > > > specific
> > > > > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > > > >
> > > > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > > > thinks
> > > > > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > > > > since
> > > > > > > > > this is the data that gives use the ability to implementing glue
> > > > > > > > > in the
> > > > > > > > > driver to compensate some HW differences.
> > > > > > > > >
> > > > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > > > HW
> > > > > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > > > > are
> > > > > > > > > standard intermediate step of the standard is not that important.
> > > > > > > >
> > > > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > > > instead of dpb.
> > > > > > > >
> > > > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > > > CODEC
> > > > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > > > IP does,
> > > > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > > > params,
> > > > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > > > the
> > > > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > > > start
> > > > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > > > between, that
> > > > > > > > > > > would be amazingly inconvenient to use, and with Boyer-more
> > > > > > > > > > > algorithm,
> > > > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > > > the
> > > > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > > > mode, you
> > > > > > > > > > > have 1 interrupt per frame.
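
For scale, a naive CPU-side scan for the Annex-B start code is just a
few lines (a real implementation would use memmem(), Boyer-Moore or
SIMD, but even the naive loop is cheap compared to a full bitstream
copy):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Return the offset of the byte following the next 00 00 01 start
 * code at or after pos, or -1 if none is found. */
static ssize_t next_annexb_nal(const uint8_t *buf, size_t len, size_t pos)
{
        for (; pos + 3 <= len; pos++) {
                if (buf[pos] == 0x00 && buf[pos + 1] == 0x00 &&
                    buf[pos + 2] == 0x01)
                        return pos + 3;
        }
        return -1;
}
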
> > > > > > > > > >
> > > > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > > > userspace and
> > > > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > > > per-frame
> > > > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > > > are
> > > > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > > >
> > > > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > > > with
> > > > > > > > > this would require a new format, the one we talked about,
> > > > > > > > > SLICE_ANNEX_B.
> > > > > > > >
> > > > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > > > format. The rationale is that it's always easier to just append a
> > > > > > > > start
> > > > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > > > have the whole thing that will work in every situation.
> > > > > > >
> > > > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > > > the
> > > > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > > > a
> > > > > > > bit complicated since this code might come from two libraries). In
> > > > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > > > follows.
> > > > > >
> > > > > > Well, I think we have to do a copy from system memory to the buffer
> > > > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > > > expected not to have any MMU unit and not allow sg import anyway.
> > > > >
> > > > > The Rockchip has an mmu. You need one copy at least indeed,
> > > >
> > > > Is the MMU in use currently? That can make things troublesome if we run
> > > > into a case where the VPU has MMU and deals with scatter-gather while
> > > > the display part doesn't. As far as I know, there's no way for
> > > > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > > > by scatter-gather memory. This feels like a major issue for using dma-
> > > > buf, since userspace can't predict whether a buffer exported on one
> > > > device can be imported on another when building its pipeline.
> > >
> > > FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> > > for it yet. It is supported for display, both VPUs and some other devices. I
> > > think no sane SoC designer would leave out one or another unit without IOMMU
> > > support, that just calls for trouble, as you pointed out.
> >
> > Right right, I've been following that from a distance :)
> >
> > Indeed I think it's realistic to expect that for now, but it may not
> > play out so well in the long term. For instance, maybe connecting a USB
> > display would require CMA when the rest of the system can do with sg.
> >
> > I think it would really be useful for userspace to have a way to test
> > whether a buffer can be imported from one device to another. It feels
> > better than indicating where the memory lives, since there are
> > countless cases where additional restrictions apply too.
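
One way userspace can approximate such a test today is to simply try
the import and see whether it sticks; a rough single-planar sketch
(this only catches failures the importing driver reports at attach or
map time, so it is a heuristic rather than a guarantee):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static bool can_import_dmabuf(int video_fd, int dmabuf_fd, __u32 buf_type)
{
        struct v4l2_requestbuffers reqbufs = {
                .count = 1,
                .type = buf_type,
                .memory = V4L2_MEMORY_DMABUF,
        };
        struct v4l2_buffer buf = {
                .index = 0,
                .type = buf_type,
                .memory = V4L2_MEMORY_DMABUF,
                .m.fd = dmabuf_fd,
                /* length left at 0 so the full dma-buf size is used */
        };

        if (ioctl(video_fd, VIDIOC_REQBUFS, &reqbufs) < 0)
                return false;
        return ioctl(video_fd, VIDIOC_PREPARE_BUF, &buf) == 0;
}
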
>
> I don't know for the integration on the Rockchip, but I did notice the
> register documentation for it.

All the important components in the SoC have their IOMMUs as well -
display controller, GPU.

There is a blitter called RGA that is not behind an IOMMU, but has
some scatter-gather capability (with a need for the hardware sg table
to be physically contiguous). That said, the significance of such blitters
is rather low nowadays, as most of the time you need a compositor on
the GPU anyway, which can do any transformation in the same pass as
the composition.

> In general, the most significant gain
> with having an IOMMU for CODECs is that it makes start up (and re-init)
> time much shorter, but also in a much more predictable duration. I do
> believe that the Venus driver (qualcomm) is one with solid support for
> this, and it's quite noticeably more snappy than the others.

Obviously you also get support for USERPTR if you have an IOMMU, but
that also has some costs - you need to pin the user pages and map them
to the IOMMU before each frame, then unmap and unpin them after each
frame, which is sometimes more costly than having userspace copy into
a preallocated and premapped buffer, especially for relatively small
contents such as a compressed bitstream.
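
To illustrate the trade-off, the two ways of feeding the OUTPUT queue
look roughly like this (single-planar, error handling omitted, buffers
assumed to be set up elsewhere):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* (a) USERPTR: zero-copy, but every QBUF pins the user pages and maps
 * them through the IOMMU, and the matching DQBUF undoes it. */
static void queue_userptr(int fd, void *bitstream, size_t size)
{
        struct v4l2_buffer buf = {
                .type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
                .memory = V4L2_MEMORY_USERPTR,
                .index = 0,
                .m.userptr = (unsigned long)bitstream,
                .length = size,
                .bytesused = size,
        };

        ioctl(fd, VIDIOC_QBUF, &buf);
}

/* (b) MMAP: one extra copy, but into a buffer allocated and mapped
 * once at VIDIOC_REQBUFS time, often cheaper for a small compressed
 * slice. */
static void queue_mmap_copy(int fd, void *mmap_ptr, const void *bitstream,
                            size_t size)
{
        struct v4l2_buffer buf = {
                .type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
                .memory = V4L2_MEMORY_MMAP,
                .index = 0,
                .bytesused = size,
        };

        memcpy(mmap_ptr, bitstream, size);
        ioctl(fd, VIDIOC_QBUF, &buf);
}
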

Best regards,
Tomasz

>
> We also faced an interesting issue recently on IMX.6 (there is just no
> mmu there). We were playing a stream from the camera, and the
> framerate would drastically drop as soon as you plugged in a USB camera
> (and it would drop for quite a while). We found out that Etnaviv is doing
> CMA allocation per frame; hopefully this won't happen under V4L2
> queues. But on this platform, starting a new stream while plugging in a
> USB key could take several seconds.
>
> About the RK3399, work will continue in the next couple of weeks, and
> when this is done, we should have a much wider view of this subject.
> Hopefully what we learned about H.264 will be useful for HEVC and
> eventually AV1, which in terms of bitstream uses similar stream format
> methods. AV1 is by far the most complicated CODEC I have read about.
>
> >
> > Cheers,
> >
> > Paul
> >
> > > Best regards,
> > > Jernej
> > >
> > > > > e.g. file
> > > > > to mem, or udpsocket to mem. But right now, let's say with ffmpeg/mpeg-
> > > > > ts, first you need to copy the MPEG TS to mem, then to demux you copy
> > > > > that H264 stream to another buffer, you then copy in the parser,
> > > > > removing the start-code and finally copy in the accelerator, adding the
> > > > > start code. If the driver would allow userptr, it would be unusable.
> > > > >
> > > > > GStreamer on the other side implements lazy conversion, so it would copy
> > > > > the mpegts to mem, copy to demux, aggregate (with lazy merging) in the
> > > > > parser (but the stream format is negotiated, so it keeps the start-code).
> > > > > If you request alignment=au, you have full frames in buffers, so if your
> > > > > driver could do userptr, you can save that extra copy.
> > > > >
> > > > > Now, if we demux an MP4 it's the same, the parser will need to do a full
> > > > > copy instead of lazy aggregation in order to prepend the start code
> > > > > (since it had an AVC header). But userptr could save a copy.
> > > > >
> > > > > If the driver requires no NAL prefix, then we could just pass a
> > > > > slightly forwarded pointer to userptr and avoid AVC to ANNEX-B conversion,
> > > > > which is a bit slower (even though it's nothing compared to the full
> > > > > copies we already do).
> > > > >
> > > > > That was my argument in favour of no NAL prefix in terms of efficiency,
> > > > > and it does not prevent adding a control to enable start-code for cases
> > > > > where it makes sense.
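
For the record, when the AVC length prefix is 4 bytes (the prefix size
comes from the container's avcC box, so that is an assumption to check),
the AVC to Annex-B rewrite can even be done in place, since a 4-byte
size field can be overwritten with a 4-byte start code:

#include <stddef.h>
#include <stdint.h>

static void avc_to_annexb_inplace(uint8_t *buf, size_t len)
{
        size_t pos = 0;

        while (pos + 4 <= len) {
                uint32_t nal_size = ((uint32_t)buf[pos] << 24) |
                                    ((uint32_t)buf[pos + 1] << 16) |
                                    ((uint32_t)buf[pos + 2] << 8) |
                                    buf[pos + 3];

                /* Replace the size field with a 00 00 00 01 start code. */
                buf[pos] = 0x00;
                buf[pos + 1] = 0x00;
                buf[pos + 2] = 0x00;
                buf[pos + 3] = 0x01;

                if (nal_size > len - pos - 4)
                        break;  /* malformed input, stop */
                pos += 4 + nal_size;
        }
}

It still mutates the demuxed buffer, so it does not help the zero-copy
case, but it avoids the extra allocation and memcpy.
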
> > > >
> > > > I see, so the internal architecture of userspace software may not be a
> > > > good fit for adding these bits and it could hurt performance a bit.
> > > > That feels like a significant downside.
> > > >
> > > > > > So with that in mind, asking userspace to add a startcode it already
> > > > > > knows doesn't seem to be asking too much.
> > > > > >
> > > > > > > On the other end, the data_offset thing is likely just a thing for the
> > > > > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > > > >
> > > > > > Well, I think it's best to be fool-proof here and just require that
> > > > > > start code. We should also have per-slice bit offsets to the different
> > > > > > parts anyway, so drivers that don't need it can just ignore it.
> > > > > >
> > > > > > In extreme cases where there is some interest in doing direct buffer
> > > > > > import without doing a copy in userspace, userspace could trick the
> > > > > > format and avoid a copy by not providing the start-code (assuming it
> > > > > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > > > > That'd be a hack for better performance, and it feels better to do
> > > > > > things in this order rather than having to hack around in the drivers
> > > > > > that need the start code in every other case.
> > > > >
> > > > > So basically, you and Tomasz are both strongly in favour of adding
> > > > > the ANNEX-B start-code to the current uAPI. I have dug into the Cedrus
> > > > > registers, and it seems that it does have start-code scanning support.
> > > > > I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> > > > > like the RK does. That requires the IP to deal with the modification
> > > > > lists, which are per slice.
> > > >
> > > > Actually the bitstream parser won't reconfigure the pipeline
> > > > configuration registers, it's only around for userspace to avoid
> > > > implementing bitstream parsing, but it's a standalone thing.
> > > >
> > > > So if we want to do full-frame decoding we always need to reconfigure
> > > > our pipeline (or do it like we do currently and just use one of the
> > > > per-slice configuration and hope for the best).
> > > >
> > > > Do we have more information on the RK3399 and what it requires exactly?
> > > > (Just to make sure it's not another issue altogether.)
> > > >
> > > > > My question is, are you willing to adapt the Cedrus driver to support
> > > > > receiving start-code ? And will this have a performance impact or not ?
> > > > > On RK side, it's really just about flipping 1 bit.
> > > > >
> > > > > On the Rockchip side, Tomasz had concerns about CPU wakeup and the fact
> > > > > that we didn't aim at supporting passing multiple slices at once to the
> > > > > IP (something RK supports). It's important to understand that multi-
> > > > > slice streams are relatively rare and mostly used for low-latency /
> > > > > video conferencing. So aggregating in these cases defeats the purpose of
> > > > > using slices. So I think the RK feature is not very important.
> > > >
> > > > Agreed, let's aim for low-latency as a standard.
> > > >
> > > > > Of course, I do believe that long term we will want to expose both
> > > > > stream formats on RK (because the HW can do that), so then userspace
> > > > > can just pick the best when available. So that boils down to our first
> > > > > idea, shall we expose _SLICE_A and _SLICE_B or something like this ?
> > > > > Now that we have progressed on the matter, I'm quite in favour of
> > > > > having _SLICE in the first place, with the preferred format that
> > > > > everyone should support, and allow for variants later. Now, if we make
> > > > > one mandatory, we could also just have a menu control to allow other
> > > > > formats.
> > > >
> > > > That seems fairly reasonable to me, and indeed, having one preferred
> > > > format at first seems to be a good move.
> > > >
> > > > > > > > To me the breaking point was about having the slice header both in
> > > > > > > > raw
> > > > > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > > > > well push it to its logical conclusion and include all the bits that
> > > > > > > > can be useful.
> > > > > > >
> > > > > > > To take your words, the bits that contain useful information starts
> > > > > > > from the NAL type byte, exactly were the data was cut by VA-API and
> > > > > > > the
> > > > > > > current uAPI.
> > > > > >
> > > > > > Agreed, but I think that the advantages of always requiring the start
> > > > > > code outweigh the potential (yet quite unlikely) downsides.
> > > > > >
> > > > > > > > > > > But it also support slice mode, with an
> > > > > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > > > >
> > > > > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > > > >
> > > > > > > > > > > So in this case, indeed we strictly require on start-code.
> > > > > > > > > > > Though, to
> > > > > > > > > > > me this is not a great reason to make a new fourcc, so we will
> > > > > > > > > > > try and
> > > > > > > > > > > use (data_offset = 3) in order to make some space for that
> > > > > > > > > > > start code,
> > > > > > > > > > > and write it down in the driver. This is to be continued, we
> > > > > > > > > > > will
> > > > > > > > > > > report back on this later. This could have some side effect in
> > > > > > > > > > > the
> > > > > > > > > > > ability to import buffers. But most userspace don't try to do
> > > > > > > > > > > zero-copy
> > > > > > > > > > > on the encoded size and just copy anyway.
> > > > > > > > > > >
> > > > > > > > > > > To my opinion, having a single format is a big deal, since
> > > > > > > > > > > userspace
> > > > > > > > > > > will generally be developed for one specific HW and we would
> > > > > > > > > > > endup with
> > > > > > > > > > > fragmented support. What we really want to achieve is having a
> > > > > > > > > > > driver
> > > > > > > > > > > interface which works across multiple HW, and I think this is
> > > > > > > > > > > quite
> > > > > > > > > > > possible.
> > > > > > > > > >
> > > > > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > > > > we
> > > > > > > > > > should just pass the whole
> > > > > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > and the parsed list in every scenario.
> > > > > > > > >
> > > > > > > > > What I like of the cut at nal_type, is that there is only format.
> > > > > > > > > If we
> > > > > > > > > cut at nal_header, then we need to expose 2 formats. And it makes
> > > > > > > > > our
> > > > > > > > > API similar to other accelerator API, so it's easy to "convert"
> > > > > > > > > existing userspace.
> > > > > > > >
> > > > > > > > Unless we make that cut the single one and only true cut that shall
> > > > > > > > supersede all other cuts :)
> > > > > > >
> > > > > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > > > > thing and go back to our first idea.
> > > > > >
> > > > > > Right, in the end I think we should go with:
> > > > > > V4L2_PIX_FMT_MPEG2_SLICE
> > > > > > V4L2_PIX_FMT_H264_SLICE
> > > > > > V4L2_PIX_FMT_HEVC_SLICE
> > > > > >
> > > > > > And just require raw bitstream for the slice with emulation-prevention
> > > > > > bits included.
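
On the userspace side, selecting the single _SLICE format is then the
usual S_FMT on the OUTPUT queue; a sketch (the sizeimage value is an
arbitrary worst-case guess):

#include <sys/ioctl.h>
#include <linux/videodev2.h>

static void set_coded_format(int video_fd, unsigned int width,
                             unsigned int height)
{
        struct v4l2_format fmt = {
                .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
        };

        fmt.fmt.pix_mp.pixelformat = V4L2_PIX_FMT_H264_SLICE;
        fmt.fmt.pix_mp.width = width;   /* coded size of the stream */
        fmt.fmt.pix_mp.height = height;
        fmt.fmt.pix_mp.num_planes = 1;
        fmt.fmt.pix_mp.plane_fmt[0].sizeimage = 1024 * 1024;

        ioctl(video_fd, VIDIOC_S_FMT, &fmt);
}
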
> > > > >
> > > > > That should be the set of formats we start with indeed. The single
> > > > > format for which software gets written and tested, making sure software
> > > > > support is not fragmented, and other variants should be something to
> > > > > opt-in.
> > > >
> > > > Cheers for that!
> > > >
> > > > Paul
> > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Paul
> > > > > >
> > > > > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > > > > too.
> > > > > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > > > > parse
> > > > > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > > > > seems
> > > > > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > > > > should
> > > > > > > > > > either make a new control for it or have all the fields in the
> > > > > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > > > > bitstream
> > > > > > > > > > description).
> > > > > > > > >
> > > > > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > > > > the NAL
> > > > > > > > > type, it would be cleaner.
> > > > > > > >
> > > > > > > > Right, works for me.
> > > > > > >
> > > > > > > Ack.
> > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Paul
> > > > > > > >
> > > > > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > > > >
> > > > > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > > > > (DPB) is a
> > > > > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > > > > implementation. The
> > > > > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > > > > reference decoder
> > > > > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > > > >
> > > > > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > > > > short-term and
> > > > > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > > > >
> > > > > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > > > > about DPB in
> > > > > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > > > > relevant
> > > > > > > > > > > > information for userspace.
> > > > > > > > > > > >
> > > > > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > > > > we could
> > > > > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > > > > decoder needs
> > > > > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > > > > requirement
> > > > > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > > > > book-
> > > > > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > > > > bitstream
> > > > > > > > > > > > directly.
> > > > > > > > > > >
> > > > > > > > > > > As discusses today, we still need to pass that list. It's
> > > > > > > > > > > being index
> > > > > > > > > > > by the HW to retrieve the extra information we have collected
> > > > > > > > > > > about the
> > > > > > > > > > > status of the reference frames. In the case of Hantro, which
> > > > > > > > > > > process
> > > > > > > > > > > the modification list from the slice header for us, we also
> > > > > > > > > > > need that
> > > > > > > > > > > list to construct the unmodified list.
> > > > > > > > > > >
> > > > > > > > > > > So the problem here is just a naming problem. That list is not
> > > > > > > > > > > really a
> > > > > > > > > > > DPB. It is just the list of long-term/short-term references
> > > > > > > > > > > with the
> > > > > > > > > > > status of these references. So maybe we could just rename as
> > > > > > > > > > > references/reference_entry ?
> > > > > > > > > >
> > > > > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > > > > ffmpeg
> > > > > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > > > > too). No
> > > > > > > > > > functional change here, only that we should stop calling it a
> > > > > > > > > > DPB,
> > > > > > > > > > which confuses everyone.
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > > > > - Using flags
> > > > > > > > > > > >
> > > > > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > > > > be
> > > > > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > > > > u64 by
> > > > > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > > > > some room
> > > > > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > > > > good enough).
> > > > > > > > > > > >
> > > > > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > > > > u64.
> > > > > > > > > > >
> > > > > > > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI
> > > > > > > > > > > right ?
> > > > > > > > > >
> > > > > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > > > > bit
> > > > > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > > > > need to
> > > > > > > > > > have 32-bit userspace and compat_ioctl).
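
To illustrate the point, the flags approach in a uAPI struct would look
roughly like this (the flag names are made up for the example); a
naturally aligned __u64 keeps the layout identical for 32-bit and
64-bit userspace and leaves plenty of room to grow:

#include <linux/types.h>

#define EXAMPLE_SLICE_FLAG_FIELD_PIC            (1ULL << 0)
#define EXAMPLE_SLICE_FLAG_BOTTOM_FIELD         (1ULL << 1)
#define EXAMPLE_SLICE_FLAG_DIRECT_SPATIAL       (1ULL << 2)

struct example_slice_params {
        __u64 flags;            /* combination of EXAMPLE_SLICE_FLAG_* */
        __u32 header_bit_size;
        __u32 reserved;         /* keep the size a multiple of 8 bytes */
};
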
> > > > > > > > >
> > > > > > > > > I see, thanks.
> > > > > > > > >
> > > > > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > > > >
> > > > > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > > > > match as
> > > > > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > > > > stick to the
> > > > > > > > > > > > bitstream element names for those.
> > > > > > > > > > > >
> > > > > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > > > >
> > > > > > > > > > > Oops, I think you meant slice_prams ? decode_params matches
> > > > > > > > > > > the
> > > > > > > > > > > information found in SPS/PPS (combined?), while slice_params
> > > > > > > > > > > matches
> > > > > > > > > > > the information extracted (and executed in case of l0/l1) from
> > > > > > > > > > > the
> > > > > > > > > > > slice headers.
> > > > > > > > > >
> > > > > > > > > > Yes you're right, I mixed them up.
> > > > > > > > > >
> > > > > > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > > > > >
> > > > > > > > > > > it's not just the slice header, and it's per slice.
> > > > > > > > > >
> > > > > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > > > > in the
> > > > > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > > > > actual
> > > > > > > > > > group of bitstream elements.
> > > > > > > > > >
> > > > > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > > > > reference lists
> > > > > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > > > >
> > > > > > > > > > > There reference list is only updated by userspace (through
> > > > > > > > > > > it's DPB)
> > > > > > > > > > > base on the result of the last decoding step. I was very
> > > > > > > > > > > confused for a
> > > > > > > > > > > moment until I realize that the lists in the slice_header are
> > > > > > > > > > > just a
> > > > > > > > > > > list of modification to apply to the reference list in order
> > > > > > > > > > > to produce
> > > > > > > > > > > l0 and l1.
> > > > > > > > > >
> > > > > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > > > > which
> > > > > > > > > > would fit a slice_header control.
> > > > > > > > >
> > > > > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > > > > going to
> > > > > > > > > validate with the VA driver now, to see if the references list
> > > > > > > > > there is
> > > > > > > > > usable with our code.
> > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Paul
> > > > > > > > > >
> > > > > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > > > > and updating
> > > > > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > > > >
> > > > > > > > > > > > What do you think?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >
> > > > > > > > > > > > Paul
> > > > > > > > > > > >
> > > > > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > > > > [2]:
> > > > > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378
> > >
> > >
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  6:39         ` Tomasz Figa
@ 2019-05-22  7:29           ` Boris Brezillon
  2019-05-22  8:20             ` Boris Brezillon
  2019-05-22  8:32             ` Thierry Reding
  0 siblings, 2 replies; 55+ messages in thread
From: Boris Brezillon @ 2019-05-22  7:29 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Nicolas Dufresne, Thierry Reding, Paul Kocialkowski,
	Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Maxime Ripard, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Wed, 22 May 2019 15:39:37 +0900
Tomasz Figa <tfiga@chromium.org> wrote:

> > It would be premature to state that we are excluding. We are just
> > trying to find one format to get things upstream, and make sure we have
> > a plan how to extend it. Trying to support everything on the first try
> > is not going to work so well.
> >
> > What is interesting to provide is how your IP achieves multi-slice
> > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > Typical questions are:
> >
> >   1. Do all slices have to be contiguous in memory
> >   2. If 1., do you place a start-code or an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> >   3. Does the HW support a single interrupt per frame? (RK3288 as an example does not, but RK3399 does)
> 
> AFAICT, the bit about RK3288 isn't true. At least in our downstream
> driver that was created mostly by RK themselves, we've been assuming
> that the interrupt is for the complete frame, without any problems.

I confirm that's what happens when all slices forming a frame are packed
in a single output buffer: you only get one interrupt at the end of the
decoding process (in that case, when the frame is decoded). Of course,
if you split things up and do per-slice decoding instead (one slice per
buffer) you get an interrupt per slice, though I didn't manage to make
that work.
I get a DEC_BUFFER interrupt (AKA "buffer is empty but frame is not
fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
interrupt on the second slice, which makes me think some state is
reset between the two operations, leading the engine to think that the
second slice is part of a new frame.

Anyway, it doesn't sound like a crazy idea to support both per-slice
and per-frame decoding and maybe have a way to expose what a
specific codec can do (through an extra cap mechanism).
The other option would be to support only per-slice decoding with a
mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
only support per-frame decoding know when they should trigger the
decoding operation. The downside is that it implies having a bounce
buffer where the driver can pack slices to be decoded on the END_FRAME
event.
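
A rough driver-side sketch of that bounce-buffer option (all names are
made up, and the frame boundary is whatever mechanism we settle on for
signalling it):

#include <linux/dma-mapping.h>
#include <linux/string.h>

struct example_ctx {
        void            *bounce_cpu;    /* preallocated, device-visible */
        dma_addr_t      bounce_dma;
        size_t          bounce_used;
};

/* Called for every slice request: just pack the payload. */
static void example_append_slice(struct example_ctx *ctx,
                                 const void *slice, size_t len)
{
        memcpy(ctx->bounce_cpu + ctx->bounce_used, slice, len);
        ctx->bounce_used += len;
}

/* Called on the frame boundary: hand the packed frame to the HW
 * (example_hw_*() are placeholders for the hardware programming). */
static void example_end_frame(struct example_ctx *ctx)
{
        example_hw_set_bitstream(ctx, ctx->bounce_dma, ctx->bounce_used);
        example_hw_trigger_decode(ctx);
        ctx->bounce_used = 0;
}
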


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 16:07           ` Nicolas Dufresne
@ 2019-05-22  8:08             ` Thierry Reding
  0 siblings, 0 replies; 55+ messages in thread
From: Thierry Reding @ 2019-05-22  8:08 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, Tomasz Figa, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 6234 bytes --]

On Tue, May 21, 2019 at 12:07:47PM -0400, Nicolas Dufresne wrote:
> Le mardi 21 mai 2019 à 17:09 +0200, Thierry Reding a écrit :
> > On Tue, May 21, 2019 at 01:44:50PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > > 
> > > On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > > > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> > > > <paul.kocialkowski@bootlin.com> wrote:
> > > > > Hi,
> > > > > 
> > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > > better idea of what the situation is like on platforms other than
> > > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > > 
> > > > > > > - Per-slice decoding
> > > > > > > 
> > > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > > have them submitted as individual requests.
> > > > > > > 
> > > > > > > One question is what to do about other controls. I feel like it would
> > > > > > > make sense to always pass all the required controls for decoding the
> > > > > > > slice, including the ones that don't change across slices. But there
> > > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > 
> > > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > > should be responsible of doing time-division multiplexing across
> > > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > > userspace would require some sort of daemon to work properly across
> > > > > > processes. I also think the kernel is better place for doing resource
> > > > > > access scheduling in general.
> > > > > 
> > > > > I agree with that yes. We always have a single m2m context and specific
> > > > > controls per opened device so keeping cached values works out well.
> > > > > 
> > > > > So maybe we shall explicitly require that the request with the first
> > > > > slice for a frame also contains the per-frame controls.
> > > > > 
> > > > 
> > > > Agreed.
> > > > 
> > > > One more argument not to allow such multiplexing is that despite the
> > > > API being called "stateless", there is actually some state saved
> > > > between frames, e.g. the Rockchip decoder writes some intermediate
> > > > data to some local buffers which need to be given to the decoder to
> > > > decode the next frame. Actually, on Rockchip there is even a
> > > > requirement to keep the reference list entries in the same order
> > > > between frames.
> > > 
> > > Well, what I'm suggesting is to have one stream per m2m context, but it
> > > should certainly be possible to have multiple m2m contexts (multiple
> > > userspace open calls) that decode different streams concurrently.
> > > 
> > > Is that really going to be a problem for Rockchip? If so, then the
> > > driver should probably enforce allowing a single userspace open and m2m
> > > context at a time.
> > 
> > If you have hardware storing data necessary to the decoding process in
> > buffers local to the decoder you'd have to have some sort of context
> > switch operation that backs up the data in those buffers before you
> > switch to a different context and restore those buffers when you switch
> > back. We have similar hardware on Tegra, though I'm not exactly familiar
> > with the details of what is saved and how essential it is. My
> > understanding is that those internal buffers can be copied to external
> > RAM or vice versa, but I suspect that this isn't going to be very
> > efficient. It may very well be that restricting to a single userspace
> > open is the most sensible option.
> 
> That would be by far the worst for a browser use case where an ad
> might have stolen that single instance you have available in HW. It's
> normal that context switching will have some impact on performance, but
> in general, most of the time, the other instances will be left idle by
> userspace. If there are no context switches, there should be no (or
> very little) overhead. Of course, it should not be a hard requirement
> to get a driver in the kernel, I'm not saying that.

Sounds like we're in agreement. I didn't mean to imply that all drivers
should be single-open. I was just trying to say that there may be cases
where it's not possible or highly impractical to do a context switch or
multiple ones in a driver.

> p.s. In the IMX8M/Hantro G1 they specifically say that the single core
> decoder can handle up to 8 1080p60 streams at the same time. But there
> are some buffers being written back by the IP for every slice (at the
> end of the decoded reference frames).

I know that there is a similar mechanism on VDE for Tegra where an extra
auxiliary buffer can be defined where extra data is written, though it's
only used for some profiles (H.264 constrained baseline for example does
not seem to require that). I think this has to do with reference picture
marking. It may very well be that the other internal buffers don't
actually need to persist across multiple frame decode operations, which
would of course eliminate any concurrency issues.

Sorry for this being somewhat vague. I've only begun to familiarize
myself with the VDE and I keep getting side-tracked by a bunch of other
things. But I'm trying to pitch in while the discussion is ongoing in
the hope that it will help us come up with the best solution.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  7:29           ` Boris Brezillon
@ 2019-05-22  8:20             ` Boris Brezillon
  2019-05-22 18:18               ` Nicolas Dufresne
  2019-05-22  8:32             ` Thierry Reding
  1 sibling, 1 reply; 55+ messages in thread
From: Boris Brezillon @ 2019-05-22  8:20 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Nicolas Dufresne, Thierry Reding, Paul Kocialkowski,
	Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Maxime Ripard, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Wed, 22 May 2019 09:29:24 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Wed, 22 May 2019 15:39:37 +0900
> Tomasz Figa <tfiga@chromium.org> wrote:
> 
> > > It would be premature to state that we are excluding. We are just
> > > trying to find one format to get things upstream, and make sure we have
> > > a plan how to extend it. Trying to support everything on the first try
> > > is not going to work so well.
> > >
> > > What is interesting to provide is how your IP achieves multi-slice
> > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > Typical questions are:
> > >
> > >   1. Do all slices have to be contiguous in memory
> > >   2. If 1., do you place a start-code or an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > >   3. Does the HW support a single interrupt per frame? (RK3288 as an example does not, but RK3399 does)
> > 
> > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > driver that was created mostly by RK themselves, we've been assuming
> > that the interrupt is for the complete frame, without any problems.  
> 
> I confirm that's what happens when all slices forming a frame are packed
> in a single output buffer: you only get one interrupt at the end of the
> decoding process (in that case, when the frame is decoded). Of course,
> if you split things up and do per-slice decoding instead (one slice per
> buffer) you get an interrupt per slice, though I didn't manage to make
> that work.
> I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> interrupt on the second slice, which makes me think some states are
> reset between the 2 operations leading the engine to think that the
> second slice is part of a new frame.
> 
> Anyway, it doesn't sound like a crazy idea to support both per-slice
> and per-frame decoding and maybe have a way to expose what a
> specific codec can do (through an extra cap mechanism).
> The other option would be to support only per-slice decoding with a
> mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> only support per-frame decoding know when they should trigger the
> decoding operation.

Just to clarify, we can use Hans' V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF
work to identify start/end frame boundaries. The only problem I see is
that users are not required to clear the flag on the last slice of a
frame, so there's no way for the driver to know when it should trigger
the decode-frame operation. I guess we could trigger this decode
operation when v4l2_m2m_release_capture_buf() returns true, but I
wonder if it's not too late to do that.
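
For reference, a rough sketch of how that could look in a driver's
device_run(), assuming Hans' series (helper semantics treated as
approximate, to be confirmed against the final version; the example_*()
names are made up):

#include <media/v4l2-fh.h>
#include <media/v4l2-mem2mem.h>

struct example_ctx {
        struct v4l2_fh fh;      /* fh.m2m_ctx is set up at open() time */
};

static void example_device_run(void *priv)
{
        struct example_ctx *ctx = priv;
        struct vb2_v4l2_buffer *src = v4l2_m2m_next_src_buf(ctx->fh.m2m_ctx);
        struct vb2_v4l2_buffer *dst = v4l2_m2m_next_dst_buf(ctx->fh.m2m_ctx);

        example_decode_slice(ctx, src);

        /* Only kick the actual frame decode once the OUTPUT buffer no
         * longer asks to hold the CAPTURE buffer, i.e. on the last
         * slice of the frame. */
        if (v4l2_m2m_release_capture_buf(src, dst))
                example_trigger_frame_decode(ctx);
}
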

> The downside is that it implies having a bounce
> buffer where the driver can pack slices to be decoded on the END_FRAME
> event.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  6:48                       ` Tomasz Figa
@ 2019-05-22  8:26                         ` Paul Kocialkowski
  2019-05-22 10:42                           ` Thierry Reding
  0 siblings, 1 reply; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-22  8:26 UTC (permalink / raw)
  To: Tomasz Figa, Nicolas Dufresne
  Cc: Jernej Škrabec, Linux Media Mailing List, Hans Verkuil,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Thierry Reding, Ezequiel Garcia, Jonas Karlman

Hi,

Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
> On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> > > > Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > > > > Hi,
> > > > > 
> > > > > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > > > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > > > Hi,
> > > > > > > > > > > 
> > > > > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> > > > :
> > > > > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> > > > écrit :
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > > > > have a
> > > > > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > > > > than
> > > > > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > > > > situation and how
> > > > > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > > > > accordingly.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - Per-slice decoding
> > > > > > > > > > > > > 
> > > > > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > > > > patch[1]
> > > > > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > > > > good, we
> > > > > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > > > > concatenated and
> > > > > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > > > > it would
> > > > > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > > > > decoding the
> > > > > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > > > > But there
> > > > > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > > > > Not doing it
> > > > > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > > > > specify
> > > > > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > > > > of the
> > > > > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > > > > 
> > > > > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > > > > the driver
> > > > > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > > > > across
> > > > > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > > > > multiplexing in
> > > > > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > > > > across
> > > > > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > > > > resource
> > > > > > > > > > > > access scheduling in general.
> > > > > > > > > > > 
> > > > > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > > > > specific
> > > > > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > > > > well.
> > > > > > > > > > > 
> > > > > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > > > > first
> > > > > > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > > > > > 
> > > > > > > > > > > > > - Annex-B formats
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > > > > pixel formats
> > > > > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > > > > that need
> > > > > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > > > > is
> > > > > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > > > > the encoded
> > > > > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > > > > (without raw slice
> > > > > > > > > > > > > header and start code);
> > > > > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > > > > raw header
> > > > > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > > > > context of
> > > > > > > > > > > > > stateless video decoding.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > > > > in every
> > > > > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > > > > regardless
> > > > > > > > > > > > > of the format if they're lazy.
> > > > > > > > > > > > 
> > > > > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > > > > subject. So a
> > > > > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > > > > blocks
> > > > > > > > > > > > 
> > > > > > > > > > > > (simplified):
> > > > > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > > 
> > > > > > > > > > > Thanks for the details!
> > > > > > > > > > > 
> > > > > > > > > > > > nal_header:
> > > > > > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > > > > > the of a
> > > > > > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > > > > > sequence
> > > > > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > > > > byte
> > > > > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > > > > is also
> > > > > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > > > > notably used
> > > > > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > > > > must keep
> > > > > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > > > > from random
> > > > > > > > > > > > location.
> > > > > > > > > > > > 
> > > > > > > > > > > > nal_type:
> > > > > > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > > > > > the
> > > > > > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > > > > > and HEVC
> > > > > > > > > > > > but I know it's size is in bytes.
> > > > > > > > > > > > 
> > > > > > > > > > > > slice_header:
> > > > > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > > > > lists to
> > > > > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > > > > bytes.
> > > > > > > > > > > > 
> > > > > > > > > > > > slice:
> > > > > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > > > > data used to
> > > > > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > > > > anti-emulation, which
> > > > > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > > > > present in
> > > > > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > > > > manual they
> > > > > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > > > > 
> > > > > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > > > > Cedrus
> > > > > > > > > > > > 
> > > > > > > > > > > > driver:
> > > > > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > > > > 
> > > > > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > > > > understand
> > > > > > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > > > > > size is
> > > > > > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > > > > > slice_params.
> > > > > > > > > > > 
> > > > > > > > > > > True, there is that.
> > > > > > > > > > > 
> > > > > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > > > > code. As a
> > > > > > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > > > > > that it
> > > > > > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > > > > > 
> > > > > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > > > > the
> > > > > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > > > > 
> > > > > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > > > > that it
> > > > > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > > > > lists found
> > > > > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > > > > pass the
> > > > > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > > > > the
> > > > > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > > > > These list
> > > > > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > > > > the
> > > > > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > > > > As this
> > > > > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > > > > As a
> > > > > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > > > > the same
> > > > > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > > > > 
> > > > > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > > > > from a
> > > > > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > > > > as
> > > > > > > > > > > possible.
> > > > > > > > > > 
> > > > > > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > > > > > IP
> > > > > > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > > > > > was
> > > > > > > > > > confused on this before, the slice header does not contain a list
> > > > > > > > > > of
> > > > > > > > > > references, instead it contains a list modification to be applied
> > > > > > > > > > to
> > > > > > > > > > the reference list. I need to check again, but to execute these
> > > > > > > > > > modification, you need to filter and sort the references in a
> > > > > > > > > > specific
> > > > > > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > > > > > 
> > > > > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > > > > thinks
> > > > > > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > > > > > since
> > > > > > > > > > this is the data that gives use the ability to implementing glue
> > > > > > > > > > in the
> > > > > > > > > > driver to compensate some HW differences.
> > > > > > > > > > 
> > > > > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > > > > HW
> > > > > > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > > > > > are
> > > > > > > > > > standard intermediate step of the standard is not that important.
> > > > > > > > > 
> > > > > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > > > > instead of dpb.
> > > > > > > > > 
> > > > > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > > > > CODEC
> > > > > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > > > > IP does,
> > > > > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > > > > params,
> > > > > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > > > > the
> > > > > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > > > > start
> > > > > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > > > > between, that
> > > > > > > > > > > > would be amazingly inconvenient to use, and with Boyer-more
> > > > > > > > > > > > algorithm,
> > > > > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > > > > the
> > > > > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > > > > mode, you
> > > > > > > > > > > > have 1 interrupt per frame.
> > > > > > > > > > > 
> > > > > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > > > > userspace and
> > > > > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > > > > per-frame
> > > > > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > > > > are
> > > > > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > > > > 
> > > > > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > > > > with
> > > > > > > > > > this would require a new format, the one we talked about,
> > > > > > > > > > SLICE_ANNEX_B.
> > > > > > > > > 
> > > > > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > > > > format. The rationale is that it's always easier to just append a
> > > > > > > > > start
> > > > > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > > > > have the whole thing that will work in every situation.
> > > > > > > > 
> > > > > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > > > > the
> > > > > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > > > > a
> > > > > > > > bit complicated since this code might come from two libraries). In
> > > > > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > > > > follows.
> > > > > > > 
> > > > > > > Well, I think we have to do a copy from system memory to the buffer
> > > > > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > > > > expected not to have any MMU unit and not allow sg import anyway.
> > > > > > 
> > > > > > The Rockchip has an mmu. You need one copy at least indeed,
> > > > > 
> > > > > Is the MMU in use currently? That can make things troublesome if we run
> > > > > into a case where the VPU has MMU and deals with scatter-gather while
> > > > > the display part doesn't. As far as I know, there's no way for
> > > > > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > > > > by scatter-gather memory. This feels like a major issue for using dma-
> > > > > buf, since userspace can't predict whether a buffer exported on one
> > > > > device can be imported on another when building its pipeline.
> > > > 
> > > > FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> > > > for it yet. It is supported for display, both VPUs and some other devices. I
> > > > think no sane SoC designer would left out one or another unit without IOMMU
> > > > support, that just calls for troubles, as you pointed out.
> > > 
> > > Right right, I've been following that from a distance :)
> > > 
> > > Indeed I think it's realistic to expect that for now, but it may not
> > > play out so well in the long term. For instance, maybe connecting a USB
> > > display would require CMA when the rest of the system can do with sg.
> > > 
> > > I think it would really be useful for userspace to have a way to test
> > > whether a buffer can be imported from one device to another. It feels
> > > better than indicating where the memory lives, since there are
> > > countless cases where additional restrictions apply too.
> > 
> > I don't know for the integration on the Rockchip, but I did notice the
> > register documentation for it.
> 
> All the important components in the SoC have their IOMMUs as well -
> display controller, GPU.
> 
> There is a blitter called RGA that is not behind an IOMMU, but has
> some scatter-gather capability (with a need for the hardware sg table
> to be physically contiguous). 

That's definitely good to know and justifies the need to introduce a way
for userspace to check whether a buffer can be imported from one device
to another.

> That said, significance of such blitters
> nowadays is rather low, as most of the time you need a compositor on
> the GPU anyway, which can do any transformation in the same pass as
> the composition.

I think that is a crucial mistake and, the way I see things, it will
have to change eventually. We cannot keep under-using the fast and
efficient hardware components and reaching for the war machine that is
the GPU in every situation. This has caused enough trouble in the
GNU/Linux userspace display stack already and I strongly believe it has
to stop.

> > In general, the most significant gain
> > with having iommu for CODECs is that it makes start up (and re-init)
> > time much shorter, but also in a much more predictable duration. I do
> > believe that the Venus driver (qualcomm) is one with solid support for
> > this, and it's quite noticably more snappy then the others.
> 
> Obviously you also get support for USERPTR if you have an IOMMU, but
> that also has some costs - you need to pin the user pages and map to
> the IOMMU before each frame and unmap and unpin after each frame,
> which sometimes is more costly than actually having the userspace copy
> to a preallocated and premapped buffer, especially for relatively
> small contents, such as compressed bitstream.

Heh, interesting point!

Cheers,

Paul

> Best regards,
> Tomasz
> 
> > We also faced an interesting issue recently on IMX.6 (there is just no
> > mmu there). We where playing a stream from the camera, and the
> > framerate would drastically drop as soon as you plug a USB camera (and
> > it would drop for quite a while). We found out that Etnaviv is doing
> > cma allocation per frame, hopefully this won't happen under V4L2
> > queues. But on this platform, starting a new stream while pluggin a USB
> > key could take several seconds to start.
> > 
> > About the RK3399, work will continue in the next couple of weeks, and
> > when this is done, we should have a much wider view of this subject.
> > Hopefully what we learned about H.264 will be useful for HEVC and
> > eventually AV1, which in term of bitstream uses similar stream formats
> > method. AV1 is by far the most complicated CODEC I have read about.
> > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > Best regards,
> > > > Jernej
> > > > 
> > > > > > e.g. file
> > > > > > to mem, or udpsocket to mem. But right now, let's say with ffmpeg/mpeg-
> > > > > > ts, first you need to copy the MPEG TS to mem, then to demux you copy
> > > > > > that H264 stream to another buffer, you then copy in the parser,
> > > > > > removing the start-code and finally copy in the accelerator, adding the
> > > > > > start code. If the driver would allow userptr, it would be unusable.
> > > > > > 
> > > > > > GStreamer on the other side implement lazy conversion, so it would copy
> > > > > > the mpegts to mem, copy to demux, aggregate (with lazy merging) in the
> > > > > > parser (but stream format is negotiation, so it keeps the start-code).
> > > > > > If you request alignment=au, you have full frame of buffers, so if your
> > > > > > driver could do userptr, you can same that extra copy.
> > > > > > 
> > > > > > Now, if we demux an MP4 it's the same, the parser will need do a full
> > > > > > copy instead of lazy aggregation in order to prepend the start code
> > > > > > (since it had an AVC header). But userptr could save a copy.
> > > > > > 
> > > > > > If the driver requires no nal prefix, then we could just pass a
> > > > > > slightly forward point to userptr and avoid ACV to ANNEX-B conversion,
> > > > > > which is a bit slower (even know it's nothing compare to the full
> > > > > > copies we already do.
> > > > > > 
> > > > > > That was my argument in favour for no NAL prefix in term of efficiency,
> > > > > > and it does not prevent adding a control to enable start-code for cases
> > > > > > it make sense.
> > > > > 
> > > > > I see, so the internal arcitecture of userspace software may not be a
> > > > > good fit for adding these bits and it could hurt performance a bit.
> > > > > That feels like a significant downside.
> > > > > 
> > > > > > > So with that in mind, asking userspace to add a startcode it already
> > > > > > > knows doesn't seem to be asking too much.
> > > > > > > 
> > > > > > > > On the other end, the data_offset thing is likely just a thing for the
> > > > > > > > RK3399 to handle, it does not affect RK3288, Cedrus or IMX8M.
> > > > > > > 
> > > > > > > Well, I think it's best to be fool-proof here and just require that
> > > > > > > start code. We should also have per-slice bit offsets to the different
> > > > > > > parts anyway, so drivers that don't need it can just ignore it.
> > > > > > > 
> > > > > > > In extreme cases where there is some interest in doing direct buffer
> > > > > > > import without doing a copy in userspace, userspace could trick the
> > > > > > > format and avoid a copy by not providing the start-code (assuming it
> > > > > > > knows it doesn't need it) and specifying the bit offsets accordingly.
> > > > > > > That'd be a hack for better performance, and it feels better to do
> > > > > > > things in this order rather than having to hack around in the drivers
> > > > > > > that need the start code in every other case.
> > > > > > 
> > > > > > So basically, you and Tomas are both strongly in favour of adding
> > > > > > ANNEX-B start-code to the current uAPI. I have digged into Cedrus
> > > > > > registers, and it seems that it does have start-code scanning support.
> > > > > > I'm not sure it can do "full-frame" decoding, 1 interrupt per frame
> > > > > > like the RK do. That requires the IP to deal with the modifications
> > > > > > lists, which are per slices.
> > > > > 
> > > > > Actually the bitstream parser won't reconfigure the pipeline
> > > > > configuration registers, it's only around for userspace to avoid
> > > > > implementing bitstream parsing, but it's a standalone thing.
> > > > > 
> > > > > So if we want to do full-frame decoding we always need to reconfigure
> > > > > our pipeline (or do it like we do currently and just use one of the
> > > > > per-slice configuration and hope for the best).
> > > > > 
> > > > > Do we have more information on the RK3399 and what it requires exactly?
> > > > > (Just to make sure it's not another issue altogether.)
> > > > > 
> > > > > > My question is, are you willing to adapt the Cedrus driver to support
> > > > > > receiving start-code ? And will this have a performance impact or not ?
> > > > > > On RK side, it's really just about flipping 1 bit.
> > > > > > 
> > > > > > On the Rockchip side, Tomas had concern about CPU wakeup and the fact
> > > > > > that we didn't aim at supporting passing multiple slices at once to the
> > > > > > IP (something RK supports). It's important to understand that multi-
> > > > > > slice streams are relatively rare and mostly used for low-latency /
> > > > > > video conferencing. So aggregating in these case defeats the purpose of
> > > > > > using slices. So I think RK feature is not very important.
> > > > > 
> > > > > Agreed, let's aim for low-latency as a standard.
> > > > > 
> > > > > > Of course, I do believe that long term we will want to expose bot
> > > > > > stream formats on RK (because the HW can do that), so then userspace
> > > > > > can just pick the best when available. So that boils down to our first
> > > > > > idea, shall we expose _SLICE_A and _SLICE_B or something like this ?
> > > > > > Now that we have progressed on the matter, I'm quite in favour of
> > > > > > having _SLICE in the first place, with the preferred format that
> > > > > > everyone should support, and allow for variants later. Now, if we make
> > > > > > one mandatory, we could also just have a menu control to allow other
> > > > > > formats.
> > > > > 
> > > > > That seems fairly reasonable to me, and indeed, having one preferred
> > > > > format at first seems to be a good move.
> > > > > 
> > > > > > > > > To me the breaking point was about having the slice header both in
> > > > > > > > > raw
> > > > > > > > > bitstream and parsed forms. Since we agree that's fine, we might as
> > > > > > > > > well push it to its logical conclusion and include all the bits that
> > > > > > > > > can be useful.
> > > > > > > > 
> > > > > > > > To take your words, the bits that contain useful information starts
> > > > > > > > from the NAL type byte, exactly were the data was cut by VA-API and
> > > > > > > > the
> > > > > > > > current uAPI.
> > > > > > > 
> > > > > > > Agreed, but I think that the advantages of always requiring the start
> > > > > > > code outweigh the potential (yet quite unlikely) downsides.
> > > > > > > 
> > > > > > > > > > > > But it also support slice mode, with an
> > > > > > > > > > > > interrupt per slice, which is what we decided to use.
> > > > > > > > > > > 
> > > > > > > > > > > Easier for everyone and probably better for latency as well :)
> > > > > > > > > > > 
> > > > > > > > > > > > So in this case, indeed we strictly require on start-code.
> > > > > > > > > > > > Though, to
> > > > > > > > > > > > me this is not a great reason to make a new fourcc, so we will
> > > > > > > > > > > > try and
> > > > > > > > > > > > use (data_offset = 3) in order to make some space for that
> > > > > > > > > > > > start code,
> > > > > > > > > > > > and write it down in the driver. This is to be continued, we
> > > > > > > > > > > > will
> > > > > > > > > > > > report back on this later. This could have some side effect in
> > > > > > > > > > > > the
> > > > > > > > > > > > ability to import buffers. But most userspace don't try to do
> > > > > > > > > > > > zero-copy
> > > > > > > > > > > > on the encoded size and just copy anyway.
> > > > > > > > > > > > 
> > > > > > > > > > > > To my opinion, having a single format is a big deal, since
> > > > > > > > > > > > userspace
> > > > > > > > > > > > will generally be developed for one specific HW and we would
> > > > > > > > > > > > endup with
> > > > > > > > > > > > fragmented support. What we really want to achieve is having a
> > > > > > > > > > > > driver
> > > > > > > > > > > > interface which works across multiple HW, and I think this is
> > > > > > > > > > > > quite
> > > > > > > > > > > > possible.
> > > > > > > > > > > 
> > > > > > > > > > > I agree with that. The more I think about it, the more I believe
> > > > > > > > > > > we
> > > > > > > > > > > should just pass the whole
> > > > > > > > > > > [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > > and the parsed list in every scenario.
> > > > > > > > > > 
> > > > > > > > > > What I like of the cut at nal_type, is that there is only format.
> > > > > > > > > > If we
> > > > > > > > > > cut at nal_header, then we need to expose 2 formats. And it makes
> > > > > > > > > > our
> > > > > > > > > > API similar to other accelerator API, so it's easy to "convert"
> > > > > > > > > > existing userspace.
> > > > > > > > > 
> > > > > > > > > Unless we make that cut the single one and only true cut that shall
> > > > > > > > > supersed all other cuts :)
> > > > > > > > 
> > > > > > > > That's basically what I've been trying to do, kill this _RAW/ANNEX_B
> > > > > > > > thing and go back to our first idea.
> > > > > > > 
> > > > > > > Right, in the end I think we should go with:
> > > > > > > V4L2_PIX_FMT_MPEG2_SLICE
> > > > > > > V4L2_PIX_FMT_H264_SLICE
> > > > > > > V4L2_PIX_FMT_HEVC_SLICE
> > > > > > > 
> > > > > > > And just require raw bitstream for the slice with emulation-prevention
> > > > > > > bits included.
> > > > > > 
> > > > > > That's should be the set of format we start with indeed. The single
> > > > > > format for which software gets written and tested, making sure software
> > > > > > support is not fragmented, and other variants should be something to
> > > > > > opt-in.
> > > > > 
> > > > > Cheers for that!
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > > > > > For H.265, our decoder needs some information from the NAL type
> > > > > > > > > > > too.
> > > > > > > > > > > We currently extract that in userspace and stick it to the
> > > > > > > > > > > slice_header, but maybe it would make more sense to have drivers
> > > > > > > > > > > parse
> > > > > > > > > > > that info from the buffer if they need it. On the other hand, it
> > > > > > > > > > > seems
> > > > > > > > > > > quite common to pass information from the NAL type, so maybe we
> > > > > > > > > > > should
> > > > > > > > > > > either make a new control for it or have all the fields in the
> > > > > > > > > > > slice_header (which would still be wrong in terms of matching
> > > > > > > > > > > bitstream
> > > > > > > > > > > description).
> > > > > > > > > > 
> > > > > > > > > > Even in userspace, it's common to just parse this in place, it's a
> > > > > > > > > > simple mask. But yes, if we don't have it yet, we should expose
> > > > > > > > > > the NAL
> > > > > > > > > > type, it would be cleaner.
> > > > > > > > > 
> > > > > > > > > Right, works for me.
> > > > > > > > 
> > > > > > > > Ack.
> > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > 
> > > > > > > > > Paul
> > > > > > > > > 
> > > > > > > > > > > > > - Dropping the DPB concept in H.264/H.265
> > > > > > > > > > > > > 
> > > > > > > > > > > > > As far as I could understand, the decoded picture buffer
> > > > > > > > > > > > > (DPB) is a
> > > > > > > > > > > > > concept that only makes sense relative to a decoder
> > > > > > > > > > > > > implementation. The
> > > > > > > > > > > > > spec mentions how to manage it with the Hypothetical
> > > > > > > > > > > > > reference decoder
> > > > > > > > > > > > > (Annex C), but that's about it.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What's really in the bitstream is the list of modified
> > > > > > > > > > > > > short-term and
> > > > > > > > > > > > > long-term references, which is enough for every decoder.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For this reason, I strongly believe we should stop talking
> > > > > > > > > > > > > about DPB in
> > > > > > > > > > > > > the controls and just pass these lists agremented with
> > > > > > > > > > > > > relevant
> > > > > > > > > > > > > information for userspace.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think it should be up to the driver to maintain a DPB and
> > > > > > > > > > > > > we could
> > > > > > > > > > > > > have helpers for common cases. For instance, the rockchip
> > > > > > > > > > > > > decoder needs
> > > > > > > > > > > > > to keep unused entries around[2] and cedrus has the same
> > > > > > > > > > > > > requirement
> > > > > > > > > > > > > for H.264. However for cedrus/H.265, we don't need to do any
> > > > > > > > > > > > > book-
> > > > > > > > > > > > > keeping in particular and can manage with the lists from the
> > > > > > > > > > > > > bitstream
> > > > > > > > > > > > > directly.
> > > > > > > > > > > > 
> > > > > > > > > > > > As discusses today, we still need to pass that list. It's
> > > > > > > > > > > > being index
> > > > > > > > > > > > by the HW to retrieve the extra information we have collected
> > > > > > > > > > > > about the
> > > > > > > > > > > > status of the reference frames. In the case of Hantro, which
> > > > > > > > > > > > process
> > > > > > > > > > > > the modification list from the slice header for us, we also
> > > > > > > > > > > > need that
> > > > > > > > > > > > list to construct the unmodified list.
> > > > > > > > > > > > 
> > > > > > > > > > > > So the problem here is just a naming problem. That list is not
> > > > > > > > > > > > really a
> > > > > > > > > > > > DPB. It is just the list of long-term/short-term references
> > > > > > > > > > > > with the
> > > > > > > > > > > > status of these references. So maybe we could just rename as
> > > > > > > > > > > > references/reference_entry ?
> > > > > > > > > > > 
> > > > > > > > > > > What I'd like to pass is the diff to the references list, as
> > > > > > > > > > > ffmpeg
> > > > > > > > > > > currently provides for v4l2 request and vaapi (probably vdpau
> > > > > > > > > > > too). No
> > > > > > > > > > > functional change here, only that we should stop calling it a
> > > > > > > > > > > DPB,
> > > > > > > > > > > which confuses everyone.
> > > > > > > > > > 
> > > > > > > > > > Yes.
> > > > > > > > > > 
> > > > > > > > > > > > > - Using flags
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The current MPEG-2 controls have lots of u8 values that can
> > > > > > > > > > > > > be
> > > > > > > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > > > > > > It's unlikely that we'll get more than 64 flags, so using a
> > > > > > > > > > > > > u64 by
> > > > > > > > > > > > > default for that sounds fine (we definitely do want to keep
> > > > > > > > > > > > > some room
> > > > > > > > > > > > > available and I don't think using 32 bits as a default is
> > > > > > > > > > > > > good enough).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I think H.264/HEVC per-control flags should also be moved to
> > > > > > > > > > > > > u64.
> > > > > > > > > > > > 
> > > > > > > > > > > > Make sense, I guess bits (member : 1) are not allowed in uAPI
> > > > > > > > > > > > right ?
> > > > > > > > > > > 
> > > > > > > > > > > Mhh, even if they are, it makes it much harder to verify 32/64
> > > > > > > > > > > bit
> > > > > > > > > > > alignment constraints (we're dealing with 64-bit platforms that
> > > > > > > > > > > need to
> > > > > > > > > > > have 32-bit userspace and compat_ioctl).
> > > > > > > > > > 
> > > > > > > > > > I see, thanks.
> > > > > > > > > > 
> > > > > > > > > > > > > - Clear split of controls and terminology
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Some codecs have explicit NAL units that are good fits to
> > > > > > > > > > > > > match as
> > > > > > > > > > > > > controls: e.g. slice header, pps, sps. I think we should
> > > > > > > > > > > > > stick to the
> > > > > > > > > > > > > bitstream element names for those.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > > > > > > - renaming v4l2_ctrl_h264_decode_param to
> > > > > > > > > > > > > v4l2_ctrl_h264_slice_header;
> > > > > > > > > > > > 
> > > > > > > > > > > > Oops, I think you meant slice_prams ? decode_params matches
> > > > > > > > > > > > the
> > > > > > > > > > > > information found in SPS/PPS (combined?), while slice_params
> > > > > > > > > > > > matches
> > > > > > > > > > > > the information extracted (and executed in case of l0/l1) from
> > > > > > > > > > > > the
> > > > > > > > > > > > slice headers.
> > > > > > > > > > > 
> > > > > > > > > > > Yes you're right, I mixed them up.
> > > > > > > > > > > 
> > > > > > > > > > > >  That being said, to me this name wasn't confusing, since
> > > > > > > > > > > > 
> > > > > > > > > > > > it's not just the slice header, and it's per slice.
> > > > > > > > > > > 
> > > > > > > > > > > Mhh, what exactly remains in there and where does it originate
> > > > > > > > > > > in the
> > > > > > > > > > > bitstream? Maybe it wouldn't be too bad to have one control per
> > > > > > > > > > > actual
> > > > > > > > > > > group of bitstream elements.
> > > > > > > > > > > 
> > > > > > > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the
> > > > > > > > > > > > > reference lists
> > > > > > > > > > > > > where they belong, which seems to be slice_header;
> > > > > > > > > > > > 
> > > > > > > > > > > > There reference list is only updated by userspace (through
> > > > > > > > > > > > it's DPB)
> > > > > > > > > > > > base on the result of the last decoding step. I was very
> > > > > > > > > > > > confused for a
> > > > > > > > > > > > moment until I realize that the lists in the slice_header are
> > > > > > > > > > > > just a
> > > > > > > > > > > > list of modification to apply to the reference list in order
> > > > > > > > > > > > to produce
> > > > > > > > > > > > l0 and l1.
> > > > > > > > > > > 
> > > > > > > > > > > Indeed, and I'm suggesting that we pass the modifications only,
> > > > > > > > > > > which
> > > > > > > > > > > would fit a slice_header control.
> > > > > > > > > > 
> > > > > > > > > > I think I made my point why we want the dpb -> references. I'm
> > > > > > > > > > going to
> > > > > > > > > > validate with the VA driver now, to see if the references list
> > > > > > > > > > there is
> > > > > > > > > > usable with our code.
> > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > 
> > > > > > > > > > > Paul
> > > > > > > > > > > 
> > > > > > > > > > > > > I'm up for preparing and submitting these control changes
> > > > > > > > > > > > > and updating
> > > > > > > > > > > > > cedrus if they seem agreeable.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What do you think?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Paul
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [0]: https://lkml.org/lkml/2019/3/6/82
> > > > > > > > > > > > > [1]: https://patchwork.linuxtv.org/patch/55947/
> > > > > > > > > > > > > [2]:
> > > > > > > > > > > > > https://chromium.googlesource.com/chromiumos/third_party/ke
> > > > > > > > > > > > > rnel/+/4d7cb46539a93bb6acc802f5a46acddb5aaab378


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  7:29           ` Boris Brezillon
  2019-05-22  8:20             ` Boris Brezillon
@ 2019-05-22  8:32             ` Thierry Reding
  2019-05-22  9:29               ` Paul Kocialkowski
  1 sibling, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-22  8:32 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomasz Figa, Nicolas Dufresne, Paul Kocialkowski,
	Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Maxime Ripard, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 4323 bytes --]

On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> On Wed, 22 May 2019 15:39:37 +0900
> Tomasz Figa <tfiga@chromium.org> wrote:
> 
> > > It would be premature to state that we are excluding. We are just
> > > trying to find one format to get things upstream, and make sure we have
> > > a plan how to extend it. Trying to support everything on the first try
> > > is not going to work so well.
> > >
> > > What is interesting to provide is how does you IP achieve multi-slice
> > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > Typical questions are:
> > >
> > >   1. Do all slices have to be contiguous in memory
> > >   2. If 1., do you place start-code, AVC header or pass a seperate index to let the HW locate the start of each NAL ?
> > >   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)  
> > 
> > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > driver that was created mostly by RK themselves, we've been assuming
> > that the interrupt is for the complete frame, without any problems.
> 
> I confirm that's what happens when all slices forming a frame are packed
> in a single output buffer: you only get one interrupt at the end of the
> decoding process (in that case, when the frame is decoded). Of course,
> if you split things up and do per-slice decoding instead (one slice per
> buffer) you get an interrupt per slice, though I didn't manage to make
> that work.
> I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> interrupt on the second slice, which makes me think some states are
> reset between the 2 operations leading the engine to think that the
> second slice is part of a new frame.

That sounds a lot like how this works on Tegra. My understanding is that
for slice decoding you'd also get an interrupt every time a full slice
has been decoded, perhaps coupled with another "frame done" interrupt
once the full frame has been decoded after the last slice.

In frame-level decode mode you don't get interrupts in between and
instead only get the "frame done" interrupt, unless something went wrong
during decoding, in which case you also get an interrupt with error
flags and status registers that help determine what exactly happened.

> Anyway, it doesn't sound like a crazy idea to support both per-slice
> and per-frame decoding and maybe have a way to expose what a
> specific codec can do (through an extra cap mechanism).

Yeah, I think it makes sense to support both for devices that can do
both. From what Nicolas said it may make sense for an application to
want to do slice-level decoding if receiving a stream from the network
and frame-level decoding if playing back from a local file. If a driver
supports both, the application could detect that and choose the
appropriate format.

It sounds to me like using different input formats for that would be a
very natural way to describe it. Applications can already detect the set
of supported input formats and set the format when they allocate buffers
so that should work very nicely.
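
Something along these lines is what I'd imagine applications doing (a
minimal sketch; the two fourccs below are made-up placeholders for
whatever we end up defining for per-slice vs. per-frame input, nothing
here is existing uAPI):

  /* Pick per-slice input when the decoder offers it, else per-frame.
   * The two fourccs below are placeholders, not existing defines. */
  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <linux/videodev2.h>

  #define PIX_FMT_H264_SLICE v4l2_fourcc('S', '2', '6', '4') /* placeholder */
  #define PIX_FMT_H264_FRAME v4l2_fourcc('F', '2', '6', '4') /* placeholder */

  static bool has_format(int fd, __u32 pixfmt)
  {
          struct v4l2_fmtdesc desc = {
                  .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
          };

          for (desc.index = 0; !ioctl(fd, VIDIOC_ENUM_FMT, &desc); desc.index++)
                  if (desc.pixelformat == pixfmt)
                          return true;

          return false;
  }

  static int set_input_format(int fd)
  {
          struct v4l2_format fmt = {
                  .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
          };

          /* Coded width/height and sizeimage left for the driver to
           * adjust in this sketch. */
          fmt.fmt.pix_mp.pixelformat = has_format(fd, PIX_FMT_H264_SLICE) ?
                                       PIX_FMT_H264_SLICE : PIX_FMT_H264_FRAME;
          fmt.fmt.pix_mp.num_planes = 1;

          return ioctl(fd, VIDIOC_S_FMT, &fmt);
  }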

> The other option would be to support only per-slice decoding with a
> mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> only support per-frame decoding know when they should trigger the
> decoding operation. The downside is that it implies having a bounce
> buffer where the driver can pack slices to be decoded on the END_FRAME
> event.

I vaguely remember that that's what the video codec abstraction does in
Mesa/Gallium. I'm not very familiar with V4L2, but this seems like it
could be problematic to integrate with the way that V4L2 works in
general. Perhaps sending a special buffer (0 length or whatever) to mark
the end of a frame would work. But this is probably something that
others have already thought about, since slice-level decoding is what
most people are using, hence there must already be a way for userspace
to somehow synchronize input vs. output buffers. Or does this currently
just work by queueing bitstream buffers as fast as possible and then
dequeueing frame buffers as they become available?

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  8:32             ` Thierry Reding
@ 2019-05-22  9:29               ` Paul Kocialkowski
  2019-05-22 11:39                 ` Thierry Reding
  2019-05-22 18:26                 ` Nicolas Dufresne
  0 siblings, 2 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-05-22  9:29 UTC (permalink / raw)
  To: Thierry Reding, Boris Brezillon
  Cc: Tomasz Figa, Nicolas Dufresne, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

Le mercredi 22 mai 2019 à 10:32 +0200, Thierry Reding a écrit :
> On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > On Wed, 22 May 2019 15:39:37 +0900
> > Tomasz Figa <tfiga@chromium.org> wrote:
> > 
> > > > It would be premature to state that we are excluding. We are just
> > > > trying to find one format to get things upstream, and make sure we have
> > > > a plan how to extend it. Trying to support everything on the first try
> > > > is not going to work so well.
> > > > 
> > > > What is interesting to provide is how does you IP achieve multi-slice
> > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > Typical questions are:
> > > > 
> > > >   1. Do all slices have to be contiguous in memory
> > > >   2. If 1., do you place start-code, AVC header or pass a seperate index to let the HW locate the start of each NAL ?
> > > >   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)  
> > > 
> > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > driver that was created mostly by RK themselves, we've been assuming
> > > that the interrupt is for the complete frame, without any problems.
> > 
> > I confirm that's what happens when all slices forming a frame are packed
> > in a single output buffer: you only get one interrupt at the end of the
> > decoding process (in that case, when the frame is decoded). Of course,
> > if you split things up and do per-slice decoding instead (one slice per
> > buffer) you get an interrupt per slice, though I didn't manage to make
> > that work.
> > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > interrupt on the second slice, which makes me think some states are
> > reset between the 2 operations leading the engine to think that the
> > second slice is part of a new frame.
> 
> That sounds a lot like how this works on Tegra. My understanding is that
> for slice decoding you'd also get an interrupt every time a full slice
> has been decoded perhaps coupled with another "frame done" interrupt
> when the full frame has been decoded after the last slice.
> 
> In frame-level decode mode you don't get interrupts in between and
> instead only get the "frame done" interrupt. Unless something went wrong
> during decoding, in which case you also get an interrupt but with error
> flags and status registers that help determine what exactly happened.
> 
> > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > and per-frame decoding and maybe have a way to expose what a
> > specific codec can do (through an extra cap mechanism).
> 
> Yeah, I think it makes sense to support both for devices that can do
> both. From what Nicolas said it may make sense for an application to
> want to do slice-level decoding if receiving a stream from the network
> and frame-level decoding if playing back from a local file. If a driver
> supports both, the application could detect that and choose the
> appropriate format.
> 
> It sounds to me like using different input formats for that would be a
> very natural way to describe it. Applications can already detect the set
> of supported input formats and set the format when they allocate buffers
> so that should work very nicely.

Pixel formats are indeed the natural way to go about this, but I have
some reservations in this case. Slices are the natural unit of video
streams, just like frames are for display hardware. Part of the pipeline
configuration is slice-specific, so in theory the pipeline needs to be
reconfigured for each slice.

What we are currently doing in Cedrus is gathering all the slices and
using the last slice's specific configuration for the pipeline, which
sort of works but is very likely not a good idea.

You mentioned that the Tegra VPU currently always operates in frame
mode (even when the stream actually has multiple slices, which I assume
are gathered at some point). I wonder how it goes about configuring
the slice parameters (which are specific to each slice, not to the
frame) for the different slices.

I believe we should at least always expose per-slice granularity in the
pixel format and requests. Maybe we could have a way to allow multiple
slices to be gathered in the source buffer and have a control slice
array for each request. In that case, we'd have a single request queued
for the series of slices, with a bit offset in each control to the
matching slice.

Then we could specify that such slices must be appended in a way that
suits most decoders that would have to operate per-frame (so we need to
figure this out). Worst case, we'll always have offsets in the controls
if we need to set up a bounce buffer in the driver because things are
not laid out the way we specified.
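
To make this a bit more concrete, here is a very rough sketch of the
kind of per-request control I have in mind (every name below is made up
for illustration, none of this is existing uAPI):

  /* Hypothetical sketch: one request carries a source buffer with N
   * appended slices plus an array control describing each of them. */
  #include <linux/types.h>

  struct example_h264_slice_entry {
          __u32 bit_offset;       /* offset of the slice NALU in the buffer, in bits */
          __u32 bit_size;         /* size of the slice NALU, in bits */
          __u32 header_bit_size;  /* size of the slice_header, in bits */
          /* ... the other per-slice fields we already have ... */
  };

  struct example_h264_slice_array {
          __u32 num_slices;
          struct example_h264_slice_entry entries[16];
  };

Drivers for per-slice hardware would run one decode operation per
entry, while drivers for per-frame hardware would program the whole
buffer at once (or pack it into a bounce buffer using the offsets).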

Then we introduce a specific cap to indicate which mode is supported
(per-slice and/or per-frame) and adapt our ffmpeg reference to be able
to operate in both modes.

That adds some complexity for userspace, but I don't think we can avoid
it at this point and it feels better than having two different pixel
formats (which would probably be even more complex to manage for
userspace).

What do you think?

> > The other option would be to support only per-slice decoding with a
> > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > only support per-frame decoding know when they should trigger the
> > decoding operation. The downside is that it implies having a bounce
> > buffer where the driver can pack slices to be decoded on the END_FRAME
> > event.
> 
> I vaguely remember that that's what the video codec abstraction does in
> Mesa/Gallium. 

Well, if it's exposed through VDPAU or VAAPI, the interface already
operates per-slice and it would certainly not be a big issue to change
that.

Talking about the mesa/gallium video decoding stuff, I think it would
be worth having V4L2 interfaces for that now that we have the Request
API.

Basically, Nvidia GPUs have video decoding blocks (which could be
similar to the ones present on Tegra) that are accessed through a
firmware running on a Falcon MCU on the GPU side.

Having a standardized firmware interface for these and a V4L2 M2M
driver for the interface would certainly make it easier for everyone to
handle that. I don't really see why this video decoding hardware has to
be exposed through the display stack anyway, and one could want to use
the GPU's video decoder without bringing up the shading cores.

> I'm not very familiar with V4L2, but this seems like it
> could be problematic to integrate with the way that V4L2 works in
> general. Perhaps sending a special buffer (0 length or whatever) to mark
> the end of a frame would work. But this is probably something that
> others have already thought about, since slice-level decoding is what
> most people are using, hence there must already be a way for userspace
> to somehow synchronize input vs. output buffers. Or does this currently
> just work by queueing bitstream buffers as fast as possible and then
> dequeueing frame buffers as they become available?

We have a Request API mechanism where we group controls (parsed
bitstream meta-data) and source (OUTPUT) buffers together and submit
them tied. When each request gets processed, its buffer enters the
OUTPUT queue, where it gets picked up by the driver and associated with
the first available destination (CAPTURE) buffer. Then the driver grabs
the buffers and applies the controls matching the source buffer's
request before starting the M2M decoding run.

We have already worked on handling the case of requiring a single
destination buffer for the different slices, by having a flag to
indicate whether the destination buffer should be held.
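
For reference, the submission flow from userspace looks roughly like
the sketch below (error handling, buffer setup and the actual control
payloads omitted):

  #include <poll.h>
  #include <sys/ioctl.h>
  #include <linux/media.h>
  #include <linux/videodev2.h>

  static int decode_one_slice(int media_fd, int video_fd,
                              struct v4l2_ext_controls *slice_ctrls,
                              struct v4l2_buffer *src_buf)
  {
          int req_fd;

          /* 1. Allocate a request from the media device. */
          if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd))
                  return -1;

          /* 2. Attach the parsed bitstream controls to the request. */
          slice_ctrls->which = V4L2_CTRL_WHICH_REQUEST_VAL;
          slice_ctrls->request_fd = req_fd;
          if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, slice_ctrls))
                  return -1;

          /* 3. Queue the source (OUTPUT) buffer as part of the request. */
          src_buf->flags |= V4L2_BUF_FLAG_REQUEST_FD;
          src_buf->request_fd = req_fd;
          if (ioctl(video_fd, VIDIOC_QBUF, src_buf))
                  return -1;

          /* 4. Submit the request: the driver applies the controls and
           *    schedules the M2M decode job. */
          if (ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE))
                  return -1;

          /* 5. Wait for the request to complete. */
          struct pollfd pfd = { .fd = req_fd, .events = POLLPRI };
          return poll(&pfd, 1, -1) > 0 ? 0 : -1;
  }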

Cheers,

Paul


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-21 16:23       ` Nicolas Dufresne
  2019-05-22  6:39         ` Tomasz Figa
@ 2019-05-22 10:08         ` Thierry Reding
  2019-05-22 18:37           ` Nicolas Dufresne
  1 sibling, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-22 10:08 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 16633 bytes --]

On Tue, May 21, 2019 at 12:23:46PM -0400, Nicolas Dufresne wrote:
> Le mardi 21 mai 2019 à 17:43 +0200, Thierry Reding a écrit :
> > On Wed, May 15, 2019 at 07:42:50PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > > 
> > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > better idea of what the situation is like on platforms other than
> > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > 
> > > > > - Per-slice decoding
> > > > > 
> > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > to implement the required core bits. When we agree it looks good, we
> > > > > should lift the restriction that all slices must be concatenated and
> > > > > have them submitted as individual requests.
> > > > > 
> > > > > One question is what to do about other controls. I feel like it would
> > > > > make sense to always pass all the required controls for decoding the
> > > > > slice, including the ones that don't change across slices. But there
> > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > and relying on the "control cache" can work, but we need to specify
> > > > > that only a single stream can be decoded per opened instance of the
> > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > 
> > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > should be responsible of doing time-division multiplexing across
> > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > userspace would require some sort of daemon to work properly across
> > > > processes. I also think the kernel is better place for doing resource
> > > > access scheduling in general.
> > > 
> > > I agree with that yes. We always have a single m2m context and specific
> > > controls per opened device so keeping cached values works out well.
> > > 
> > > So maybe we shall explicitly require that the request with the first
> > > slice for a frame also contains the per-frame controls.
> > > 
> > > > > - Annex-B formats
> > > > > 
> > > > > I don't think we have really reached a conclusion on the pixel formats
> > > > > we want to expose. The main issue is how to deal with codecs that need
> > > > > the full slice NALU with start code, where the slice_header is
> > > > > duplicated in raw bitstream, when others are fine with just the encoded
> > > > > slice data and the parsed slice header control.
> > > > > 
> > > > > My initial thinking was that we'd need 3 formats:
> > > > > - One that only takes only the slice compressed data (without raw slice
> > > > > header and start code);
> > > > > - One that takes both the NALU data (including start code, raw header
> > > > > and compressed data) and slice header controls;
> > > > > - One that takes the NALU data but no slice header.
> > > > > 
> > > > > But I no longer think the latter really makes sense in the context of
> > > > > stateless video decoding.
> > > > > 
> > > > > A side-note: I think we should definitely have data offsets in every
> > > > > case, so that implementations can just push the whole NALU regardless
> > > > > of the format if they're lazy.
> > > > 
> > > > I realize that I didn't share our latest research on the subject. So a
> > > > slice in the original bitstream is formed of the following blocks
> > > > (simplified):
> > > > 
> > > >   [nal_header][nal_type][slice_header][slice]
> > > 
> > > Thanks for the details!
> > > 
> > > > nal_header:
> > > > This one is a header used to locate the start and the end of the of a
> > > > NAL. There is two standard forms, the ANNEX B / start code, a sequence
> > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first byte
> > > > would be a leading 0 from the previous NAL padding, but this is also
> > > > totally valid start code. The second form is the AVC form, notably used
> > > > in ISOMP4 container. It simply is the size of the NAL. You must keep
> > > > your buffer aligned to NALs in this case as you cannot scan from random
> > > > location.
> > > > 
> > > > nal_type:
> > > > It's a bit more then just the type, but it contains at least the
> > > > information of the nal type. This has different size on H.264 and HEVC
> > > > but I know it's size is in bytes.
> > > > 
> > > > slice_header:
> > > > This contains per slice parameters, like the modification lists to
> > > > apply on the references. This one has a size in bits, not in bytes.
> > > > 
> > > > slice:
> > > > I don't really know what is in it exactly, but this is the data used to
> > > > decode. This bit has a special coding called the anti-emulation, which
> > > > prevents a start-code from appearing in it. This coding is present in
> > > > both forms, ANNEX-B or AVC (in GStreamer and some reference manual they
> > > > call ANNEX-B the bytestream format).
> > > > 
> > > > So, what we notice is that what is currently passed through Cedrus
> > > > driver:
> > > >   [nal_type][slice_header][slice]
> > > > 
> > > > This matches what is being passed through VA-API. We can understand
> > > > that stripping off the slice_header would be hard, since it's size is
> > > > in bits. Instead we pass size and header_bit_size in slice_params.
> > > 
> > > True, there is that.
> > > 
> > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > start_code_e, when you turn this off, you don't need start code. As a
> > > > side effect, the bitstream becomes identical. We do now know that it
> > > > works with the ffmpeg branch implement for cedrus.
> > > 
> > > Oh great, that makes life easier in the short term, but I guess the
> > > issue could arise on another decoder sooner or later.
> > > 
> > > > Now what's special about Hantro G1 (also found on IMX8M) is that it
> > > > take care for us of reading and executing the modification lists found
> > > > in the slice header. Mostly because I very disliked having to pass the
> > > > p/b0/b1 parameters, is that Boris implemented in the driver the
> > > > transformation from the DPB entries into this p/b0/b1 list. These list
> > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2. the
> > > > following section is the execution of the modification list. As this
> > > > list is not modified, it only need to be calculated per frame. As a
> > > > result, we don't need these new lists, and we can work with the same
> > > > H264_SLICE format as Cedrus is using.
> > > 
> > > Yes but I definitely think it makes more sense to pass the list
> > > modifications rather than reconstructing those in the driver from a
> > > full list. IMO controls should stick to the bitstream as close as
> > > possible.
> > > 
> > > > Now, this is just a start. For RK3399, we have a different CODEC
> > > > design. This one does not have the start_code_e bit. What the IP does,
> > > > is that you give it one or more slice per buffer, setup the params,
> > > > start decoding, but the decoder then return the location of the
> > > > following NAL. So basically you could offload the scanning of start
> > > > code to the HW. That being said, with the driver layer in between, that
> > > > would be amazingly inconvenient to use, and with Boyer-more algorithm,
> > > > it is pretty cheap to scan this type of start-code on CPU. But the
> > > > feature that this allows is to operate in frame mode. In this mode, you
> > > > have 1 interrupt per frame.
> > > 
> > > I'm not sure there is any interest in exposing that from userspace and
> > > my current feeling is that we should just ditch support for per-frame
> > > decoding altogether. I think it mixes decoding with notions that are
> > > higher-level than decoding, but I agree it's a blurry line.
> > 
> > I'm not sure ditching support for per-frame decoding would be a wise
> > decision. What if some device comes around that only supports frame
> > decoding and can't handle individual slices?
> > 
> > We have such a situation on Tegra, for example. I think the hardware can
> > technically decode individual slices, but it can also be set up to do a
> > lot more and operate in basically a per-frame mode where you just pass
> > it a buffer containing the complete bitstream for one frame and it'll
> > just raise an interrupt when it's done decoding.
> > 
> > Per-frame mode is what's currently implemented in the staging driver and
> > as far as I can tell it's also what's implemented in the downstream
> > driver, which uses a completely different architecture (it uploads a
> > firmware that processes a command stream). I have seen registers that
> > seem to be related to a slice-decoding mode, but honestly I have no idea
> > how to program them to achieve that.
> > 
> > Now the VDE IP that I'm dealing with is pretty old, but from what I know
> > of newer IP, they follow a similar command stream architecture as the
> > downstream VDE driver, so I'm not sure those support per-slice decoding
> > either. They typically have a firmware that processes command streams
> > and userspace typically just passes a single bitstream buffer along with
> > reference frames and gets back the decoded frame. I'd have to
> > investigate further to understand if slice-level decoding is supported
> > on the newer hardware.
> > 
> > I'm not familiar with any other decoders, but per-frame decoding doesn't
> > strike me as a very exotic idea. Excluding such decoders from the ABI
> > sounds a bit premature.
> 
> It would be premature to state that we are excluding. We are just
> trying to find one format to get things upstream, and make sure we have
> a plan how to extend it. Trying to support everything on the first try
> is not going to work so well.

Okay that sounds reasonable. I must have misinterpreted what you were
discussing. Sorry.

> What is interesting to provide is how does you IP achieve multi-slice
> decoding per frame. That's what we are studying on the RK/Hantro chip.
> Typical questions are:
> 
>   1. Do all slices have to be contiguous in memory

All of the systems that integrate VDE have an SMMU, though on many of
them that SMMU is very limited (on one generation of Tegra it's really
only a GART and on others the number of virtual address spaces is so
small that it's not always practical to rely on the SMMU). So if SMMU
support is enabled, then slices can be scattered in memory, but they
will have to be I/O virtually contiguous. The VDE itself does not
support SG.

>   2. If 1., do you place start-code, AVC header or pass a seperate index to let the HW locate the start of each NAL ?

My understanding is that there's a "syntax engine" whose job it is to
parse the bitstream that you point it at (using the "bitstream engine"
to extract individual elements). The syntax elements parsed are used to
control the "macro-block engine" via a set of commands. The syntax
engine needs the start-code in order to work and will generate an error
otherwise. I haven't come across a way to disable this, so it looks like
the start code is always required. Or I should say, the decoder always
requires Annex B format. This also happens to be what, for example, VDPAU
will generate. I suppose it's a fairly natural choice, given that that's
the byte stream format recommended by the H.264 standard.

>   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)

Yeah, we definitely do get a single interrupt at the end of a frame, or
when an error occurs. Looking a bit at the register documentation, it
looks like this can be more fine-grained: we can, for example, get an
interrupt at the end of a slice or a row of macroblocks.

> And other things like this. The more data we have, the better the
> initial interface will be.
> 
> > 
> > > > But it also support slice mode, with an
> > > > interrupt per slice, which is what we decided to use.
> > > 
> > > Easier for everyone and probably better for latency as well :)
> > 
> > I'm not sure I understand what's easier about slice-level decoding or
> > how this would improve latency. If anything getting less interrupts is
> > good, isn't it?
> > 
> > If we can offload more to hardware, certainly that's something we want
> > to take advantage of, no?
> 
> In H.264, pretty much all stream have single slice per frame. That's
> because it gives the highest quality. But in live streaming, like for
> webrtc, it's getting more common to actually encode with multiple
> slices (it's group of macroblocks usually in raster order). Usually
> it's a very small amount of slices, 4, 8, something in this range.
> 
> When a slice is encoded, the encoder will let it go before it starts
> the following, this allow network transfer to happen in parallel of
> decoding.
> 
> On the receiver, as soon as a slice is available, the decoder will be
> started immediately, which allow the receiving of buffer and the
> decoding of the slices to happen in parallel. You end up with a lot
> less delay between the reception of the last slice and having a full
> frame ready.

Okay, that clarifies things. I'm not sure I fully agree with "a lot less
delay". Hardware decoders are usually capable of decoding in realtime so
in most cases I would expect the decoder latency to be somewhere on the
order of 16-40 ms, and network latency can't be much higher than that to
ensure smooth playback, so worst case the total latency should be on the
order of 32-80 ms. Even assuming 100 ms worst case latency, that's not
too bad in my experience. Unless you're aiming for some application like
game streaming, in which case you'd be more on the lower end of that
range anyway because of the required framerate.

Anyway, I'm not trying to argue that slice-level decoding is a bad thing
or unnecessary. I'm merely trying to point out that for many use-cases
frame-level decoding is more than good enough for people's needs.

> So that's how slices are used to reduce latency. Now, if you are
> decoding from a container like ISOMP4, you'll have full frame, so it
> make sense to queue all these frame, and le the decoder bundle that if
> possible, if the HW allow to enable mode where you have single IRQ per
> frame. Though, it's pretty rare that you'll find such a file with
> slices. What we'd like to resolve is how these are resolved. There is
> nothing that prevents it right now in the uAPI, but you'd have to copy
> the input into another buffer, adding the separators if needed.
> 
> What we are trying to achieve in this thread is to find a compromise
> that makes uAPI sane, but also makes decoding efficient on all the HW
> we know at least.

It's been some time since I looked at this in detail, but my
recollection is that things like MPEG TS use what is basically the Annex
B byte stream format. On the other hand, I recall that ffmpeg has a
filter that can be used to add a start code if the input stream doesn't
have one (e.g. if you are playing back from an MP4 container) but the
decoder requires one (e.g. VDPAU). I'm not familiar with VAAPI or things
like gstreamer, but I suspect that they have something similar in place.
Perhaps somebody with more knowledge of those can share their wisdom. If
there are any commonalities between all of those maybe that could serve
as guidance on what a V4L2 interface should be providing in terms of
input format.

Naively I would consider more information (rather than less) easier to
deal with. If you have more information than necessary it's usually
pretty easy to skip it (hardware may already be able to do so, or you
can rewrite some pointer/offset to do that). On the other hand, if you
have too little information it's not always easy to add it. I guess you
could argue that it's not a big issue for something like a start code,
but it still means you have to concatenate in order to prepend the data,
which usually means you need a copy in software if you don't have SG
capabilities.
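
Just to illustrate the kind of copy I have in mind, a rough sketch
(purely illustrative, not taken from any existing code) of turning a
single length-prefixed AVC NAL unit into Annex B form by prepending the
start code could look like this:

	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>

	/*
	 * Purely illustrative: prepend the 4-byte Annex B start code to one
	 * length-prefixed AVC NAL unit. This is the software copy referred
	 * to above; with SG support the start code could instead live in a
	 * small separate buffer chained in front of the NAL.
	 */
	static uint8_t *avc_nal_to_annex_b(const uint8_t *nal, size_t size,
					   size_t *out_size)
	{
		static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };
		uint8_t *buf = malloc(sizeof(start_code) + size);

		if (!buf)
			return NULL;

		memcpy(buf, start_code, sizeof(start_code));
		memcpy(buf + sizeof(start_code), nal, size);
		*out_size = sizeof(start_code) + size;

		return buf;
	}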

Of course I may be somewhat biased because this happens to coincide with
what VDE expects...

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  8:26                         ` Paul Kocialkowski
@ 2019-05-22 10:42                           ` Thierry Reding
  2019-05-22 10:55                             ` Hans Verkuil
  0 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-22 10:42 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Tomasz Figa, Nicolas Dufresne, Jernej Škrabec,
	Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 23761 bytes --]

On Wed, May 22, 2019 at 10:26:28AM +0200, Paul Kocialkowski wrote:
> Hi,
> 
> Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
> > On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > > Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> > > > > Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> > > > > > Hi,
> > > > > > 
> > > > > > On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> > > > > > > Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> > > > > > > > > Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> > > > > > > > > > > Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> > > > > > > > > > > > Hi,
> > > > > > > > > > > > 
> > > > > > > > > > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> > > > > :
> > > > > > > > > > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> > > > > écrit :
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > With the Rockchip stateless VPU driver in the works, we now
> > > > > > > > > > > > > > have a
> > > > > > > > > > > > > > better idea of what the situation is like on platforms other
> > > > > > > > > > > > > > than
> > > > > > > > > > > > > > Allwinner. This email shares my conclusions about the
> > > > > > > > > > > > > > situation and how
> > > > > > > > > > > > > > we should update the MPEG-2, H.264 and H.265 controls
> > > > > > > > > > > > > > accordingly.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > - Per-slice decoding
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > We've discussed this one already[0] and Hans has submitted a
> > > > > > > > > > > > > > patch[1]
> > > > > > > > > > > > > > to implement the required core bits. When we agree it looks
> > > > > > > > > > > > > > good, we
> > > > > > > > > > > > > > should lift the restriction that all slices must be
> > > > > > > > > > > > > > concatenated and
> > > > > > > > > > > > > > have them submitted as individual requests.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > One question is what to do about other controls. I feel like
> > > > > > > > > > > > > > it would
> > > > > > > > > > > > > > make sense to always pass all the required controls for
> > > > > > > > > > > > > > decoding the
> > > > > > > > > > > > > > slice, including the ones that don't change across slices.
> > > > > > > > > > > > > > But there
> > > > > > > > > > > > > > may be no particular advantage to this and only downsides.
> > > > > > > > > > > > > > Not doing it
> > > > > > > > > > > > > > and relying on the "control cache" can work, but we need to
> > > > > > > > > > > > > > specify
> > > > > > > > > > > > > > that only a single stream can be decoded per opened instance
> > > > > > > > > > > > > > of the
> > > > > > > > > > > > > > v4l2 device. This is the assumption we're going with for
> > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > My opinion on this is that the m2m instance is a state, and
> > > > > > > > > > > > > the driver
> > > > > > > > > > > > > should be responsible of doing time-division multiplexing
> > > > > > > > > > > > > across
> > > > > > > > > > > > > multiple m2m instance jobs. Doing the time-division
> > > > > > > > > > > > > multiplexing in
> > > > > > > > > > > > > userspace would require some sort of daemon to work properly
> > > > > > > > > > > > > across
> > > > > > > > > > > > > processes. I also think the kernel is better place for doing
> > > > > > > > > > > > > resource
> > > > > > > > > > > > > access scheduling in general.
> > > > > > > > > > > > 
> > > > > > > > > > > > I agree with that yes. We always have a single m2m context and
> > > > > > > > > > > > specific
> > > > > > > > > > > > controls per opened device so keeping cached values works out
> > > > > > > > > > > > well.
> > > > > > > > > > > > 
> > > > > > > > > > > > So maybe we shall explicitly require that the request with the
> > > > > > > > > > > > first
> > > > > > > > > > > > slice for a frame also contains the per-frame controls.
> > > > > > > > > > > > 
> > > > > > > > > > > > > > - Annex-B formats
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I don't think we have really reached a conclusion on the
> > > > > > > > > > > > > > pixel formats
> > > > > > > > > > > > > > we want to expose. The main issue is how to deal with codecs
> > > > > > > > > > > > > > that need
> > > > > > > > > > > > > > the full slice NALU with start code, where the slice_header
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > duplicated in raw bitstream, when others are fine with just
> > > > > > > > > > > > > > the encoded
> > > > > > > > > > > > > > slice data and the parsed slice header control.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > My initial thinking was that we'd need 3 formats:
> > > > > > > > > > > > > > - One that only takes only the slice compressed data
> > > > > > > > > > > > > > (without raw slice
> > > > > > > > > > > > > > header and start code);
> > > > > > > > > > > > > > - One that takes both the NALU data (including start code,
> > > > > > > > > > > > > > raw header
> > > > > > > > > > > > > > and compressed data) and slice header controls;
> > > > > > > > > > > > > > - One that takes the NALU data but no slice header.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > But I no longer think the latter really makes sense in the
> > > > > > > > > > > > > > context of
> > > > > > > > > > > > > > stateless video decoding.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > A side-note: I think we should definitely have data offsets
> > > > > > > > > > > > > > in every
> > > > > > > > > > > > > > case, so that implementations can just push the whole NALU
> > > > > > > > > > > > > > regardless
> > > > > > > > > > > > > > of the format if they're lazy.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I realize that I didn't share our latest research on the
> > > > > > > > > > > > > subject. So a
> > > > > > > > > > > > > slice in the original bitstream is formed of the following
> > > > > > > > > > > > > blocks
> > > > > > > > > > > > > 
> > > > > > > > > > > > > (simplified):
> > > > > > > > > > > > >   [nal_header][nal_type][slice_header][slice]
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks for the details!
> > > > > > > > > > > > 
> > > > > > > > > > > > > nal_header:
> > > > > > > > > > > > > This one is a header used to locate the start and the end of
> > > > > > > > > > > > > the of a
> > > > > > > > > > > > > NAL. There is two standard forms, the ANNEX B / start code, a
> > > > > > > > > > > > > sequence
> > > > > > > > > > > > > of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> > > > > > > > > > > > > byte
> > > > > > > > > > > > > would be a leading 0 from the previous NAL padding, but this
> > > > > > > > > > > > > is also
> > > > > > > > > > > > > totally valid start code. The second form is the AVC form,
> > > > > > > > > > > > > notably used
> > > > > > > > > > > > > in ISOMP4 container. It simply is the size of the NAL. You
> > > > > > > > > > > > > must keep
> > > > > > > > > > > > > your buffer aligned to NALs in this case as you cannot scan
> > > > > > > > > > > > > from random
> > > > > > > > > > > > > location.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > nal_type:
> > > > > > > > > > > > > It's a bit more then just the type, but it contains at least
> > > > > > > > > > > > > the
> > > > > > > > > > > > > information of the nal type. This has different size on H.264
> > > > > > > > > > > > > and HEVC
> > > > > > > > > > > > > but I know it's size is in bytes.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > slice_header:
> > > > > > > > > > > > > This contains per slice parameters, like the modification
> > > > > > > > > > > > > lists to
> > > > > > > > > > > > > apply on the references. This one has a size in bits, not in
> > > > > > > > > > > > > bytes.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > slice:
> > > > > > > > > > > > > I don't really know what is in it exactly, but this is the
> > > > > > > > > > > > > data used to
> > > > > > > > > > > > > decode. This bit has a special coding called the
> > > > > > > > > > > > > anti-emulation, which
> > > > > > > > > > > > > prevents a start-code from appearing in it. This coding is
> > > > > > > > > > > > > present in
> > > > > > > > > > > > > both forms, ANNEX-B or AVC (in GStreamer and some reference
> > > > > > > > > > > > > manual they
> > > > > > > > > > > > > call ANNEX-B the bytestream format).
> > > > > > > > > > > > > 
> > > > > > > > > > > > > So, what we notice is that what is currently passed through
> > > > > > > > > > > > > Cedrus
> > > > > > > > > > > > > 
> > > > > > > > > > > > > driver:
> > > > > > > > > > > > >   [nal_type][slice_header][slice]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > This matches what is being passed through VA-API. We can
> > > > > > > > > > > > > understand
> > > > > > > > > > > > > that stripping off the slice_header would be hard, since it's
> > > > > > > > > > > > > size is
> > > > > > > > > > > > > in bits. Instead we pass size and header_bit_size in
> > > > > > > > > > > > > slice_params.
> > > > > > > > > > > > 
> > > > > > > > > > > > True, there is that.
> > > > > > > > > > > > 
> > > > > > > > > > > > > About Rockchip. RK3288 is a Hantro G1 and has a bit called
> > > > > > > > > > > > > start_code_e, when you turn this off, you don't need start
> > > > > > > > > > > > > code. As a
> > > > > > > > > > > > > side effect, the bitstream becomes identical. We do now know
> > > > > > > > > > > > > that it
> > > > > > > > > > > > > works with the ffmpeg branch implement for cedrus.
> > > > > > > > > > > > 
> > > > > > > > > > > > Oh great, that makes life easier in the short term, but I guess
> > > > > > > > > > > > the
> > > > > > > > > > > > issue could arise on another decoder sooner or later.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Now what's special about Hantro G1 (also found on IMX8M) is
> > > > > > > > > > > > > that it
> > > > > > > > > > > > > take care for us of reading and executing the modification
> > > > > > > > > > > > > lists found
> > > > > > > > > > > > > in the slice header. Mostly because I very disliked having to
> > > > > > > > > > > > > pass the
> > > > > > > > > > > > > p/b0/b1 parameters, is that Boris implemented in the driver
> > > > > > > > > > > > > the
> > > > > > > > > > > > > transformation from the DPB entries into this p/b0/b1 list.
> > > > > > > > > > > > > These list
> > > > > > > > > > > > > a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> > > > > > > > > > > > > the
> > > > > > > > > > > > > following section is the execution of the modification list.
> > > > > > > > > > > > > As this
> > > > > > > > > > > > > list is not modified, it only need to be calculated per frame.
> > > > > > > > > > > > > As a
> > > > > > > > > > > > > result, we don't need these new lists, and we can work with
> > > > > > > > > > > > > the same
> > > > > > > > > > > > > H264_SLICE format as Cedrus is using.
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes but I definitely think it makes more sense to pass the list
> > > > > > > > > > > > modifications rather than reconstructing those in the driver
> > > > > > > > > > > > from a
> > > > > > > > > > > > full list. IMO controls should stick to the bitstream as close
> > > > > > > > > > > > as
> > > > > > > > > > > > possible.
> > > > > > > > > > > 
> > > > > > > > > > > For Hantro and RKVDEC, the list of modification is parsed by the
> > > > > > > > > > > IP
> > > > > > > > > > > from the slice header bits. Just to make sure, because I myself
> > > > > > > > > > > was
> > > > > > > > > > > confused on this before, the slice header does not contain a list
> > > > > > > > > > > of
> > > > > > > > > > > references, instead it contains a list modification to be applied
> > > > > > > > > > > to
> > > > > > > > > > > the reference list. I need to check again, but to execute these
> > > > > > > > > > > modification, you need to filter and sort the references in a
> > > > > > > > > > > specific
> > > > > > > > > > > order. This should be what is defined in the spec as 8.2.4.1 and
> > > > > > > > > > > 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> > > > > > > > > > > 
> > > > > > > > > > > The list of references is deduced from the DPB. The DPB, which I
> > > > > > > > > > > thinks
> > > > > > > > > > > should be rename as "references", seems more useful then p/b0/b1,
> > > > > > > > > > > since
> > > > > > > > > > > this is the data that gives use the ability to implementing glue
> > > > > > > > > > > in the
> > > > > > > > > > > driver to compensate some HW differences.
> > > > > > > > > > > 
> > > > > > > > > > > In the case of Hantro / RKVDEC, we think it's natural to build the
> > > > > > > > > > > HW
> > > > > > > > > > > specific lists (p/b0/b1) from the references rather then adding HW
> > > > > > > > > > > specific list in the decode_params structure. The fact these lists
> > > > > > > > > > > are
> > > > > > > > > > > standard intermediate step of the standard is not that important.
> > > > > > > > > > 
> > > > > > > > > > Sorry I got confused (once more) about it. Boris just explained the
> > > > > > > > > > same thing to me over IRC :) Anyway my point is that we want to pass
> > > > > > > > > > what's in ffmpeg's short and long term ref lists, and name them that
> > > > > > > > > > instead of dpb.
> > > > > > > > > > 
> > > > > > > > > > > > > Now, this is just a start. For RK3399, we have a different
> > > > > > > > > > > > > CODEC
> > > > > > > > > > > > > design. This one does not have the start_code_e bit. What the
> > > > > > > > > > > > > IP does,
> > > > > > > > > > > > > is that you give it one or more slice per buffer, setup the
> > > > > > > > > > > > > params,
> > > > > > > > > > > > > start decoding, but the decoder then return the location of
> > > > > > > > > > > > > the
> > > > > > > > > > > > > following NAL. So basically you could offload the scanning of
> > > > > > > > > > > > > start
> > > > > > > > > > > > > code to the HW. That being said, with the driver layer in
> > > > > > > > > > > > > between, that
> > > > > > > > > > > > > would be amazingly inconvenient to use, and with Boyer-more
> > > > > > > > > > > > > algorithm,
> > > > > > > > > > > > > it is pretty cheap to scan this type of start-code on CPU. But
> > > > > > > > > > > > > the
> > > > > > > > > > > > > feature that this allows is to operate in frame mode. In this
> > > > > > > > > > > > > mode, you
> > > > > > > > > > > > > have 1 interrupt per frame.
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm not sure there is any interest in exposing that from
> > > > > > > > > > > > userspace and
> > > > > > > > > > > > my current feeling is that we should just ditch support for
> > > > > > > > > > > > per-frame
> > > > > > > > > > > > decoding altogether. I think it mixes decoding with notions that
> > > > > > > > > > > > are
> > > > > > > > > > > > higher-level than decoding, but I agree it's a blurry line.
> > > > > > > > > > > 
> > > > > > > > > > > I'm not worried about this either. We can already support that by
> > > > > > > > > > > copying the bitstream internally to the driver, though zero-copy
> > > > > > > > > > > with
> > > > > > > > > > > this would require a new format, the one we talked about,
> > > > > > > > > > > SLICE_ANNEX_B.
> > > > > > > > > > 
> > > > > > > > > > Right, but what I'm thinking about is making that the one and only
> > > > > > > > > > format. The rationale is that it's always easier to just append a
> > > > > > > > > > start
> > > > > > > > > > code from userspace if needed. And we need a bit offset to the slice
> > > > > > > > > > data part anyway, so it doesn't hurt to require a few extra bits to
> > > > > > > > > > have the whole thing that will work in every situation.
> > > > > > > > > 
> > > > > > > > > What I'd like is to eventually allow zero-copy (aka userptr) into the
> > > > > > > > > driver. If you make the start code mandatory, any decoding from ISOMP4
> > > > > > > > > (.mp4, .mov) will require a full bitstream copy in userspace to add
> > > > > > > > > the
> > > > > > > > > start code (unless you hack your allocation in your demuxer, but it's
> > > > > > > > > a
> > > > > > > > > bit complicated since this code might come from two libraries). In
> > > > > > > > > ISOMP4, you have an AVC header, which is just the size of the NAL that
> > > > > > > > > follows.
> > > > > > > > 
> > > > > > > > Well, I think we have to do a copy from system memory to the buffer
> > > > > > > > allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> > > > > > > > expected not to have any MMU unit and not allow sg import anyway.
> > > > > > > 
> > > > > > > The Rockchip has an mmu. You need one copy at least indeed,
> > > > > > 
> > > > > > Is the MMU in use currently? That can make things troublesome if we run
> > > > > > into a case where the VPU has MMU and deals with scatter-gather while
> > > > > > the display part doesn't. As far as I know, there's no way for
> > > > > > userspace to know whether a dma-buf-exported buffer is backed by CMA or
> > > > > > by scatter-gather memory. This feels like a major issue for using dma-
> > > > > > buf, since userspace can't predict whether a buffer exported on one
> > > > > > device can be imported on another when building its pipeline.
> > > > > 
> > > > > FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> > > > > for it yet. It is supported for display, both VPUs and some other devices. I
> > > > > think no sane SoC designer would left out one or another unit without IOMMU
> > > > > support, that just calls for troubles, as you pointed out.
> > > > 
> > > > Right right, I've been following that from a distance :)
> > > > 
> > > > Indeed I think it's realistic to expect that for now, but it may not
> > > > play out so well in the long term. For instance, maybe connecting a USB
> > > > display would require CMA when the rest of the system can do with sg.
> > > > 
> > > > I think it would really be useful for userspace to have a way to test
> > > > whether a buffer can be imported from one device to another. It feels
> > > > better than indicating where the memory lives, since there are
> > > > countless cases where additional restrictions apply too.
> > > 
> > > I don't know for the integration on the Rockchip, but I did notice the
> > > register documentation for it.
> > 
> > All the important components in the SoC have their IOMMUs as well -
> > display controller, GPU.
> > 
> > There is a blitter called RGA that is not behind an IOMMU, but has
> > some scatter-gather capability (with a need for the hardware sg table
> > to be physically contiguous). 
> 
> That's definitely good to know and justfies the need to introduce a way
> for userspace to check if a buffer can be imported from one device to
> another.

There's been a lot of discussion about this before. You may be aware of
James Jones' attempt to create an allocator library for this:

	https://github.com/cubanismo/allocator

I haven't heard an update on this for quite some time and I think it's
stagnated due to a lack of interest. However, I think the lack of
interest could be an indicator that the issue might not be pressing
enough. Luckily most SoCs are reasonably integrated, so there's usually
no issue sharing buffers between different hardware blocks.

Technically it's already possible to check for compatibility of buffers
at import time.

In the tegra-vde driver we do something along the lines of:

	sgt = dma_buf_map_attachment(...);
	...
	if (sgt->nents != 1)
		return -EINVAL;

because we don't support an IOMMU currently. Of course it's still up to
userspace to react to that in a sensible way and it may not be obvious
what to do when the import fails.
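
In a V4L2-based flow, one way for userspace to react could be roughly
the following (hypothetical sketch, queue type and error handling
simplified): probe the dma-buf import and fall back to MMAP buffers plus
a copy if it is rejected.

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/videodev2.h>

	/*
	 * Hypothetical sketch: probe whether the driver accepts a dma-buf
	 * on the OUTPUT queue; on failure the application falls back to
	 * MMAP buffers and a memcpy().
	 */
	static int try_dmabuf_import(int fd, int dmabuf_fd)
	{
		struct v4l2_requestbuffers reqbufs = {
			.count = 1,
			.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
			.memory = V4L2_MEMORY_DMABUF,
		};
		struct v4l2_buffer buf = {
			.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
			.memory = V4L2_MEMORY_DMABUF,
			.index = 0,
		};

		if (ioctl(fd, VIDIOC_REQBUFS, &reqbufs) < 0)
			return -errno;

		buf.m.fd = dmabuf_fd;

		/* The attachment/mapping (and checks like the one above) happens here. */
		if (ioctl(fd, VIDIOC_QBUF, &buf) < 0)
			return -errno;

		return 0;
	}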

> > That said, significance of such blitters
> > nowadays is rather low, as most of the time you need a compositor on
> > the GPU anyway, which can do any transformation in the same pass as
> > the composition.
> 
> I think that is a crucial mistake and the way I see things, this will
> have to change eventually. We cannot keep under-using the fast and
> efficient hardware components and going with the war machine that is
> the GPU in all situations. This has caused enough trouble in the
> GNU/Linux userspace display stack already and I strongly believe it has
> to stop.

Unfortunately there's really no good API to develop drivers against. All
of the 2D APIs that exist are not really efficient when implemented via
hardware-accelerated drivers. And none of the attempts at defining an
API for hardware-accelerated 2D have really gained any momentum.

I had looked a bit at ways to make use of some compositing hardware that
we have on Tegra (which is like a blender/blitter of a sort) and the
best thing I could find would've been to accelerate some paths in Mesa.
However that would require quite a bit of infrastructure work because it
currently completely relies on GPU shaders to accelerate those paths.

Daniel has written a very interesting bit about this, in case you
haven't seen it yet:

	https://blog.ffwll.ch/2018/08/no-2d-in-drm.html

> > > In general, the most significant gain
> > > with having iommu for CODECs is that it makes start up (and re-init)
> > > time much shorter, but also in a much more predictable duration. I do
> > > believe that the Venus driver (qualcomm) is one with solid support for
> > > this, and it's quite noticably more snappy then the others.
> > 
> > Obviously you also get support for USERPTR if you have an IOMMU, but
> > that also has some costs - you need to pin the user pages and map to
> > the IOMMU before each frame and unmap and unpin after each frame,
> > which sometimes is more costly than actually having the userspace copy
> > to a preallocated and premapped buffer, especially for relatively
> > small contents, such as compressed bitstream.
> 
> Heh, interesting point!

I share the same experience. Bitstream buffers are usually so small that
you can always find a physically contiguous memory region for them and a
memcpy() will be faster than the overhead of getting an IOMMU involved.
This obviously depends on the specific hardware, but there's always some
threshold below which mapping through an IOMMU just doesn't make sense
from a fragmentation and/or performance point of view.

I wonder, though, if it's not possible to keep userptr buffers around
and avoid the constant mapping/unmapping. If we only performed cache
maintenance on them as necessary, perhaps that could provide a viable,
maybe even good, zero-copy mechanism.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22 10:42                           ` Thierry Reding
@ 2019-05-22 10:55                             ` Hans Verkuil
  2019-05-22 11:55                               ` Thierry Reding
  2019-06-07  6:11                               ` Tomasz Figa
  0 siblings, 2 replies; 55+ messages in thread
From: Hans Verkuil @ 2019-05-22 10:55 UTC (permalink / raw)
  To: Thierry Reding, Paul Kocialkowski
  Cc: Tomasz Figa, Nicolas Dufresne, Jernej Škrabec,
	Linux Media Mailing List, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Ezequiel Garcia, Jonas Karlman

On 5/22/19 12:42 PM, Thierry Reding wrote:
> On Wed, May 22, 2019 at 10:26:28AM +0200, Paul Kocialkowski wrote:
>> Hi,
>>
>> Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
>>> On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>>>> Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
>>>>> Hi,
>>>>>
>>>>> Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
>>>>>> Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
>>>>>>>> Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
>>>>>>>>>> Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
>>>>>>>>>>>> Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
>>>>>> :
>>>>>>>>>>>>>> Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
>>>>>> écrit :
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With the Rockchip stateless VPU driver in the works, we now
>>>>>>>>>>>>>>> have a
>>>>>>>>>>>>>>> better idea of what the situation is like on platforms other
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>> Allwinner. This email shares my conclusions about the
>>>>>>>>>>>>>>> situation and how
>>>>>>>>>>>>>>> we should update the MPEG-2, H.264 and H.265 controls
>>>>>>>>>>>>>>> accordingly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Per-slice decoding
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We've discussed this one already[0] and Hans has submitted a
>>>>>>>>>>>>>>> patch[1]
>>>>>>>>>>>>>>> to implement the required core bits. When we agree it looks
>>>>>>>>>>>>>>> good, we
>>>>>>>>>>>>>>> should lift the restriction that all slices must be
>>>>>>>>>>>>>>> concatenated and
>>>>>>>>>>>>>>> have them submitted as individual requests.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One question is what to do about other controls. I feel like
>>>>>>>>>>>>>>> it would
>>>>>>>>>>>>>>> make sense to always pass all the required controls for
>>>>>>>>>>>>>>> decoding the
>>>>>>>>>>>>>>> slice, including the ones that don't change across slices.
>>>>>>>>>>>>>>> But there
>>>>>>>>>>>>>>> may be no particular advantage to this and only downsides.
>>>>>>>>>>>>>>> Not doing it
>>>>>>>>>>>>>>> and relying on the "control cache" can work, but we need to
>>>>>>>>>>>>>>> specify
>>>>>>>>>>>>>>> that only a single stream can be decoded per opened instance
>>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>>> v4l2 device. This is the assumption we're going with for
>>>>>>>>>>>>>>> handling
>>>>>>>>>>>>>>> multi-slice anyway, so it shouldn't be an issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My opinion on this is that the m2m instance is a state, and
>>>>>>>>>>>>>> the driver
>>>>>>>>>>>>>> should be responsible of doing time-division multiplexing
>>>>>>>>>>>>>> across
>>>>>>>>>>>>>> multiple m2m instance jobs. Doing the time-division
>>>>>>>>>>>>>> multiplexing in
>>>>>>>>>>>>>> userspace would require some sort of daemon to work properly
>>>>>>>>>>>>>> across
>>>>>>>>>>>>>> processes. I also think the kernel is better place for doing
>>>>>>>>>>>>>> resource
>>>>>>>>>>>>>> access scheduling in general.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with that yes. We always have a single m2m context and
>>>>>>>>>>>>> specific
>>>>>>>>>>>>> controls per opened device so keeping cached values works out
>>>>>>>>>>>>> well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So maybe we shall explicitly require that the request with the
>>>>>>>>>>>>> first
>>>>>>>>>>>>> slice for a frame also contains the per-frame controls.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Annex-B formats
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think we have really reached a conclusion on the
>>>>>>>>>>>>>>> pixel formats
>>>>>>>>>>>>>>> we want to expose. The main issue is how to deal with codecs
>>>>>>>>>>>>>>> that need
>>>>>>>>>>>>>>> the full slice NALU with start code, where the slice_header
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>> duplicated in raw bitstream, when others are fine with just
>>>>>>>>>>>>>>> the encoded
>>>>>>>>>>>>>>> slice data and the parsed slice header control.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My initial thinking was that we'd need 3 formats:
>>>>>>>>>>>>>>> - One that only takes only the slice compressed data
>>>>>>>>>>>>>>> (without raw slice
>>>>>>>>>>>>>>> header and start code);
>>>>>>>>>>>>>>> - One that takes both the NALU data (including start code,
>>>>>>>>>>>>>>> raw header
>>>>>>>>>>>>>>> and compressed data) and slice header controls;
>>>>>>>>>>>>>>> - One that takes the NALU data but no slice header.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I no longer think the latter really makes sense in the
>>>>>>>>>>>>>>> context of
>>>>>>>>>>>>>>> stateless video decoding.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A side-note: I think we should definitely have data offsets
>>>>>>>>>>>>>>> in every
>>>>>>>>>>>>>>> case, so that implementations can just push the whole NALU
>>>>>>>>>>>>>>> regardless
>>>>>>>>>>>>>>> of the format if they're lazy.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I realize that I didn't share our latest research on the
>>>>>>>>>>>>>> subject. So a
>>>>>>>>>>>>>> slice in the original bitstream is formed of the following
>>>>>>>>>>>>>> blocks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (simplified):
>>>>>>>>>>>>>>   [nal_header][nal_type][slice_header][slice]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the details!
>>>>>>>>>>>>>
>>>>>>>>>>>>>> nal_header:
>>>>>>>>>>>>>> This one is a header used to locate the start and the end of
>>>>>>>>>>>>>> the of a
>>>>>>>>>>>>>> NAL. There is two standard forms, the ANNEX B / start code, a
>>>>>>>>>>>>>> sequence
>>>>>>>>>>>>>> of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
>>>>>>>>>>>>>> byte
>>>>>>>>>>>>>> would be a leading 0 from the previous NAL padding, but this
>>>>>>>>>>>>>> is also
>>>>>>>>>>>>>> totally valid start code. The second form is the AVC form,
>>>>>>>>>>>>>> notably used
>>>>>>>>>>>>>> in ISOMP4 container. It simply is the size of the NAL. You
>>>>>>>>>>>>>> must keep
>>>>>>>>>>>>>> your buffer aligned to NALs in this case as you cannot scan
>>>>>>>>>>>>>> from random
>>>>>>>>>>>>>> location.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> nal_type:
>>>>>>>>>>>>>> It's a bit more then just the type, but it contains at least
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> information of the nal type. This has different size on H.264
>>>>>>>>>>>>>> and HEVC
>>>>>>>>>>>>>> but I know it's size is in bytes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> slice_header:
>>>>>>>>>>>>>> This contains per slice parameters, like the modification
>>>>>>>>>>>>>> lists to
>>>>>>>>>>>>>> apply on the references. This one has a size in bits, not in
>>>>>>>>>>>>>> bytes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> slice:
>>>>>>>>>>>>>> I don't really know what is in it exactly, but this is the
>>>>>>>>>>>>>> data used to
>>>>>>>>>>>>>> decode. This bit has a special coding called the
>>>>>>>>>>>>>> anti-emulation, which
>>>>>>>>>>>>>> prevents a start-code from appearing in it. This coding is
>>>>>>>>>>>>>> present in
>>>>>>>>>>>>>> both forms, ANNEX-B or AVC (in GStreamer and some reference
>>>>>>>>>>>>>> manual they
>>>>>>>>>>>>>> call ANNEX-B the bytestream format).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, what we notice is that what is currently passed through
>>>>>>>>>>>>>> Cedrus
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> driver:
>>>>>>>>>>>>>>   [nal_type][slice_header][slice]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This matches what is being passed through VA-API. We can
>>>>>>>>>>>>>> understand
>>>>>>>>>>>>>> that stripping off the slice_header would be hard, since it's
>>>>>>>>>>>>>> size is
>>>>>>>>>>>>>> in bits. Instead we pass size and header_bit_size in
>>>>>>>>>>>>>> slice_params.
>>>>>>>>>>>>>
>>>>>>>>>>>>> True, there is that.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> About Rockchip. RK3288 is a Hantro G1 and has a bit called
>>>>>>>>>>>>>> start_code_e, when you turn this off, you don't need start
>>>>>>>>>>>>>> code. As a
>>>>>>>>>>>>>> side effect, the bitstream becomes identical. We do now know
>>>>>>>>>>>>>> that it
>>>>>>>>>>>>>> works with the ffmpeg branch implement for cedrus.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Oh great, that makes life easier in the short term, but I guess
>>>>>>>>>>>>> the
>>>>>>>>>>>>> issue could arise on another decoder sooner or later.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now what's special about Hantro G1 (also found on IMX8M) is
>>>>>>>>>>>>>> that it
>>>>>>>>>>>>>> take care for us of reading and executing the modification
>>>>>>>>>>>>>> lists found
>>>>>>>>>>>>>> in the slice header. Mostly because I very disliked having to
>>>>>>>>>>>>>> pass the
>>>>>>>>>>>>>> p/b0/b1 parameters, is that Boris implemented in the driver
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> transformation from the DPB entries into this p/b0/b1 list.
>>>>>>>>>>>>>> These list
>>>>>>>>>>>>>> a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> following section is the execution of the modification list.
>>>>>>>>>>>>>> As this
>>>>>>>>>>>>>> list is not modified, it only need to be calculated per frame.
>>>>>>>>>>>>>> As a
>>>>>>>>>>>>>> result, we don't need these new lists, and we can work with
>>>>>>>>>>>>>> the same
>>>>>>>>>>>>>> H264_SLICE format as Cedrus is using.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes but I definitely think it makes more sense to pass the list
>>>>>>>>>>>>> modifications rather than reconstructing those in the driver
>>>>>>>>>>>>> from a
>>>>>>>>>>>>> full list. IMO controls should stick to the bitstream as close
>>>>>>>>>>>>> as
>>>>>>>>>>>>> possible.
>>>>>>>>>>>>
>>>>>>>>>>>> For Hantro and RKVDEC, the list of modification is parsed by the
>>>>>>>>>>>> IP
>>>>>>>>>>>> from the slice header bits. Just to make sure, because I myself
>>>>>>>>>>>> was
>>>>>>>>>>>> confused on this before, the slice header does not contain a list
>>>>>>>>>>>> of
>>>>>>>>>>>> references, instead it contains a list modification to be applied
>>>>>>>>>>>> to
>>>>>>>>>>>> the reference list. I need to check again, but to execute these
>>>>>>>>>>>> modification, you need to filter and sort the references in a
>>>>>>>>>>>> specific
>>>>>>>>>>>> order. This should be what is defined in the spec as 8.2.4.1 and
>>>>>>>>>>>> 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
>>>>>>>>>>>>
>>>>>>>>>>>> The list of references is deduced from the DPB. The DPB, which I
>>>>>>>>>>>> thinks
>>>>>>>>>>>> should be rename as "references", seems more useful then p/b0/b1,
>>>>>>>>>>>> since
>>>>>>>>>>>> this is the data that gives use the ability to implementing glue
>>>>>>>>>>>> in the
>>>>>>>>>>>> driver to compensate some HW differences.
>>>>>>>>>>>>
>>>>>>>>>>>> In the case of Hantro / RKVDEC, we think it's natural to build the
>>>>>>>>>>>> HW
>>>>>>>>>>>> specific lists (p/b0/b1) from the references rather then adding HW
>>>>>>>>>>>> specific list in the decode_params structure. The fact these lists
>>>>>>>>>>>> are
>>>>>>>>>>>> standard intermediate step of the standard is not that important.
>>>>>>>>>>>
>>>>>>>>>>> Sorry I got confused (once more) about it. Boris just explained the
>>>>>>>>>>> same thing to me over IRC :) Anyway my point is that we want to pass
>>>>>>>>>>> what's in ffmpeg's short and long term ref lists, and name them that
>>>>>>>>>>> instead of dpb.
>>>>>>>>>>>
>>>>>>>>>>>>>> Now, this is just a start. For RK3399, we have a different
>>>>>>>>>>>>>> CODEC
>>>>>>>>>>>>>> design. This one does not have the start_code_e bit. What the
>>>>>>>>>>>>>> IP does,
>>>>>>>>>>>>>> is that you give it one or more slice per buffer, setup the
>>>>>>>>>>>>>> params,
>>>>>>>>>>>>>> start decoding, but the decoder then return the location of
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> following NAL. So basically you could offload the scanning of
>>>>>>>>>>>>>> start
>>>>>>>>>>>>>> code to the HW. That being said, with the driver layer in
>>>>>>>>>>>>>> between, that
>>>>>>>>>>>>>> would be amazingly inconvenient to use, and with Boyer-more
>>>>>>>>>>>>>> algorithm,
>>>>>>>>>>>>>> it is pretty cheap to scan this type of start-code on CPU. But
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> feature that this allows is to operate in frame mode. In this
>>>>>>>>>>>>>> mode, you
>>>>>>>>>>>>>> have 1 interrupt per frame.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure there is any interest in exposing that from
>>>>>>>>>>>>> userspace and
>>>>>>>>>>>>> my current feeling is that we should just ditch support for
>>>>>>>>>>>>> per-frame
>>>>>>>>>>>>> decoding altogether. I think it mixes decoding with notions that
>>>>>>>>>>>>> are
>>>>>>>>>>>>> higher-level than decoding, but I agree it's a blurry line.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not worried about this either. We can already support that by
>>>>>>>>>>>> copying the bitstream internally to the driver, though zero-copy
>>>>>>>>>>>> with
>>>>>>>>>>>> this would require a new format, the one we talked about,
>>>>>>>>>>>> SLICE_ANNEX_B.
>>>>>>>>>>>
>>>>>>>>>>> Right, but what I'm thinking about is making that the one and only
>>>>>>>>>>> format. The rationale is that it's always easier to just append a
>>>>>>>>>>> start
>>>>>>>>>>> code from userspace if needed. And we need a bit offset to the slice
>>>>>>>>>>> data part anyway, so it doesn't hurt to require a few extra bits to
>>>>>>>>>>> have the whole thing that will work in every situation.
>>>>>>>>>>
>>>>>>>>>> What I'd like is to eventually allow zero-copy (aka userptr) into the
>>>>>>>>>> driver. If you make the start code mandatory, any decoding from ISOMP4
>>>>>>>>>> (.mp4, .mov) will require a full bitstream copy in userspace to add
>>>>>>>>>> the
>>>>>>>>>> start code (unless you hack your allocation in your demuxer, but it's
>>>>>>>>>> a
>>>>>>>>>> bit complicated since this code might come from two libraries). In
>>>>>>>>>> ISOMP4, you have an AVC header, which is just the size of the NAL that
>>>>>>>>>> follows.
>>>>>>>>>
>>>>>>>>> Well, I think we have to do a copy from system memory to the buffer
>>>>>>>>> allocated by v4l2 anyway. Our hardware pipelines can reasonably be
>>>>>>>>> expected not to have any MMU unit and not allow sg import anyway.
>>>>>>>>
>>>>>>>> The Rockchip has an mmu. You need one copy at least indeed,
>>>>>>>
>>>>>>> Is the MMU in use currently? That can make things troublesome if we run
>>>>>>> into a case where the VPU has MMU and deals with scatter-gather while
>>>>>>> the display part doesn't. As far as I know, there's no way for
>>>>>>> userspace to know whether a dma-buf-exported buffer is backed by CMA or
>>>>>>> by scatter-gather memory. This feels like a major issue for using dma-
>>>>>>> buf, since userspace can't predict whether a buffer exported on one
>>>>>>> device can be imported on another when building its pipeline.
>>>>>>
>>>>>> FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
>>>>>> for it yet. It is supported for display, both VPUs and some other devices. I
>>>>>> think no sane SoC designer would left out one or another unit without IOMMU
>>>>>> support, that just calls for troubles, as you pointed out.
>>>>>
>>>>> Right right, I've been following that from a distance :)
>>>>>
>>>>> Indeed I think it's realistic to expect that for now, but it may not
>>>>> play out so well in the long term. For instance, maybe connecting a USB
>>>>> display would require CMA when the rest of the system can do with sg.
>>>>>
>>>>> I think it would really be useful for userspace to have a way to test
>>>>> whether a buffer can be imported from one device to another. It feels
>>>>> better than indicating where the memory lives, since there are
>>>>> countless cases where additional restrictions apply too.
>>>>
>>>> I don't know for the integration on the Rockchip, but I did notice the
>>>> register documentation for it.
>>>
>>> All the important components in the SoC have their IOMMUs as well -
>>> display controller, GPU.
>>>
>>> There is a blitter called RGA that is not behind an IOMMU, but has
>>> some scatter-gather capability (with a need for the hardware sg table
>>> to be physically contiguous). 
>>
>> That's definitely good to know and justfies the need to introduce a way
>> for userspace to check if a buffer can be imported from one device to
>> another.
> 
> There's been a lot of discussion about this before. You may be aware of
> James Jones' attempt to create an allocator library for this:
> 
> 	https://github.com/cubanismo/allocator
> 
> I haven't heard an update on this for quite some time and I think it's
> stagnated due to a lack of interest. However, I think the lack of
> interest could be an indicator that the issue might not be pressing
> enough. Luckily most SoCs are reasonably integrated, so there's usually
> no issue sharing buffers between different hardware blocks.
> 
> Technically it's already possible to check for compatibility of buffers
> at import time.
> 
> In the tegra-vde driver we do something along the lines of:
> 
> 	sgt = dma_buf_map_attachment(...);
> 	...
> 	if (sgt->nents != 1)
> 		return -EINVAL;
> 
> because we don't support an IOMMU currently. Of course its still up to
> userspace to react to that in a sensible way and it may not be obvious
> what to do when the import fails.
> 
>>> That said, significance of such blitters
>>> nowadays is rather low, as most of the time you need a compositor on
>>> the GPU anyway, which can do any transformation in the same pass as
>>> the composition.
>>
>> I think that is a crucial mistake and the way I see things, this will
>> have to change eventually. We cannot keep under-using the fast and
>> efficient hardware components and going with the war machine that is
>> the GPU in all situations. This has caused enough trouble in the
>> GNU/Linux userspace display stack already and I strongly believe it has
>> to stop.
> 
> Unfortunately there's really no good API to develop drivers against. All
> of the 2D APIs that exist are not really efficient when implemented via
> hardware-accelerated drivers. And none of the attempts at defining an
> API for hardware-accelerated 2D haven't really gained any momentum.
> 
> I had looked a bit at ways to make use of some compositing hardware that
> we have on Tegra (which is like a blender/blitter of a sort) and the
> best thing I could find would've been to accelerate some paths in Mesa.
> However that would require quite a bit of infrastructure work because it
> currently completely relies on GPU shaders to accelerate those paths.
> 
> Daniel has written a very interesting bit about this, in case you
> haven't seen it yet:
> 
> 	https://blog.ffwll.ch/2018/08/no-2d-in-drm.html
> 
>>>> In general, the most significant gain
>>>> with having iommu for CODECs is that it makes start up (and re-init)
>>>> time much shorter, but also in a much more predictable duration. I do
>>>> believe that the Venus driver (qualcomm) is one with solid support for
>>>> this, and it's quite noticably more snappy then the others.
>>>
>>> Obviously you also get support for USERPTR if you have an IOMMU, but
>>> that also has some costs - you need to pin the user pages and map to
>>> the IOMMU before each frame and unmap and unpin after each frame,
>>> which sometimes is more costly than actually having the userspace copy
>>> to a preallocated and premapped buffer, especially for relatively
>>> small contents, such as compressed bitstream.
>>
>> Heh, interesting point!
> 
> I share the same experience. Bitstream buffers are usually so small that
> you can always find a physically contiguous memory region for them and a
> memcpy() will be faster than the overhead of getting an IOMMU involved.
> This obviously depends on the specific hardware, but there's always some
> threshold before which mapping through an IOMMU just doesn't make sense
> from a fragmentation and/or performance point of view.
> 
> I wonder, though, if it's not possible to keep userptr buffers around
> and avoid the constant mapping/unmapping. If we only performed cache
> maintenance on them as necessary, perhaps that could provide a viable,
> maybe even good, zero-copy mechanism.

The vb2 framework will keep the mapping for a userptr as long as userspace
uses the same userptr for every buffer.

I.e. the first time a buffer with index I is queued the userptr is mapped.
If that buffer is later dequeued and then requeued again with the same
userptr the vb2 core will reuse the old mapping. Otherwise it will unmap
and map again with the new userptr.

The same is done for dmabuf, BTW. So if userspace keeps changing dmabuf
fds for each buffer, then that is not optimal.
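
For illustration (details made up), the pattern that benefits from this
on the userspace side is roughly: keep one fixed allocation per buffer
index and always requeue it with the same index, e.g.:

	#include <sys/ioctl.h>
	#include <linux/videodev2.h>

	#define NUM_BUFS 4
	#define BUF_SIZE (1024 * 1024)	/* arbitrary */

	/* One fixed allocation per index, allocated elsewhere (malloc/mmap). */
	static void *user_bufs[NUM_BUFS];

	/* Requeue the same userptr with the same index so vb2 reuses the mapping. */
	static int queue_bitstream(int fd, unsigned int index, unsigned int used)
	{
		struct v4l2_buffer buf = {
			.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
			.memory = V4L2_MEMORY_USERPTR,
			.index = index,
			.m.userptr = (unsigned long)user_bufs[index],
			.length = BUF_SIZE,
			.bytesused = used,
		};

		return ioctl(fd, VIDIOC_QBUF, &buf);
	}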

Regards,

	Hans

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  9:29               ` Paul Kocialkowski
@ 2019-05-22 11:39                 ` Thierry Reding
  2019-05-22 18:31                   ` Nicolas Dufresne
  2019-05-22 18:26                 ` Nicolas Dufresne
  1 sibling, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-05-22 11:39 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Boris Brezillon, Tomasz Figa, Nicolas Dufresne,
	Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Maxime Ripard, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 12767 bytes --]

On Wed, May 22, 2019 at 11:29:13AM +0200, Paul Kocialkowski wrote:
> Le mercredi 22 mai 2019 à 10:32 +0200, Thierry Reding a écrit :
> > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > On Wed, 22 May 2019 15:39:37 +0900
> > > Tomasz Figa <tfiga@chromium.org> wrote:
> > > 
> > > > > It would be premature to state that we are excluding. We are just
> > > > > trying to find one format to get things upstream, and make sure we have
> > > > > a plan how to extend it. Trying to support everything on the first try
> > > > > is not going to work so well.
> > > > > 
> > > > > What is interesting to provide is how does you IP achieve multi-slice
> > > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > > Typical questions are:
> > > > > 
> > > > >   1. Do all slices have to be contiguous in memory
> > > > >   2. If 1., do you place start-code, AVC header or pass a seperate index to let the HW locate the start of each NAL ?
> > > > >   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)  
> > > > 
> > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > > driver that was created mostly by RK themselves, we've been assuming
> > > > that the interrupt is for the complete frame, without any problems.
> > > 
> > > I confirm that's what happens when all slices forming a frame are packed
> > > in a single output buffer: you only get one interrupt at the end of the
> > > decoding process (in that case, when the frame is decoded). Of course,
> > > if you split things up and do per-slice decoding instead (one slice per
> > > buffer) you get an interrupt per slice, though I didn't manage to make
> > > that work.
> > > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > > interrupt on the second slice, which makes me think some states are
> > > reset between the 2 operations leading the engine to think that the
> > > second slice is part of a new frame.
> > 
> > That sounds a lot like how this works on Tegra. My understanding is that
> > for slice decoding you'd also get an interrupt every time a full slice
> > has been decoded perhaps coupled with another "frame done" interrupt
> > when the full frame has been decoded after the last slice.
> > 
> > In frame-level decode mode you don't get interrupts in between and
> > instead only get the "frame done" interrupt. Unless something went wrong
> > during decoding, in which case you also get an interrupt but with error
> > flags and status registers that help determine what exactly happened.
> > 
> > > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > > and per-frame decoding and maybe have a way to expose what a
> > > specific codec can do (through an extra cap mechanism).
> > 
> > Yeah, I think it makes sense to support both for devices that can do
> > both. From what Nicolas said it may make sense for an application to
> > want to do slice-level decoding if receiving a stream from the network
> > and frame-level decoding if playing back from a local file. If a driver
> > supports both, the application could detect that and choose the
> > appropriate format.
> > 
> > It sounds to me like using different input formats for that would be a
> > very natural way to describe it. Applications can already detect the set
> > of supported input formats and set the format when they allocate buffers
> > so that should work very nicely.
> 
> Pixel formats are indeed the natural way to go about this, but I have
> some reservations in this case. Slices are the natural unit of video
> streams, just like frames are to display hardware. Part of the pipeline
> configuration is slice-specific, so in theory, the pipeline needs to be
> reconfigured with each slice.
> 
> What we have been doing in Cedrus is to currently gather all the slices
> and use the last slice's specific configuration for the pipeline, which
> sort of works, but is very likely not a good idea.

To be honest, my testing has been very minimal, so it's quite possible
that I've always only run into examples with either only a single slice
or multiple slices with the same configuration. Or perhaps with
differing configurations but non-significant (or non-noticeable)
differences.

> You mentionned that the Tegra VPU currentyl always operates in frame
> mode (even when the stream actually has multiple slices, which I assume
> are gathered at some point). I wonder how it goes about configuring
> different slice parameters (which are specific to each slice, not
> frame) for the different slices.

That's part of the beauty of the frame-level decoding mode (I think
that's called SXE-P). The syntax engine has access to the complete
bitstream and can parse all the information that it needs. There's some
data that we pass into the decoder from the SPS and PPS, but other than
that the VDE will do everything by itself.

> I believe we should at least always expose per-slice granularity in the
> pixel format and requests. Maybe we could have a way to allow multiple
> slices to be gathered in the source buffer and have a control slice
> array for each request. In that case, we'd have a single request queued
> for the series of slices, with a bit offset in each control to the
> matching slice.
> 
> Then we could specify that such slices must be appended in a way that
> suits most decoders that would have to operate per-frame (so we need to
> figure this out) and worst case, we'll always have offsets in the
> controls if we need to setup a bounce buffer in the driver because
> things are not laid out the way we specified.
> 
> Then we introduce a specific cap to indicate which mode is supported
> (per-slice and/or per-frame) and adapt our ffmpeg reference to be able
> to operate in both modes.
> 
> That adds some complexity for userspace, but I don't think we can avoid
> it at this point and it feels better than having two different pixel
> formats (which would probably be even more complex to manage for
> userspace).
> 
> What do you think?

I'm not sure I understand why this would be simpler than exposing two
different pixel formats. It sounds like essentially the same thing, just
with a different method.

One advantage I see with your approach is that it more formally defines
how slices are passed. This might be a good thing to do anyway. I'm not
sure if software stacks already provide that information. If they do this
would be trivial to achieve. If they don't, this could be an extra burden
on userspace for decoders that don't need it.
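
For what it's worth, probing which format(s) a decoder exposes is cheap
for applications either way. A minimal sketch of the detection, where the
frame-based fourcc is a purely made-up placeholder (only the slice-based
format is actually being proposed so far):

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Hypothetical fourcc for a frame-based input format: it does not
 * exist, it only stands in for the "second pixel format" option. */
#define MY_PIX_FMT_H264_FRAME	v4l2_fourcc('H', '2', '6', 'F')

static bool output_format_supported(int video_fd, __u32 pixelformat)
{
	struct v4l2_fmtdesc desc = {
		.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
	};

	for (desc.index = 0;
	     ioctl(video_fd, VIDIOC_ENUM_FMT, &desc) == 0;
	     desc.index++)
		if (desc.pixelformat == pixelformat)
			return true;

	return false;
}

An application could then prefer the frame-based variant for local file
playback and fall back to the slice-based one otherwise.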

Would it perhaps be possible to make this slice meta data optional? For
example, could we just provide an H.264 slice pixel format and then let
userspace fill in buffers in whatever way they want, provided that they
follow some rules (must be annex B or something else, concatenated
slices, ...) and then if there's an extra control specifying the offsets
of individual slices drivers can use that, if not they just pass the
bitstream buffer to the hardware if frame-level decoding is supported
and let the hardware do its thing?

Hardware that has requirements different from that could require the
meta data to be present and fail otherwise.

On the other hand, userspace would have to be prepared to deal with this
type of hardware anyway, so it basically needs to provide the meta data
in any case. Perhaps the meta data could be optional if a buffer
contains a single slice.
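
To make the discussion concrete, the optional per-slice meta data could be
something as simple as the following. This is entirely hypothetical: both
the name and the layout are made up, it only illustrates the idea of an
offsets control attached to a buffer holding a whole frame's worth of
slices:

#include <linux/types.h>

#define MY_MAX_SLICES	16

/* Hypothetical control payload: one offset/size pair per slice NALU
 * contained in the OUTPUT buffer. */
struct my_ctrl_h264_slice_offsets {
	__u32 num_slices;
	__u32 offset[MY_MAX_SLICES];	/* byte offset into the buffer */
	__u32 size[MY_MAX_SLICES];	/* NALU size in bytes */
};

Drivers that program the hardware per slice would require it, while
frame-level decoders could simply ignore it (or it could be omitted when
a buffer contains a single slice, as suggested above).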

One other thing that occurred to me is that the meta data could perhaps
contain a more elaborate description of the data in the slice. But that
has the problem that it can't be detected upfront, so userspace can't
discover whether the decoder can handle that data until an error is
returned from the decoder upon receiving the meta data.

To answer your question: I don't feel strongly one way or the other. The
above is really just discussing the specifics of how the data is passed,
but we don't really know what exactly the data is that we need to pass.

> > > The other option would be to support only per-slice decoding with a
> > > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > > only support per-frame decoding know when they should trigger the
> > > decoding operation. The downside is that it implies having a bounce
> > > buffer where the driver can pack slices to be decoded on the END_FRAME
> > > event.
> > 
> > I vaguely remember that that's what the video codec abstraction does in
> > Mesa/Gallium. 
> 
> Well, if it's exposed through VDPAU or VAAPI, the interface already
> operates per-slice and it would certainly not be a big issue to change
> that.

The video pipe callbacks can implement a ->decode_bitstream() callback
that gets a number of buffer/size pairs along with a picture description
(which corresponds roughly to the SPS/PPS). The buffer/size pairs are
exactly what's passed in from VDPAU or VAAPI. It looks like VDPAU can
pass multiple slices, each per VdpBitstreamBuffer, whereas VAAPI passes
only a single buffer at a time at the driver level.

(Interesting side-note: VDPAU seems to require the start code to be part
of the bitstream, whereas the VAAPI state tracker in Mesa will go and
check whether a buffer contains the start code and prepend it via SG if
not. So at the pipe_video_codec level it seems the decision was made to
use annex B as the lowest common denominator).
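
For reference, this is roughly the hook in question, reconstructed from
memory of p_video_codec.h (other members omitted, the exact prototype may
differ between Mesa versions):

struct pipe_video_buffer;
struct pipe_picture_desc;

struct pipe_video_codec {
	/* ... other members omitted ... */
	void (*decode_bitstream)(struct pipe_video_codec *codec,
				 struct pipe_video_buffer *target,
				 struct pipe_picture_desc *picture,
				 unsigned num_buffers,
				 const void * const *buffers,
				 const unsigned *sizes);
};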

> Talking about the mesa/gallium video decoding stuff, I think it would
> be worth having V4L2 interfaces for that now that we have the Request
> API.

Yeah, I think that'd be nice, but I'm not sure that you're going to find
someone to redo all the work...

> Basically, Nvidia GPUs have video decoding blocks (which could be
> similar to the ones present on Tegra) that are accessed through a
> firmware running on a Falcon MCU on the GPU side.

Yeah, the video decoding blocks on GPUs are very similar to the ones
found on more recent Tegra. The big difference, of course, is that on
Tegra they are separate (platform) devices, whereas on the GPU they are
part of the PCI device's register space. It'd be nice if we could
somehow share drivers between the two, but I'm not sure that that's
possible. Besides the different bus there are also differences in how
memory is managed (video RAM on GPU vs. system memory on Tegra) and so
on.

> Having a standardized firmware interface for these and a V4L2 M2M
> driver for the interface would certainly make it easier for everyone to
> handle that. I don't really see why this video decoding hardware has
> to be exposed through the display stack anyway, and one could want to
> use the GPU's video decoder without bringing up the shading cores.

Are you saying that it might be possible to structure this as basically
two "backend" drivers that each expose the command stream interface and
then build a "frontend" driver that could talk to either backend? That
sounds like a really nice idea, but I'm not sure that it'd work.

> > I'm not very familiar with V4L2, but this seems like it
> > could be problematic to integrate with the way that V4L2 works in
> > general. Perhaps sending a special buffer (0 length or whatever) to mark
> > the end of a frame would work. But this is probably something that
> > others have already thought about, since slice-level decoding is what
> > most people are using, hence there must already be a way for userspace
> > to somehow synchronize input vs. output buffers. Or does this currently
> > just work by queueing bitstream buffers as fast as possible and then
> > dequeueing frame buffers as they become available?
> 
> We have a Request API mechanism where we group controls (parsed
> bitstream meta-data) and source (OUTPUT) buffers together and submit
> them tied. When each request gets processed its buffer enters the
> OUTPUT queue, which gets picked up by the driver and associated with
> the first destination (CAPTURE) buffer available. Then the driver grabs
> the buffers and applies the controls matching the source buffer's
> request before starting decoding with M2M.
> 
> We have already worked on handling the case of requiring a single
> destination buffer for the different slices, by having a flag to
> indicate whether the destination buffer should be held.

Right. So it sounds like the request is the natural boundary here. I guess
that would allow drivers to manually concatenate accumulated bitstream
buffers into a single one.
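
For completeness, a minimal sketch of what one such request looks like
from userspace with the Request API (the control ID and payload are
placeholders for whatever codec-specific controls end up being defined;
error handling, timestamps and the CAPTURE side are omitted):

#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

int queue_slice_request(int media_fd, int video_fd, __u32 index,
			__u32 slice_size, __u32 ctrl_id,
			void *params, __u32 params_size)
{
	int req_fd;
	struct v4l2_ext_control ctrl = {
		.id = ctrl_id,		/* placeholder codec control */
		.size = params_size,
		.ptr = params,
	};
	struct v4l2_ext_controls ctrls = {
		.which = V4L2_CTRL_WHICH_REQUEST_VAL,
		.count = 1,
		.controls = &ctrl,
	};
	struct v4l2_buffer buf = {
		.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
		.memory = V4L2_MEMORY_MMAP,
		.index = index,
		.bytesused = slice_size,
		.flags = V4L2_BUF_FLAG_REQUEST_FD,
	};

	/* One request groups the parsed parameters with the slice data. */
	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd))
		return -1;

	ctrls.request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls))
		return -1;

	buf.request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_QBUF, &buf))
		return -1;

	/* The driver only picks everything up once the request is queued. */
	return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}

The flag to hold the destination buffer (plus matching OUTPUT timestamps)
is then what ties several such requests to a single decoded frame.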

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22 10:55                             ` Hans Verkuil
@ 2019-05-22 11:55                               ` Thierry Reding
  2019-06-07  6:11                               ` Tomasz Figa
  1 sibling, 0 replies; 55+ messages in thread
From: Thierry Reding @ 2019-05-22 11:55 UTC (permalink / raw)
  To: Hans Verkuil
  Cc: Paul Kocialkowski, Tomasz Figa, Nicolas Dufresne,
	Jernej Škrabec, Linux Media Mailing List, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 22354 bytes --]

On Wed, May 22, 2019 at 12:55:53PM +0200, Hans Verkuil wrote:
> On 5/22/19 12:42 PM, Thierry Reding wrote:
> > On Wed, May 22, 2019 at 10:26:28AM +0200, Paul Kocialkowski wrote:
> >> Hi,
> >>
> >> Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
> >>> On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> >>>> Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> >>>>> Hi,
> >>>>>
> >>>>> Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> >>>>>> Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> >>>>>>>> Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> >>>>>>>>>> Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> >>>>>>>>>>>> Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> >>>>>> :
> >>>>>>>>>>>>>> Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> >>>>>> écrit :
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With the Rockchip stateless VPU driver in the works, we now
> >>>>>>>>>>>>>>> have a
> >>>>>>>>>>>>>>> better idea of what the situation is like on platforms other
> >>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>> Allwinner. This email shares my conclusions about the
> >>>>>>>>>>>>>>> situation and how
> >>>>>>>>>>>>>>> we should update the MPEG-2, H.264 and H.265 controls
> >>>>>>>>>>>>>>> accordingly.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Per-slice decoding
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We've discussed this one already[0] and Hans has submitted a
> >>>>>>>>>>>>>>> patch[1]
> >>>>>>>>>>>>>>> to implement the required core bits. When we agree it looks
> >>>>>>>>>>>>>>> good, we
> >>>>>>>>>>>>>>> should lift the restriction that all slices must be
> >>>>>>>>>>>>>>> concatenated and
> >>>>>>>>>>>>>>> have them submitted as individual requests.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One question is what to do about other controls. I feel like
> >>>>>>>>>>>>>>> it would
> >>>>>>>>>>>>>>> make sense to always pass all the required controls for
> >>>>>>>>>>>>>>> decoding the
> >>>>>>>>>>>>>>> slice, including the ones that don't change across slices.
> >>>>>>>>>>>>>>> But there
> >>>>>>>>>>>>>>> may be no particular advantage to this and only downsides.
> >>>>>>>>>>>>>>> Not doing it
> >>>>>>>>>>>>>>> and relying on the "control cache" can work, but we need to
> >>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>> that only a single stream can be decoded per opened instance
> >>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>> v4l2 device. This is the assumption we're going with for
> >>>>>>>>>>>>>>> handling
> >>>>>>>>>>>>>>> multi-slice anyway, so it shouldn't be an issue.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My opinion on this is that the m2m instance is a state, and
> >>>>>>>>>>>>>> the driver
> >>>>>>>>>>>>>> should be responsible of doing time-division multiplexing
> >>>>>>>>>>>>>> across
> >>>>>>>>>>>>>> multiple m2m instance jobs. Doing the time-division
> >>>>>>>>>>>>>> multiplexing in
> >>>>>>>>>>>>>> userspace would require some sort of daemon to work properly
> >>>>>>>>>>>>>> across
> >>>>>>>>>>>>>> processes. I also think the kernel is better place for doing
> >>>>>>>>>>>>>> resource
> >>>>>>>>>>>>>> access scheduling in general.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree with that yes. We always have a single m2m context and
> >>>>>>>>>>>>> specific
> >>>>>>>>>>>>> controls per opened device so keeping cached values works out
> >>>>>>>>>>>>> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So maybe we shall explicitly require that the request with the
> >>>>>>>>>>>>> first
> >>>>>>>>>>>>> slice for a frame also contains the per-frame controls.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Annex-B formats
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think we have really reached a conclusion on the
> >>>>>>>>>>>>>>> pixel formats
> >>>>>>>>>>>>>>> we want to expose. The main issue is how to deal with codecs
> >>>>>>>>>>>>>>> that need
> >>>>>>>>>>>>>>> the full slice NALU with start code, where the slice_header
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> duplicated in raw bitstream, when others are fine with just
> >>>>>>>>>>>>>>> the encoded
> >>>>>>>>>>>>>>> slice data and the parsed slice header control.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> My initial thinking was that we'd need 3 formats:
> >>>>>>>>>>>>>>> - One that only takes only the slice compressed data
> >>>>>>>>>>>>>>> (without raw slice
> >>>>>>>>>>>>>>> header and start code);
> >>>>>>>>>>>>>>> - One that takes both the NALU data (including start code,
> >>>>>>>>>>>>>>> raw header
> >>>>>>>>>>>>>>> and compressed data) and slice header controls;
> >>>>>>>>>>>>>>> - One that takes the NALU data but no slice header.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> But I no longer think the latter really makes sense in the
> >>>>>>>>>>>>>>> context of
> >>>>>>>>>>>>>>> stateless video decoding.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A side-note: I think we should definitely have data offsets
> >>>>>>>>>>>>>>> in every
> >>>>>>>>>>>>>>> case, so that implementations can just push the whole NALU
> >>>>>>>>>>>>>>> regardless
> >>>>>>>>>>>>>>> of the format if they're lazy.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I realize that I didn't share our latest research on the
> >>>>>>>>>>>>>> subject. So a
> >>>>>>>>>>>>>> slice in the original bitstream is formed of the following
> >>>>>>>>>>>>>> blocks
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> (simplified):
> >>>>>>>>>>>>>>   [nal_header][nal_type][slice_header][slice]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the details!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> nal_header:
> >>>>>>>>>>>>>> This one is a header used to locate the start and the end of
> >>>>>>>>>>>>>> the of a
> >>>>>>>>>>>>>> NAL. There is two standard forms, the ANNEX B / start code, a
> >>>>>>>>>>>>>> sequence
> >>>>>>>>>>>>>> of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> >>>>>>>>>>>>>> byte
> >>>>>>>>>>>>>> would be a leading 0 from the previous NAL padding, but this
> >>>>>>>>>>>>>> is also
> >>>>>>>>>>>>>> totally valid start code. The second form is the AVC form,
> >>>>>>>>>>>>>> notably used
> >>>>>>>>>>>>>> in ISOMP4 container. It simply is the size of the NAL. You
> >>>>>>>>>>>>>> must keep
> >>>>>>>>>>>>>> your buffer aligned to NALs in this case as you cannot scan
> >>>>>>>>>>>>>> from random
> >>>>>>>>>>>>>> location.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> nal_type:
> >>>>>>>>>>>>>> It's a bit more then just the type, but it contains at least
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> information of the nal type. This has different size on H.264
> >>>>>>>>>>>>>> and HEVC
> >>>>>>>>>>>>>> but I know it's size is in bytes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> slice_header:
> >>>>>>>>>>>>>> This contains per slice parameters, like the modification
> >>>>>>>>>>>>>> lists to
> >>>>>>>>>>>>>> apply on the references. This one has a size in bits, not in
> >>>>>>>>>>>>>> bytes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> slice:
> >>>>>>>>>>>>>> I don't really know what is in it exactly, but this is the
> >>>>>>>>>>>>>> data used to
> >>>>>>>>>>>>>> decode. This bit has a special coding called the
> >>>>>>>>>>>>>> anti-emulation, which
> >>>>>>>>>>>>>> prevents a start-code from appearing in it. This coding is
> >>>>>>>>>>>>>> present in
> >>>>>>>>>>>>>> both forms, ANNEX-B or AVC (in GStreamer and some reference
> >>>>>>>>>>>>>> manual they
> >>>>>>>>>>>>>> call ANNEX-B the bytestream format).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So, what we notice is that what is currently passed through
> >>>>>>>>>>>>>> Cedrus
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> driver:
> >>>>>>>>>>>>>>   [nal_type][slice_header][slice]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This matches what is being passed through VA-API. We can
> >>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>> that stripping off the slice_header would be hard, since it's
> >>>>>>>>>>>>>> size is
> >>>>>>>>>>>>>> in bits. Instead we pass size and header_bit_size in
> >>>>>>>>>>>>>> slice_params.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> True, there is that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> About Rockchip. RK3288 is a Hantro G1 and has a bit called
> >>>>>>>>>>>>>> start_code_e, when you turn this off, you don't need start
> >>>>>>>>>>>>>> code. As a
> >>>>>>>>>>>>>> side effect, the bitstream becomes identical. We do now know
> >>>>>>>>>>>>>> that it
> >>>>>>>>>>>>>> works with the ffmpeg branch implement for cedrus.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Oh great, that makes life easier in the short term, but I guess
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>> issue could arise on another decoder sooner or later.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Now what's special about Hantro G1 (also found on IMX8M) is
> >>>>>>>>>>>>>> that it
> >>>>>>>>>>>>>> take care for us of reading and executing the modification
> >>>>>>>>>>>>>> lists found
> >>>>>>>>>>>>>> in the slice header. Mostly because I very disliked having to
> >>>>>>>>>>>>>> pass the
> >>>>>>>>>>>>>> p/b0/b1 parameters, is that Boris implemented in the driver
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> transformation from the DPB entries into this p/b0/b1 list.
> >>>>>>>>>>>>>> These list
> >>>>>>>>>>>>>> a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> following section is the execution of the modification list.
> >>>>>>>>>>>>>> As this
> >>>>>>>>>>>>>> list is not modified, it only need to be calculated per frame.
> >>>>>>>>>>>>>> As a
> >>>>>>>>>>>>>> result, we don't need these new lists, and we can work with
> >>>>>>>>>>>>>> the same
> >>>>>>>>>>>>>> H264_SLICE format as Cedrus is using.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes but I definitely think it makes more sense to pass the list
> >>>>>>>>>>>>> modifications rather than reconstructing those in the driver
> >>>>>>>>>>>>> from a
> >>>>>>>>>>>>> full list. IMO controls should stick to the bitstream as close
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>> possible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For Hantro and RKVDEC, the list of modification is parsed by the
> >>>>>>>>>>>> IP
> >>>>>>>>>>>> from the slice header bits. Just to make sure, because I myself
> >>>>>>>>>>>> was
> >>>>>>>>>>>> confused on this before, the slice header does not contain a list
> >>>>>>>>>>>> of
> >>>>>>>>>>>> references, instead it contains a list modification to be applied
> >>>>>>>>>>>> to
> >>>>>>>>>>>> the reference list. I need to check again, but to execute these
> >>>>>>>>>>>> modification, you need to filter and sort the references in a
> >>>>>>>>>>>> specific
> >>>>>>>>>>>> order. This should be what is defined in the spec as 8.2.4.1 and
> >>>>>>>>>>>> 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The list of references is deduced from the DPB. The DPB, which I
> >>>>>>>>>>>> thinks
> >>>>>>>>>>>> should be rename as "references", seems more useful then p/b0/b1,
> >>>>>>>>>>>> since
> >>>>>>>>>>>> this is the data that gives use the ability to implementing glue
> >>>>>>>>>>>> in the
> >>>>>>>>>>>> driver to compensate some HW differences.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the case of Hantro / RKVDEC, we think it's natural to build the
> >>>>>>>>>>>> HW
> >>>>>>>>>>>> specific lists (p/b0/b1) from the references rather then adding HW
> >>>>>>>>>>>> specific list in the decode_params structure. The fact these lists
> >>>>>>>>>>>> are
> >>>>>>>>>>>> standard intermediate step of the standard is not that important.
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry I got confused (once more) about it. Boris just explained the
> >>>>>>>>>>> same thing to me over IRC :) Anyway my point is that we want to pass
> >>>>>>>>>>> what's in ffmpeg's short and long term ref lists, and name them that
> >>>>>>>>>>> instead of dpb.
> >>>>>>>>>>>
> >>>>>>>>>>>>>> Now, this is just a start. For RK3399, we have a different
> >>>>>>>>>>>>>> CODEC
> >>>>>>>>>>>>>> design. This one does not have the start_code_e bit. What the
> >>>>>>>>>>>>>> IP does,
> >>>>>>>>>>>>>> is that you give it one or more slice per buffer, setup the
> >>>>>>>>>>>>>> params,
> >>>>>>>>>>>>>> start decoding, but the decoder then return the location of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> following NAL. So basically you could offload the scanning of
> >>>>>>>>>>>>>> start
> >>>>>>>>>>>>>> code to the HW. That being said, with the driver layer in
> >>>>>>>>>>>>>> between, that
> >>>>>>>>>>>>>> would be amazingly inconvenient to use, and with Boyer-more
> >>>>>>>>>>>>>> algorithm,
> >>>>>>>>>>>>>> it is pretty cheap to scan this type of start-code on CPU. But
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> feature that this allows is to operate in frame mode. In this
> >>>>>>>>>>>>>> mode, you
> >>>>>>>>>>>>>> have 1 interrupt per frame.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not sure there is any interest in exposing that from
> >>>>>>>>>>>>> userspace and
> >>>>>>>>>>>>> my current feeling is that we should just ditch support for
> >>>>>>>>>>>>> per-frame
> >>>>>>>>>>>>> decoding altogether. I think it mixes decoding with notions that
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>> higher-level than decoding, but I agree it's a blurry line.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not worried about this either. We can already support that by
> >>>>>>>>>>>> copying the bitstream internally to the driver, though zero-copy
> >>>>>>>>>>>> with
> >>>>>>>>>>>> this would require a new format, the one we talked about,
> >>>>>>>>>>>> SLICE_ANNEX_B.
> >>>>>>>>>>>
> >>>>>>>>>>> Right, but what I'm thinking about is making that the one and only
> >>>>>>>>>>> format. The rationale is that it's always easier to just append a
> >>>>>>>>>>> start
> >>>>>>>>>>> code from userspace if needed. And we need a bit offset to the slice
> >>>>>>>>>>> data part anyway, so it doesn't hurt to require a few extra bits to
> >>>>>>>>>>> have the whole thing that will work in every situation.
> >>>>>>>>>>
> >>>>>>>>>> What I'd like is to eventually allow zero-copy (aka userptr) into the
> >>>>>>>>>> driver. If you make the start code mandatory, any decoding from ISOMP4
> >>>>>>>>>> (.mp4, .mov) will require a full bitstream copy in userspace to add
> >>>>>>>>>> the
> >>>>>>>>>> start code (unless you hack your allocation in your demuxer, but it's
> >>>>>>>>>> a
> >>>>>>>>>> bit complicated since this code might come from two libraries). In
> >>>>>>>>>> ISOMP4, you have an AVC header, which is just the size of the NAL that
> >>>>>>>>>> follows.
> >>>>>>>>>
> >>>>>>>>> Well, I think we have to do a copy from system memory to the buffer
> >>>>>>>>> allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> >>>>>>>>> expected not to have any MMU unit and not allow sg import anyway.
> >>>>>>>>
> >>>>>>>> The Rockchip has an mmu. You need one copy at least indeed,
> >>>>>>>
> >>>>>>> Is the MMU in use currently? That can make things troublesome if we run
> >>>>>>> into a case where the VPU has MMU and deals with scatter-gather while
> >>>>>>> the display part doesn't. As far as I know, there's no way for
> >>>>>>> userspace to know whether a dma-buf-exported buffer is backed by CMA or
> >>>>>>> by scatter-gather memory. This feels like a major issue for using dma-
> >>>>>>> buf, since userspace can't predict whether a buffer exported on one
> >>>>>>> device can be imported on another when building its pipeline.
> >>>>>>
> >>>>>> FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> >>>>>> for it yet. It is supported for display, both VPUs and some other devices. I
> >>>>>> think no sane SoC designer would left out one or another unit without IOMMU
> >>>>>> support, that just calls for troubles, as you pointed out.
> >>>>>
> >>>>> Right right, I've been following that from a distance :)
> >>>>>
> >>>>> Indeed I think it's realistic to expect that for now, but it may not
> >>>>> play out so well in the long term. For instance, maybe connecting a USB
> >>>>> display would require CMA when the rest of the system can do with sg.
> >>>>>
> >>>>> I think it would really be useful for userspace to have a way to test
> >>>>> whether a buffer can be imported from one device to another. It feels
> >>>>> better than indicating where the memory lives, since there are
> >>>>> countless cases where additional restrictions apply too.
> >>>>
> >>>> I don't know for the integration on the Rockchip, but I did notice the
> >>>> register documentation for it.
> >>>
> >>> All the important components in the SoC have their IOMMUs as well -
> >>> display controller, GPU.
> >>>
> >>> There is a blitter called RGA that is not behind an IOMMU, but has
> >>> some scatter-gather capability (with a need for the hardware sg table
> >>> to be physically contiguous). 
> >>
> >> That's definitely good to know and justifies the need to introduce a way
> >> for userspace to check if a buffer can be imported from one device to
> >> another.
> > 
> > There's been a lot of discussion about this before. You may be aware of
> > James Jones' attempt to create an allocator library for this:
> > 
> > 	https://github.com/cubanismo/allocator
> > 
> > I haven't heard an update on this for quite some time and I think it's
> > stagnated due to a lack of interest. However, I think the lack of
> > interest could be an indicator that the issue might not be pressing
> > enough. Luckily most SoCs are reasonably integrated, so there's usually
> > no issue sharing buffers between different hardware blocks.
> > 
> > Technically it's already possible to check for compatibility of buffers
> > at import time.
> > 
> > In the tegra-vde driver we do something along the lines of:
> > 
> > 	sgt = dma_buf_map_attachment(...);
> > 	...
> > 	if (sgt->nents != 1)
> > 		return -EINVAL;
> > 
> > because we don't support an IOMMU currently. Of course its still up to
> > userspace to react to that in a sensible way and it may not be obvious
> > what to do when the import fails.
> > 
> >>> That said, significance of such blitters
> >>> nowadays is rather low, as most of the time you need a compositor on
> >>> the GPU anyway, which can do any transformation in the same pass as
> >>> the composition.
> >>
> >> I think that is a crucial mistake and the way I see things, this will
> >> have to change eventually. We cannot keep under-using the fast and
> >> efficient hardware components and going with the war machine that is
> >> the GPU in all situations. This has caused enough trouble in the
> >> GNU/Linux userspace display stack already and I strongly believe it has
> >> to stop.
> > 
> > Unfortunately there's really no good API to develop drivers against. All
> > of the 2D APIs that exist are not really efficient when implemented via
> > hardware-accelerated drivers. And none of the attempts at defining an
> > API for hardware-accelerated 2D have really gained any momentum.
> > 
> > I had looked a bit at ways to make use of some compositing hardware that
> > we have on Tegra (which is like a blender/blitter of a sort) and the
> > best thing I could find would've been to accelerate some paths in Mesa.
> > However that would require quite a bit of infrastructure work because it
> > currently completely relies on GPU shaders to accelerate those paths.
> > 
> > Daniel has written a very interesting bit about this, in case you
> > haven't seen it yet:
> > 
> > 	https://blog.ffwll.ch/2018/08/no-2d-in-drm.html
> > 
> >>>> In general, the most significant gain
> >>>> with having iommu for CODECs is that it makes start up (and re-init)
> >>>> time much shorter, but also in a much more predictable duration. I do
> >>>> believe that the Venus driver (qualcomm) is one with solid support for
> >>>> this, and it's quite noticeably more snappy than the others.
> >>>
> >>> Obviously you also get support for USERPTR if you have an IOMMU, but
> >>> that also has some costs - you need to pin the user pages and map to
> >>> the IOMMU before each frame and unmap and unpin after each frame,
> >>> which sometimes is more costly than actually having the userspace copy
> >>> to a preallocated and premapped buffer, especially for relatively
> >>> small contents, such as compressed bitstream.
> >>
> >> Heh, interesting point!
> > 
> > I share the same experience. Bitstream buffers are usually so small that
> > you can always find a physically contiguous memory region for them and a
> > memcpy() will be faster than the overhead of getting an IOMMU involved.
> > This obviously depends on the specific hardware, but there's always some
> > threshold before which mapping through an IOMMU just doesn't make sense
> > from a fragmentation and/or performance point of view.
> > 
> > I wonder, though, if it's not possible to keep userptr buffers around
> > and avoid the constant mapping/unmapping. If we only performed cache
> > maintenance on them as necessary, perhaps that could provide a viable,
> > maybe even good, zero-copy mechanism.
> 
> The vb2 framework will keep the mapping for a userptr as long as userspace
> uses the same userptr for every buffer.
> 
> I.e. the first time a buffer with index I is queued the userptr is mapped.
> If that buffer is later dequeued and then requeued again with the same
> userptr the vb2 core will reuse the old mapping. Otherwise it will unmap
> and map again with the new userptr.
> 
> The same is done for dmabuf, BTW. So if userspace keeps changing dmabuf
> fds for each buffer, then that is not optimal.

Right. That sounds like userptr could be made to be fairly efficient.
Still, given the small amount of data involved it may not be worth it
considering the extra requirement of needing an IOMMU.
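
As a rough sketch of what that looks like from the application side,
assuming one persistent allocation per buffer index so that vb2 can keep
reusing its mapping (buffer count and size are arbitrary):

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

#define NUM_BUFFERS	4
#define BUF_SIZE	(1 << 20)	/* arbitrary bitstream buffer size */

/* One allocation per buffer index, reused for the buffer's lifetime so
 * that the vb2 core keeps its existing userptr mapping. */
static void *bitstream[NUM_BUFFERS];

int queue_bitstream(int video_fd, unsigned int index,
		    const void *data, size_t size)
{
	struct v4l2_buffer buf = {
		.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
		.memory = V4L2_MEMORY_USERPTR,
		.index = index,
		.bytesused = size,
		.length = BUF_SIZE,
	};

	if (!bitstream[index] &&
	    posix_memalign(&bitstream[index], sysconf(_SC_PAGESIZE), BUF_SIZE))
		return -1;

	memcpy(bitstream[index], data, size);
	buf.m.userptr = (unsigned long)bitstream[index];

	return ioctl(video_fd, VIDIOC_QBUF, &buf);
}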

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  6:01         ` Tomasz Figa
@ 2019-05-22 18:15           ` Nicolas Dufresne
  0 siblings, 0 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-22 18:15 UTC (permalink / raw)
  To: Tomasz Figa, Paul Kocialkowski
  Cc: Linux Media Mailing List, Hans Verkuil, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Thierry Reding, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 3965 bytes --]

Le mercredi 22 mai 2019 à 15:01 +0900, Tomasz Figa a écrit :
> On Tue, May 21, 2019 at 8:45 PM Paul Kocialkowski
> <paul.kocialkowski@bootlin.com> wrote:
> > Hi,
> > 
> > On Tue, 2019-05-21 at 19:27 +0900, Tomasz Figa wrote:
> > > On Thu, May 16, 2019 at 2:43 AM Paul Kocialkowski
> > > <paul.kocialkowski@bootlin.com> wrote:
> > > > Hi,
> > > > 
> > > > Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > With the Rockchip stateless VPU driver in the works, we now have a
> > > > > > better idea of what the situation is like on platforms other than
> > > > > > Allwinner. This email shares my conclusions about the situation and how
> > > > > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > > > > 
> > > > > > - Per-slice decoding
> > > > > > 
> > > > > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > > > > to implement the required core bits. When we agree it looks good, we
> > > > > > should lift the restriction that all slices must be concatenated and
> > > > > > have them submitted as individual requests.
> > > > > > 
> > > > > > One question is what to do about other controls. I feel like it would
> > > > > > make sense to always pass all the required controls for decoding the
> > > > > > slice, including the ones that don't change across slices. But there
> > > > > > may be no particular advantage to this and only downsides. Not doing it
> > > > > > and relying on the "control cache" can work, but we need to specify
> > > > > > that only a single stream can be decoded per opened instance of the
> > > > > > v4l2 device. This is the assumption we're going with for handling
> > > > > > multi-slice anyway, so it shouldn't be an issue.
> > > > > 
> > > > > My opinion on this is that the m2m instance is a state, and the driver
> > > > > should be responsible of doing time-division multiplexing across
> > > > > multiple m2m instance jobs. Doing the time-division multiplexing in
> > > > > userspace would require some sort of daemon to work properly across
> > > > > processes. I also think the kernel is better place for doing resource
> > > > > access scheduling in general.
> > > > 
> > > > I agree with that yes. We always have a single m2m context and specific
> > > > controls per opened device so keeping cached values works out well.
> > > > 
> > > > So maybe we shall explicitly require that the request with the first
> > > > slice for a frame also contains the per-frame controls.
> > > > 
> > > 
> > > Agreed.
> > > 
> > > One more argument not to allow such multiplexing is that despite the
> 
> ^^ Here I meant the "userspace multiplexing".

Thanks, I was confused for a moment (especially since the browser is your
use case).

> 
> > > API being called "stateless", there is actually some state saved
> > > between frames, e.g. the Rockchip decoder writes some intermediate
> > > data to some local buffers which need to be given to the decoder to
> > > decode the next frame. Actually, on Rockchip there is even a
> > > requirement to keep the reference list entries in the same order
> > > between frames.
> > 
> > Well, what I'm suggesting is to have one stream per m2m context, but it
> > should certainly be possible to have multiple m2m contexts (multiple
> > userspace open calls) that decode different streams concurrently.
> > 
> > Is that really going to be a problem for Rockchip? If so, then the
> > driver should probably enforce allowing a single userspace open and m2m
> > context at a time.
> 
> No, that's not what I meant. Obviously the driver can switch between
> different sets of private buffers when scheduling different contexts,
> as long as the userspace doesn't attempt to do any multiplexing
> itself.
> 
> Best regards,
> Tomasz

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  8:20             ` Boris Brezillon
@ 2019-05-22 18:18               ` Nicolas Dufresne
  0 siblings, 0 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-22 18:18 UTC (permalink / raw)
  To: Boris Brezillon, Tomasz Figa
  Cc: Thierry Reding, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]

Le mercredi 22 mai 2019 à 10:20 +0200, Boris Brezillon a écrit :
> On Wed, 22 May 2019 09:29:24 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
> > On Wed, 22 May 2019 15:39:37 +0900
> > Tomasz Figa <tfiga@chromium.org> wrote:
> > 
> > > > It would be premature to state that we are excluding. We are just
> > > > trying to find one format to get things upstream, and make sure we have
> > > > a plan how to extend it. Trying to support everything on the first try
> > > > is not going to work so well.
> > > > 
> > > > What is interesting to provide is how your IP achieves multi-slice
> > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > Typical questions are:
> > > > 
> > > >   1. Do all slices have to be contiguous in memory
> > > >   2. If 1., do you place a start-code, an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > > >   3. Does the HW support a single interrupt per frame (RK3288 as an example does not, but RK3399 does)
> > > 
> > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > driver that was created mostly by RK themselves, we've been assuming
> > > that the interrupt is for the complete frame, without any problems.  
> > 
> > I confirm that's what happens when all slices forming a frame are packed
> > in a single output buffer: you only get one interrupt at the end of the
> > decoding process (in that case, when the frame is decoded). Of course,
> > if you split things up and do per-slice decoding instead (one slice per
> > buffer) you get an interrupt per slice, though I didn't manage to make
> > that work.
> > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > interrupt on the second slice, which makes me think some states are
> > reset between the 2 operations leading the engine to think that the
> > second slice is part of a new frame.
> > 
> > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > and per-frame decoding and maybe have a way to expose what a
> > specific codec can do (through an extra cap mechanism).
> > The other option would be to support only per-slice decoding with a
> > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > only support per-frame decoding know when they should trigger the
> > decoding operation.
> 
> Just to clarify, we can use Hans' V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF
> work to identify start/end frame boundaries, the only problem I see is
> that users are not required to clear the flag on the last slice of a
> frame, so there's no way for the driver to know when it should trigger
> the decode-frame operation. I guess we could trigger this decode
> operation when v4l2_m2m_release_capture_buf() returns true, but I
> wonder if it's not too late to do that.

If the flag is gone, you can schedule immediately; otherwise you'll know
by the timestamp change on the following slice.
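
Something along those lines on the driver side, as a sketch only: the
context layout and the two helpers are made up, and the hold flag is the
one from Hans' series:

#include <media/videobuf2-v4l2.h>

/* Hypothetical per-context state for a decoder that can only operate
 * per-frame: slices are accumulated until a boundary is detected. */
struct my_ctx {
	u64 pending_timestamp;
	unsigned int pending_slices;
};

/* Placeholders for driver-specific code. */
void append_to_bounce_buffer(struct my_ctx *ctx, struct vb2_v4l2_buffer *buf);
void trigger_frame_decode(struct my_ctx *ctx);

static void my_queue_slice(struct my_ctx *ctx, struct vb2_v4l2_buffer *slice)
{
	/* A new timestamp means the previous frame is complete. */
	if (ctx->pending_slices &&
	    slice->vb2_buf.timestamp != ctx->pending_timestamp) {
		trigger_frame_decode(ctx);
		ctx->pending_slices = 0;
	}

	append_to_bounce_buffer(ctx, slice);
	ctx->pending_timestamp = slice->vb2_buf.timestamp;
	ctx->pending_slices++;

	/* Hold flag cleared: userspace marked this as the last slice. */
	if (!(slice->flags & V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF)) {
		trigger_frame_decode(ctx);
		ctx->pending_slices = 0;
	}
}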

> 
> > The downside is that it implies having a bounce
> > buffer where the driver can pack slices to be decoded on the END_FRAME
> > event.
> > 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22  9:29               ` Paul Kocialkowski
  2019-05-22 11:39                 ` Thierry Reding
@ 2019-05-22 18:26                 ` Nicolas Dufresne
  1 sibling, 0 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-22 18:26 UTC (permalink / raw)
  To: Paul Kocialkowski, Thierry Reding, Boris Brezillon
  Cc: Tomasz Figa, Linux Media Mailing List, Hans Verkuil,
	Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 9076 bytes --]

Le mercredi 22 mai 2019 à 11:29 +0200, Paul Kocialkowski a écrit :
> Le mercredi 22 mai 2019 à 10:32 +0200, Thierry Reding a écrit :
> > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > On Wed, 22 May 2019 15:39:37 +0900
> > > Tomasz Figa <tfiga@chromium.org> wrote:
> > > 
> > > > > It would be premature to state that we are excluding. We are just
> > > > > trying to find one format to get things upstream, and make sure we have
> > > > > a plan how to extend it. Trying to support everything on the first try
> > > > > is not going to work so well.
> > > > > 
> > > > > What is interesting to provide is how your IP achieves multi-slice
> > > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > > Typical questions are:
> > > > > 
> > > > >   1. Do all slices have to be contiguous in memory
> > > > >   2. If 1., do you place a start-code, an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > > > >   3. Does the HW support a single interrupt per frame (RK3288 as an example does not, but RK3399 does)
> > > > 
> > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > > driver that was created mostly by RK themselves, we've been assuming
> > > > that the interrupt is for the complete frame, without any problems.
> > > 
> > > I confirm that's what happens when all slices forming a frame are packed
> > > in a single output buffer: you only get one interrupt at the end of the
> > > decoding process (in that case, when the frame is decoded). Of course,
> > > if you split things up and do per-slice decoding instead (one slice per
> > > buffer) you get an interrupt per slice, though I didn't manage to make
> > > that work.
> > > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > > interrupt on the second slice, which makes me think some states are
> > > reset between the 2 operations leading the engine to think that the
> > > second slice is part of a new frame.
> > 
> > That sounds a lot like how this works on Tegra. My understanding is that
> > for slice decoding you'd also get an interrupt every time a full slice
> > has been decoded perhaps coupled with another "frame done" interrupt
> > when the full frame has been decoded after the last slice.
> > 
> > In frame-level decode mode you don't get interrupts in between and
> > instead only get the "frame done" interrupt. Unless something went wrong
> > during decoding, in which case you also get an interrupt but with error
> > flags and status registers that help determine what exactly happened.
> > 
> > > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > > and per-frame decoding and maybe have a way to expose what a
> > > specific codec can do (through an extra cap mechanism).
> > 
> > Yeah, I think it makes sense to support both for devices that can do
> > both. From what Nicolas said it may make sense for an application to
> > want to do slice-level decoding if receiving a stream from the network
> > and frame-level decoding if playing back from a local file. If a driver
> > supports both, the application could detect that and choose the
> > appropriate format.
> > 
> > It sounds to me like using different input formats for that would be a
> > very natural way to describe it. Applications can already detect the set
> > of supported input formats and set the format when they allocate buffers
> > so that should work very nicely.
> 
> Pixel formats are indeed the natural way to go about this, but I have
> some reservations in this case. Slices are the natural unit of video
> streams, just like frames are to display hardware. Part of the pipeline
> configuration is slice-specific, so in theory, the pipeline needs to be
> reconfigured with each slice.
> 
> What we have been doing in Cedrus is to currently gather all the slices
> and use the last slice's specific configuration for the pipeline, which
> sort of works, but is very likely not a good idea.
> 
> You mentioned that the Tegra VPU currently always operates in frame
> mode (even when the stream actually has multiple slices, which I assume
> are gathered at some point). I wonder how it goes about configuring
> different slice parameters (which are specific to each slice, not
> frame) for the different slices. 

A per-frame CODEC won't ask for the l0/l1 list, which is slice-specific.
This is the case for the RK3288; we don't pass that information.
Instead we build a list from the DPB entries, which is the list before
applying the modifications found in the slice header. The HW will
do the rest.
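
As an illustration of the kind of book-keeping this implies, here is a
heavily simplified sketch of the initial P-slice list built from such
entries (frame coding only, ignoring frame_num wrapping and the
B-slice/POC ordering; the dpb_entry layout is invented):

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference entry as a driver might track it. */
struct dpb_entry {
	unsigned int buf_index;		/* capture buffer backing the ref */
	unsigned int frame_num;
	unsigned int long_term_idx;
	bool long_term;
	bool active;
};

/* 8.2.4.2-style initial list for a P slice: short-term references by
 * descending frame_num, then long-term references by ascending index.
 * Slice-header modifications, if any, are applied on top of this. */
static int cmp_p_refs(const void *pa, const void *pb)
{
	const struct dpb_entry *a = pa, *b = pb;

	if (a->long_term != b->long_term)
		return a->long_term ? 1 : -1;
	if (a->long_term)
		return (int)a->long_term_idx - (int)b->long_term_idx;
	return (int)b->frame_num - (int)a->frame_num;
}

size_t build_p_ref_list(const struct dpb_entry *dpb, size_t dpb_size,
			struct dpb_entry *list)
{
	size_t i, n = 0;

	for (i = 0; i < dpb_size; i++)
		if (dpb[i].active)
			list[n++] = dpb[i];

	qsort(list, n, sizeof(*list), cmp_p_refs);
	return n;
}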

> 
> I believe we should at least always expose per-slice granularity in the
> pixel format and requests. Maybe we could have a way to allow multiple
> slices to be gathered in the source buffer and have a control slice
> array for each request. In that case, we'd have a single request queued
> for the series of slices, with a bit offset in each control to the
> matching slice.
> 
> Then we could specify that such slices must be appended in a way that
> suits most decoders that would have to operate per-frame (so we need to
> figure this out) and worst case, we'll always have offsets in the
> controls if we need to setup a bounce buffer in the driver because
> things are not laid out the way we specified.
> 
> Then we introduce a specific cap to indicate which mode is supported
> (per-slice and/or per-frame) and adapt our ffmpeg reference to be able
> to operate in both modes.
> 
> That adds some complexity for userspace, but I don't think we can avoid
> it at this point and it feels better than having two different pixel
> formats (which would probably be even more complex to manage for
> userspace).
> 
> What do you think?
> 
> > > The other option would be to support only per-slice decoding with a
> > > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > > only support per-frame decoding know when they should trigger the
> > > decoding operation. The downside is that it implies having a bounce
> > > buffer where the driver can pack slices to be decoded on the END_FRAME
> > > event.
> > 
> > I vaguely remember that that's what the video codec abstraction does in
> > Mesa/Gallium. 
> 
> Well, if it's exposed through VDPAU or VAAPI, the interface already
> operates per-slice and it would certainly not be a big issue to change
> that.

VDPAU seems to be per-frame to me (I have only read that API once,
recently). I believe this is the main difference between the two. But
most VDPAU drivers need to do their own final bit of parsing. VAAPI has
a start/end call, so with a bounce buffer you can implement the other
way too in your driver. But then the downside is that userspace may be
doing parsing that won't be used by the driver.

> 
> Talking about the mesa/gallium video decoding stuff, I think it would
> be worth having V4L2 interfaces for that now that we have the Request
> API.
> 
> Basically, Nvidia GPUs have video decoding blocks (which could be
> similar to the ones present on Tegra) that are accessed through a
> firmware running on a Falcon MCU on the GPU side.
> 
> Having a standardized firmware interface for these and a V4L2 M2M
> driver for the interface would certainly make it easier for everyone to
> handle that. I don't really see why this video decoding hardware has
> to be exposed through the display stack anyway, and one could want to
> use the GPU's video decoder without bringing up the shading cores.
> 
> > I'm not very familiar with V4L2, but this seems like it
> > could be problematic to integrate with the way that V4L2 works in
> > general. Perhaps sending a special buffer (0 length or whatever) to mark
> > the end of a frame would work. But this is probably something that
> > others have already thought about, since slice-level decoding is what
> > most people are using, hence there must already be a way for userspace
> > to somehow synchronize input vs. output buffers. Or does this currently
> > just work by queueing bitstream buffers as fast as possible and then
> > dequeueing frame buffers as they become available?
> 
> We have a Request API mechanism where we group controls (parsed
> bitstream meta-data) and source (OUTPUT) buffers together and submit
> them tied. When each request gets processed its buffer enters the
> OUTPUT queue, which gets picked up by the driver and associated with
> the first destination (CAPTURE) buffer available. Then the driver grabs
> the buffers and applies the controls matching the source buffer's
> request before starting decoding with M2M.
> 
> We have already worked on handling the case of requiring a single
> destination buffer for the different slices, by having a flag to
> indicate whether the destination buffer should be held.
> 
> Cheers,
> 
> Paul
> 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22 11:39                 ` Thierry Reding
@ 2019-05-22 18:31                   ` Nicolas Dufresne
  0 siblings, 0 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-22 18:31 UTC (permalink / raw)
  To: Thierry Reding, Paul Kocialkowski
  Cc: Boris Brezillon, Tomasz Figa, Linux Media Mailing List,
	Hans Verkuil, Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 14160 bytes --]

Le mercredi 22 mai 2019 à 13:39 +0200, Thierry Reding a écrit :
> On Wed, May 22, 2019 at 11:29:13AM +0200, Paul Kocialkowski wrote:
> > Le mercredi 22 mai 2019 à 10:32 +0200, Thierry Reding a écrit :
> > > On Wed, May 22, 2019 at 09:29:24AM +0200, Boris Brezillon wrote:
> > > > On Wed, 22 May 2019 15:39:37 +0900
> > > > Tomasz Figa <tfiga@chromium.org> wrote:
> > > > 
> > > > > > It would be premature to state that we are excluding. We are just
> > > > > > trying to find one format to get things upstream, and make sure we have
> > > > > > a plan how to extend it. Trying to support everything on the first try
> > > > > > is not going to work so well.
> > > > > > 
> > > > > > What is interesting to provide is how your IP achieves multi-slice
> > > > > > decoding per frame. That's what we are studying on the RK/Hantro chip.
> > > > > > Typical questions are:
> > > > > > 
> > > > > >   1. Do all slices have to be contiguous in memory
> > > > > >   2. If 1., do you place a start-code, an AVC header, or pass a separate index to let the HW locate the start of each NAL?
> > > > > >   3. Does the HW support a single interrupt per frame (RK3288 as an example does not, but RK3399 does)
> > > > > 
> > > > > AFAICT, the bit about RK3288 isn't true. At least in our downstream
> > > > > driver that was created mostly by RK themselves, we've been assuming
> > > > > that the interrupt is for the complete frame, without any problems.
> > > > 
> > > > I confirm that's what happens when all slices forming a frame are packed
> > > > in a single output buffer: you only get one interrupt at the end of the
> > > > decoding process (in that case, when the frame is decoded). Of course,
> > > > if you split things up and do per-slice decoding instead (one slice per
> > > > buffer) you get an interrupt per slice, though I didn't manage to make
> > > > that work.
> > > > I get a DEC_BUFFER interrupt (AKA, "buffer is empty but frame is not
> > > > fully decoded") on the first slice and an ASO (Arbitrary Slice Ordering)
> > > > interrupt on the second slice, which makes me think some states are
> > > > reset between the 2 operations leading the engine to think that the
> > > > second slice is part of a new frame.
> > > 
> > > That sounds a lot like how this works on Tegra. My understanding is that
> > > for slice decoding you'd also get an interrupt every time a full slice
> > > has been decoded perhaps coupled with another "frame done" interrupt
> > > when the full frame has been decoded after the last slice.
> > > 
> > > In frame-level decode mode you don't get interrupts in between and
> > > instead only get the "frame done" interrupt. Unless something went wrong
> > > during decoding, in which case you also get an interrupt but with error
> > > flags and status registers that help determine what exactly happened.
> > > 
> > > > Anyway, it doesn't sound like a crazy idea to support both per-slice
> > > > and per-frame decoding and maybe have a way to expose what a
> > > > specific codec can do (through an extra cap mechanism).
> > > 
> > > Yeah, I think it makes sense to support both for devices that can do
> > > both. From what Nicolas said it may make sense for an application to
> > > want to do slice-level decoding if receiving a stream from the network
> > > and frame-level decoding if playing back from a local file. If a driver
> > > supports both, the application could detect that and choose the
> > > appropriate format.
> > > 
> > > It sounds to me like using different input formats for that would be a
> > > very natural way to describe it. Applications can already detect the set
> > > of supported input formats and set the format when they allocate buffers
> > > so that should work very nicely.
> > 
> > Pixel formats are indeed the natural way to go about this, but I have
> > some reservations in this case. Slices are the natural unit of video
> > streams, just like frames are to display hardware. Part of the pipeline
> > configuration is slice-specific, so in theory, the pipeline needs to be
> > reconfigured with each slice.
> > 
> > What we have been doing in Cedrus is to currently gather all the slices
> > and use the last slice's specific configuration for the pipeline, which
> > sort of works, but is very likely not a good idea.
> 
> To be honest, my testing has been very minimal, so it's quite possible
> that I've always only run into examples with either only a single slice
> or multiple slices with the same configuration. Or perhaps with
> differing configurations but non-significant (or non-noticeable)
> differences.
> 
> > You mentioned that the Tegra VPU currently always operates in frame
> > mode (even when the stream actually has multiple slices, which I assume
> > are gathered at some point). I wonder how it goes about configuring
> > different slice parameters (which are specific to each slice, not
> > frame) for the different slices.
> 
> That's part of the beauty of the frame-level decoding mode (I think
> that's called SXE-P). The syntax engine has access to the complete
> bitstream and can parse all the information that it needs. There's some
> data that we pass into the decoder from the SPS and PPS, but other than
> that the VDE will do everything by itself.
> 
> > I believe we should at least always expose per-slice granularity in the
> > pixel format and requests. Maybe we could have a way to allow multiple
> > slices to be gathered in the source buffer and have a control slice
> > array for each request. In that case, we'd have a single request queued
> > for the series of slices, with a bit offset in each control to the
> > matching slice.
> > 
> > Then we could specify that such slices must be appended in a way that
> > suits most decoders that would have to operate per-frame (so we need to
> > figure this out) and worst case, we'll always have offsets in the
> > controls if we need to setup a bounce buffer in the driver because
> > things are not laid out the way we specified.
> > 
> > Then we introduce a specific cap to indicate which mode is supported
> > (per-slice and/or per-frame) and adapt our ffmpeg reference to be able
> > to operate in both modes.
> > 
> > That adds some complexity for userspace, but I don't think we can avoid
> > it at this point and it feels better than having two different pixel
> > formats (which would probably be even more complex to manage for
> > userspace).
> > 
> > What do you think?
> 
> I'm not sure I understand why this would be simpler than exposing two
> different pixel formats. It sounds like essentially the same thing, just
> with a different method.
> 
> One advantage I see with your approach is that it more formally defines
> how slices are passed. This might be a good thing to do anyway. I'm not
> sure if software stacks already provide that information. If they do this
> would be trivial to achieve. If they don't, this could be an extra burden
> on userspace for decoders that don't need it.

Just to feed the discussion, in GStreamer it would be exposed like this
(except that this is full bitstream, not just slices):

/* FULL Frame */
video/x-h264,stream-format=byte-stream,alignment=au

/* One or more NALs per memory buffer */
video/x-h264,stream-format=byte-stream,alignment=nal

"stream-format=byte-stream" means with start-code, where you could AVC
or AVC3 bitstream too. We do that, so you have a common format, with
variant. I'm worried having too many formats will not scale in the long
term, that's all, I still think this solution works too. But note that
we already have _H264 and _H264_NOSC format. And then, how do you call
a stream that only has slice nals, but all all slice of a frame per
buffer ...
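
For reference, going from the AVC form back to byte-stream is mechanical
enough that userspace can do it on the fly. A minimal sketch, assuming
4-byte NAL length fields (as signalled in avcC) and no error handling:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static const uint8_t start_code[4] = { 0x00, 0x00, 0x00, 0x01 };

/* Rewrite an AVC (length-prefixed) access unit as Annex B by replacing
 * each 4-byte length field with a start code; the output is the same
 * size as the input. */
size_t avc_to_annexb(const uint8_t *in, size_t in_len, uint8_t *out)
{
	size_t ipos = 0, opos = 0;

	while (ipos + 4 <= in_len) {
		uint32_t nal_len = ((uint32_t)in[ipos] << 24) |
				   ((uint32_t)in[ipos + 1] << 16) |
				   ((uint32_t)in[ipos + 2] << 8) |
				    (uint32_t)in[ipos + 3];
		ipos += 4;
		if (nal_len > in_len - ipos)
			break;

		memcpy(out + opos, start_code, sizeof(start_code));
		opos += sizeof(start_code);
		memcpy(out + opos, in + ipos, nal_len);
		opos += nal_len;
		ipos += nal_len;
	}

	return opos;
}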

p.s. In Tegra OMX, there is a control to pick between AU/NAL, so I'm
pretty sure the HW supports both ways.

> 
> Would it perhaps be possible to make this slice meta data optional? For
> example, could we just provide an H.264 slice pixel format and then let
> userspace fill in buffers in whatever way they want, provided that they
> follow some rules (must be annex B or something else, concatenated
> slices, ...) and then if there's an extra control specifying the offsets
> of individual slices drivers can use that, if not they just pass the
> bitstream buffer to the hardware if frame-level decoding is supported
> and let the hardware do its thing?
> 
> Hardware that has requirements different from that could require the
> meta data to be present and fail otherwise.
> 
> On the other hand, userspace would have to be prepared to deal with this
> type of hardware anyway, so it basically needs to provide the meta data
> in any case. Perhaps the meta data could be optional if a buffer
> contains a single slice.
> 
> One other thing that occurred to me is that the meta data could perhaps
> contain a more elaborate description of the data in the slice. But that
> has the problem that it can't be detected upfront, so userspace can't
> discover whether the decoder can handle that data until an error is
> returned from the decoder upon receiving the meta data.
> 
> To answer your question: I don't feel strongly one way or the other. The
> above is really just discussing the specifics of how the data is passed,
> but we don't really know what exactly the data is that we need to pass.
> 
> > > > The other option would be to support only per-slice decoding with a
> > > > mandatory START_FRAME/END_FRAME sequence to let drivers for HW that
> > > > only support per-frame decoding know when they should trigger the
> > > > decoding operation. The downside is that it implies having a bounce
> > > > buffer where the driver can pack slices to be decoded on the END_FRAME
> > > > event.
> > > 
> > > I vaguely remember that that's what the video codec abstraction does in
> > > Mesa/Gallium. 
> > 
> > Well, if it's exposed through VDPAU or VAAPI, the interface already
> > operates per-slice and it would certainly not be a big issue to change
> > that.
> 
> The video pipe callbacks can implement a ->decode_bitstream() callback
> that gets a number of buffer/size pairs along with a picture description
> (which corresponds roughly to the SPS/PPS). The buffer/size pairs are
> exactly what's passed in from VDPAU or VAAPI. It looks like VDPAU can
> pass multiple slices, one per VdpBitstreamBuffer, whereas VAAPI passes
> only a single buffer at a time at the driver level.
> 
> (Interesting side-note: VDPAU seems to require the start code to be part
> of the bitstream, whereas the VAAPI state tracker in Mesa will go and
> check whether a buffer contains the start code and prepend it via SG if
> not. So at the pipe_video_codec level it seems the decision was made to
> use annex B as the lowest common denominator).
> 
> > Talking about the mesa/gallium video decoding stuff, I think it would
> > be worth having V4L2 interfaces for that now that we have the Request
> > API.
> 
> Yeah, I think that'd be nice, but I'm not sure that you're going to find
> someone to redo all the work...
> 
> > Basically, Nvidia GPUs have video decoding blocks (which could be
> > similar to the ones present on Tegra) that are accessed through a
> > firmware running on a Falcon MCU on the GPU side.
> 
> Yeah, the video decoding blocks on GPUs are very similar to the ones
> found on more recent Tegra. The big difference, of course, is that on
> Tegra they are separate (platform) devices, whereas on the GPU they are
> part of the PCI device's register space. It'd be nice if we could
> somehow share drivers between the two, but I'm not sure that that's
> possible. Besides the different bus there are also differences in how
> memory is managed (video RAM on GPU vs. system memory on Tegra) and so
> on.
> 
> > Having a standardized firmware interface for these and a V4L2 M2M
> > driver for the interface would certainly make it easier for everyone to
> > handle that. I don't really see why this video decoding hardware has
> > to be exposed through the display stack anyway and one could want to
> > use the GPU's video decoder without bringing up the shading cores.
> 
> Are you saying that it might be possible to structure this as basically
> two "backend" drivers that each expose the command stream interface and
> then build a "frontend" driver that could talk to either backend? That
> sounds like a really nice idea, but I'm not sure that it'd work.
> 
> > > I'm not very familiar with V4L2, but this seems like it
> > > could be problematic to integrate with the way that V4L2 works in
> > > general. Perhaps sending a special buffer (0 length or whatever) to mark
> > > the end of a frame would work. But this is probably something that
> > > others have already thought about, since slice-level decoding is what
> > > most people are using, hence there must already be a way for userspace
> > > to somehow synchronize input vs. output buffers. Or does this currently
> > > just work by queueing bitstream buffers as fast as possible and then
> > > dequeueing frame buffers as they become available?
> > 
> > We have a Request API mechanism where we group controls (parsed
> > bitstream meta-data) and source (OUTPUT) buffers together and submit
> > them tied. When each request gets processed its buffer enters the
> > OUTPUT queue, which gets picked up by the driver and associated with
> > the first destination (CAPTURE) buffer available. Then the driver grabs
> > the buffers and applies the controls matching the source buffer's
> > request before starting decoding with M2M.
> > 
> > We have already worked on handling the case of requiring a single
> > destination buffer for the different slices, by having a flag to
> > indicate whether the destination buffer should be held.
> 
> Right. So sounds like the request is the natural boundary here. I guess
> that would allow drivers to manually concatenate accumulated bitstream
> buffers into a single one.
> 
> Thierry

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22 10:08         ` Thierry Reding
@ 2019-05-22 18:37           ` Nicolas Dufresne
  0 siblings, 0 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-05-22 18:37 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Paul Kocialkowski, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 752 bytes --]

Le mercredi 22 mai 2019 à 12:08 +0200, Thierry Reding a écrit :
> >   3. Does the HW do support single interrupt per frame (RK3288 as an example does not, but RK3399 do)
> 
> Yeah, we definitely do get a single interrupt at the end of a frame, or
> when an error occurs. Looking a bit at the register documentation it
> looks like this can be more fine-grained. We can for example get an
> interrupt at the end of a slice or a row of macro blocks.

This last one is really fancy. I've been working on some HW where they
do synchronization between the decoder and the encoder so they process
data with a one-macroblock distance. I know Chips&Media have a similar
feature, and now Tegra too; it would be nice to find some convergence on
this in the future.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 10:09 Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support Paul Kocialkowski
  2019-05-15 14:42 ` Nicolas Dufresne
@ 2019-05-23 21:04 ` Jonas Karlman
  2019-06-03 11:24 ` Thierry Reding
  2 siblings, 0 replies; 55+ messages in thread
From: Jonas Karlman @ 2019-05-23 21:04 UTC (permalink / raw)
  To: Paul Kocialkowski, Linux Media Mailing List
  Cc: Hans Verkuil, Tomasz Figa, Nicolas Dufresne, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Thierry Reding, Jernej Skrabec,
	Ezequiel Garcia

On 2019-05-15 12:09, Paul Kocialkowski wrote:
> Hi,
>
> With the Rockchip stateless VPU driver in the works, we now have a
> better idea of what the situation is like on platforms other than
> Allwinner. This email shares my conclusions about the situation and how
> we should update the MPEG-2, H.264 and H.265 controls accordingly.
>
> [...]
>
> - Clear split of controls and terminology
>
> Some codecs have explicit NAL units that are good fits to match as
> controls: e.g. slice header, pps, sps. I think we should stick to the
> bitstream element names for those.
>
> For H.264, that would suggest the following changes:
> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> - killing v4l2_ctrl_h264_decode_param and having the reference lists
> where they belong, which seems to be slice_header;

I have two more changes and/or clarifications that are needed for v4l2_ctrl_h264_scaling_matrix:
the expected order of scaling_list elements needs to be defined and documented.

In the cedrus driver the expected order of elements is after the inverse scanning process has been applied.
This is the order the hardware expects and what both ffmpeg uses internally and vaapi expects,
which allows for a simple memcpy/SRAM write in both userspace and the driver.

The rockchip vpu h264 driver from chromeos was expecting elements in scaling list order and would apply
the inverse zig-zag scan in the driver. Side note: it would also wrongly apply the zig-zag scan instead of the field scan on field-coded content.

I propose a clarification that the scaling list element order should be after the inverse scanning process has been applied,
the order that cedrus, rockchip and vaapi expect.
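
For illustration, a minimal userspace-side sketch of that conversion, assuming the standard 4x4 frame zig-zag scan
(field-coded content would use the field scan instead, and the function name is made up for the example):

#include <stdint.h>

/* 4x4 frame zig-zag scan: entry i is the raster position of the i-th
 * element in bitstream (scaling_list) scan order. */
static const uint8_t zigzag_scan_4x4[16] = {
	0,  1,  4,  8,
	5,  2,  3,  6,
	9, 12, 13, 10,
	7, 11, 14, 15,
};

/* Convert one 4x4 scaling list from bitstream scan order to the raster
 * order produced by the inverse scanning process; the 8x8 lists work the
 * same way with the 64-entry scan table. */
static void scaling_list_4x4_to_raster(const uint8_t scan[16], uint8_t raster[16])
{
	unsigned int i;

	for (i = 0; i < 16; i++)
		raster[zigzag_scan_4x4[i]] = scan[i];
}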

Secondly, the order of the six scaling_list_8x8 lists currently uses "ffmpeg order", where Intra Y is in [0] and Inter Y in [3].
Table 7-2 in the H.264 specification lists them in the following order (index 6-11): Intra Y, Inter Y, Intra Cb, Inter Cb, Intra Cr and Inter Cr.
The 8x8 Cb/Cr lists should only be needed for 4:4:4 content.

Rockchip was expecting Intra/Inter Y to be in [0] and [1], while cedrus uses lists [0] and [3].
VA-API only seems to support Intra/Inter Y; the ffmpeg vaapi hwaccel copies [0] and [3] into vaapi [0] and [1].

I propose a clarification that the 8x8 scaling lists use the same order as they are listed in Table 7-2,
and that the cedrus driver is changed to use the 8x8 lists from [0] and [1] instead of [0] and [3].

Regards,
Jonas

> I'm up for preparing and submitting these control changes and updating
> cedrus if they seem agreeable.
>
> What do you think?
>
> Cheers,
>
> Paul

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-15 10:09 Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support Paul Kocialkowski
  2019-05-15 14:42 ` Nicolas Dufresne
  2019-05-23 21:04 ` Jonas Karlman
@ 2019-06-03 11:24 ` Thierry Reding
  2019-06-03 18:52   ` Nicolas Dufresne
  2 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-06-03 11:24 UTC (permalink / raw)
  To: Paul Kocialkowski
  Cc: Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Nicolas Dufresne, Alexandre Courbot, Boris Brezillon,
	Maxime Ripard, Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 10671 bytes --]

On Wed, May 15, 2019 at 12:09:45PM +0200, Paul Kocialkowski wrote:
> Hi,
> 
> With the Rockchip stateless VPU driver in the works, we now have a
> better idea of what the situation is like on platforms other than
> Allwinner. This email shares my conclusions about the situation and how
> we should update the MPEG-2, H.264 and H.265 controls accordingly.
> 
> - Per-slice decoding
> 
> We've discussed this one already[0] and Hans has submitted a patch[1]
> to implement the required core bits. When we agree it looks good, we
> should lift the restriction that all slices must be concatenated and
> have them submitted as individual requests.
> 
> One question is what to do about other controls. I feel like it would
> make sense to always pass all the required controls for decoding the
> slice, including the ones that don't change across slices. But there
> may be no particular advantage to this and only downsides. Not doing it
> and relying on the "control cache" can work, but we need to specify
> that only a single stream can be decoded per opened instance of the
> v4l2 device. This is the assumption we're going with for handling
> multi-slice anyway, so it shouldn't be an issue.
> 
> - Annex-B formats
> 
> I don't think we have really reached a conclusion on the pixel formats
> we want to expose. The main issue is how to deal with codecs that need
> the full slice NALU with start code, where the slice_header is
> duplicated in raw bitstream, when others are fine with just the encoded
> slice data and the parsed slice header control.
> 
> My initial thinking was that we'd need 3 formats:
> - One that only takes only the slice compressed data (without raw slice
> header and start code);
> - One that takes both the NALU data (including start code, raw header
> and compressed data) and slice header controls;
> - One that takes the NALU data but no slice header.
> 
> But I no longer think the latter really makes sense in the context of
> stateless video decoding.
> 
> A side-note: I think we should definitely have data offsets in every
> case, so that implementations can just push the whole NALU regardless
> of the format if they're lazy.

I started an NVIDIA internal discussion about this to get some thoughts
from our local experts and to fill in my gaps in understanding of NVIDIA
hardware that we might want to support.

As far as input format goes, there was pretty broad consensus that in
order for the ABI to be most broadly useful we need to settle on the
lowest common denominator, while drawing some inspiration from existing
APIs because they've already gone through a lot of these discussions and
came up with standard interfaces to deal with the differences between
decoders.

In more concrete terms this means that we'll want to provide as much
data to the kernel as possible. On one hand that means that we need to
do all of the header parsing etc. in userspace and pass it to the kernel
to support hardware that can't parse this data by itself. At the same
time we want to provide the full bitstream to the kernel to make sure
that hardware that does some (or all) of the parsing itself has access
to this. We absolutely want to avoid having to reconstruct some of the
bitstream that userspace may not have passed in order to optimize for
some use cases.

Also, all bitstream parsing should be done in userspace; we don't want
to require the kernel to have to deal with this. There's nothing in the
bitstream that would be hardware-specific, so it can all be done
perfectly fine in userspace.

As for an interface on what to pass along, most people suggested that we
pass both the raw bitstream along with a descriptor of what's contained
in that bitstream. That descriptor would contain the number of slices
contained in the bitstream chunk as well as per-slice data (such as the
offset in the bitstream chunk for that slice and the number/ID of the
slice). This is in addition to the extra meta data that we already pass
for the codecs (PPS, SPS, ...). The slice information would allow
drivers to point the hardware directly at the slice data if that's all
it needs, but it can also be used for error concealment if corrupted
slices are encountered. This would obviously require that controls can
be passed on a per-buffer basis. I'm not sure if that's possible since
the request API was already introduced to allow controls to be passed in
a more fine-grained manner than setting them globally. I'm not sure how
to pass per-buffer data in a nice way otherwise, but perhaps this is not
even a real problem?
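
To give a rough idea, such a descriptor could look like the sketch
below. These are purely illustrative structures for the sake of
discussion, not existing V4L2 controls, and the field names and the
fixed slice bound are assumptions:

#include <linux/types.h>

/* Illustrative only: a per-buffer descriptor of the slices contained in
 * one bitstream chunk, passed alongside the usual SPS/PPS controls. */
struct example_h264_slice_entry {
	__u32 offset;	/* byte offset of the slice NALU in the OUTPUT buffer */
	__u32 size;	/* size of the slice in bytes */
	__u32 slice_id;	/* e.g. first_mb_in_slice, to identify the slice */
};

struct example_h264_slice_descriptor {
	__u32 num_slices;
	struct example_h264_slice_entry slices[16];	/* arbitrary bound */
};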

The above is pretty close to the Mesa pipe_video infrastructure as well
as VAAPI and VDPAU, so I would expect most userspace to be able to deal
well with such an ABI.

Userspace applications would decide what the appropriate way is to pass
bitstream data. Network streaming applications may decide to send slices
individually, so number of slices = 1 and the descriptor would contain a
single entry for only that slice. For file-based playback userspace may
decide to forward the complete bitstream for a full frame, in which case
the descriptor would contain entries for all slices that make up the
frame.

On the driver side, the above interface gives drivers the flexibility to
implement what works best for them. Decoders that support scatter-gather
can be programmed for each buffer that they receive. IOMMU-capable
decoders can use the IOMMU to achieve something similar. And for very
simple decoders, the driver can always decide to concatenate in the
kernel and pass a single buffer with the complete bitstream for a full
frame if that's what the hardware needs.
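
For those very simple decoders, the concatenation fallback could be as
small as the sketch below. This is a kernel-style pseudo-helper assuming
per-slice offsets/sizes are available (e.g. from a descriptor like the
one sketched earlier); a real driver would take the data from vb2
buffers and bounds-check it:

#include <linux/string.h>
#include <linux/types.h>

/* Pack the slices described by per-slice offsets/sizes into a contiguous
 * bounce buffer before handing a full frame to the hardware. */
static size_t example_pack_slices(u8 *bounce, const u8 *src,
				  const u32 *offsets, const u32 *sizes,
				  u32 num_slices)
{
	size_t used = 0;
	u32 i;

	for (i = 0; i < num_slices; i++) {
		memcpy(bounce + used, src + offsets[i], sizes[i]);
		used += sizes[i];
	}

	return used;
}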

The kernel can obviously assist userspace with making smart decisions by
advertising the capabilities of the kernel driver, but I think the above
is flexible enough to at least work in all cases, even if perhaps not at
the absolute best efficiency.

Allowing userspace to be flexible about how to pass this data will help
avoid situations where the data is contiguous in memory (as would be
common for file-based playback), userspace has to break it up into
individual slices, and the kernel then has to concatenate all of the
slices again, when it would be much easier to just use the contiguous
data directly (via something like userptr).

> - Dropping the DPB concept in H.264/H.265
> 
> As far as I could understand, the decoded picture buffer (DPB) is a
> concept that only makes sense relative to a decoder implementation. The
> spec mentions how to manage it with the Hypothetical reference decoder
> (Annex C), but that's about it.
> 
> What's really in the bitstream is the list of modified short-term and
> long-term references, which is enough for every decoder.
> 
> For this reason, I strongly believe we should stop talking about DPB in
> the controls and just pass these lists agremented with relevant
> information for userspace.
> 
> I think it should be up to the driver to maintain a DPB and we could
> have helpers for common cases. For instance, the rockchip decoder needs
> to keep unused entries around[2] and cedrus has the same requirement
> for H.264. However for cedrus/H.265, we don't need to do any book-
> keeping in particular and can manage with the lists from the bitstream
> directly.

There was a bit of concern regarding this. Given that DPB maintenance is
purely a software construct, this doesn't really belong in the kernel. A
DPB will be the same no matter what hardware operates on the bitstream.
Depending on the hardware it may use the DPB differently (or maybe not
at all), but that's beside the point, really. This is pretty much the
same rationale as discussed above for meta data.

Again, VAAPI and VDPAU don't require drivers to deal with this. Instead
they just get the final list of reference pictures, ready to use.
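
For comparison, this is roughly what the driver-facing side looks like
with VAAPI, where the application hands over the already-built reference
frame list in the picture parameters. The helper below is a simplified
sketch; only the libva types and defines are real, the rest is made up:

#include <va/va.h>

/* Fill VAPictureParameterBufferH264.ReferenceFrames from a reference
 * list the application maintains itself; unused entries are marked
 * invalid. */
static void example_fill_va_references(VAPictureParameterBufferH264 *pp,
				       const VAPictureH264 *refs,
				       unsigned int count)
{
	unsigned int i;

	for (i = 0; i < 16; i++) {
		if (i < count) {
			pp->ReferenceFrames[i] = refs[i];
		} else {
			pp->ReferenceFrames[i].picture_id = VA_INVALID_SURFACE;
			pp->ReferenceFrames[i].flags = VA_PICTURE_H264_INVALID;
		}
	}
}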

> - Using flags
> 
> The current MPEG-2 controls have lots of u8 values that can be
> represented as flags. Using flags also helps with padding.
> It's unlikely that we'll get more than 64 flags, so using a u64 by
> default for that sounds fine (we definitely do want to keep some room
> available and I don't think using 32 bits as a default is good enough).
> 
> I think H.264/HEVC per-control flags should also be moved to u64.

There was also some consensus on this, that u64 should be good enough
for anything out there, though we obviously don't know what the future
will hold, so perhaps adding some way for possible extending this in the
future might be good. I guess we'll get new controls for new codecs
anyway, so we can punt on this until then.

> - Clear split of controls and terminology
> 
> Some codecs have explicit NAL units that are good fits to match as
> controls: e.g. slice header, pps, sps. I think we should stick to the
> bitstream element names for those.
> 
> For H.264, that would suggest the following changes:
> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> - killing v4l2_ctrl_h264_decode_param and having the reference lists
> where they belong, which seems to be slice_header;
> 
> I'm up for preparing and submitting these control changes and updating
> cedrus if they seem agreeable.
> 
> What do you think?

One other thing that came up was with regards to frame-level vs. slice-
level decoding support. Turns out that we indeed have support for slice-
level decoding on old Tegra devices. Very recent ones also support it,
though only for HEVC but no other codecs. Everything between old and
very new chips (same goes for GPUs) doesn't have slice-level decoding.

Very old GPU decoders apparently also have an additional quirk in that
they provide two types of output frames. Generally what comes out of the
decoder as reference frames can't be used directly for display, so these
"shadow" frames have to be internally converted into something that can
be displayed. This restriction only applies to reference frames. I think
that's generally something that can be implemented in the driver, we
only need to make sure that whatever we hand back to userspace is data
that can be displayed. The shadow frame itself can be kept in internal
data structures and used as appropriate.

One final remark is that it might be interesting to reach out to people
that work on Vulkan video processing specification. I don't think much
of this is public yet, but I'm going to try to find some people
internally to compare notes. It'd be nice if V4L2 could serve as a
backend to the Vulkan Video API.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-03 11:24 ` Thierry Reding
@ 2019-06-03 18:52   ` Nicolas Dufresne
  2019-06-03 19:41     ` Boris Brezillon
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Nicolas Dufresne @ 2019-06-03 18:52 UTC (permalink / raw)
  To: Thierry Reding, Paul Kocialkowski
  Cc: Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 14416 bytes --]

Le lundi 03 juin 2019 à 13:24 +0200, Thierry Reding a écrit :
> On Wed, May 15, 2019 at 12:09:45PM +0200, Paul Kocialkowski wrote:
> > Hi,
> > 
> > With the Rockchip stateless VPU driver in the works, we now have a
> > better idea of what the situation is like on platforms other than
> > Allwinner. This email shares my conclusions about the situation and how
> > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > 
> > - Per-slice decoding
> > 
> > We've discussed this one already[0] and Hans has submitted a patch[1]
> > to implement the required core bits. When we agree it looks good, we
> > should lift the restriction that all slices must be concatenated and
> > have them submitted as individual requests.
> > 
> > One question is what to do about other controls. I feel like it would
> > make sense to always pass all the required controls for decoding the
> > slice, including the ones that don't change across slices. But there
> > may be no particular advantage to this and only downsides. Not doing it
> > and relying on the "control cache" can work, but we need to specify
> > that only a single stream can be decoded per opened instance of the
> > v4l2 device. This is the assumption we're going with for handling
> > multi-slice anyway, so it shouldn't be an issue.
> > 
> > - Annex-B formats
> > 
> > I don't think we have really reached a conclusion on the pixel formats
> > we want to expose. The main issue is how to deal with codecs that need
> > the full slice NALU with start code, where the slice_header is
> > duplicated in raw bitstream, when others are fine with just the encoded
> > slice data and the parsed slice header control.
> > 
> > My initial thinking was that we'd need 3 formats:
> > - One that only takes only the slice compressed data (without raw slice
> > header and start code);
> > - One that takes both the NALU data (including start code, raw header
> > and compressed data) and slice header controls;
> > - One that takes the NALU data but no slice header.
> > 
> > But I no longer think the latter really makes sense in the context of
> > stateless video decoding.
> > 
> > A side-note: I think we should definitely have data offsets in every
> > case, so that implementations can just push the whole NALU regardless
> > of the format if they're lazy.
> 
> I started an NVIDIA internal discussion about this to get some thoughts
> from our local experts and to fill in my gaps in understanding of NVIDIA
> hardware that we might want to support.
> 
> As far as input format goes, there was pretty broad consensus that in
> order for the ABI to be most broadly useful we need to settle on the
> lowest common denominator, while drawing some inspiration from existing
> APIs because they've already gone through a lot of these discussions and
> came up with standard interfaces to deal with the differences between
> decoders.

Note that we are making a statement with the stateless/stateful split.
The userspace overhead is non-negligible if you start passing all this
useless data to stateful HW. About other implementations, that's what
we went through in order to reach the state we are at now.

It's interesting that you had this discussion with NVIDIA specialists;
that being said, I think it would be better to provide the actual data
(how the different generations of HW work) before providing conclusions
made by your team. Right now, we have deeply studied the Cedrus, Hantro
and Rockchip IP, and that's how we managed to reach this low-overhead
compromise. What we really want to see is whether there is NVIDIA HW
that does not fit either of the two interfaces, and why.

> 
> In more concrete terms this means that we'll want to provide as much
> data to the kernel as possible. On one hand that means that we need to
> do all of the header parsing etc. in userspace and pass it to the kernel
> to support hardware that can't parse this data by itself. At the same
> time we want to provide the full bitstream to the kernel to make sure
> that hardware that does some (or all) of the parsing itself has access
> to this. We absolutely want to avoid having to reconstruct some of the
> bitstream that userspace may not have passed in order to optimize for
> some usecases.

Passing the entire bitstream without reconstruction is near impossible
for a VDPAU or VAAPI driver. Even for FFMPEG, it makes everything much
more complex. I think at some point we need to draw a line on what this
new API should cover.

An example here: we have decided to support a new format, H264_SLICE,
and this format has been defined as a "slice only" stream where the
PPS, SPS, etc. would be described in C structures. There is nothing
that prevents adding other formats in the future. What we would like is
for this to remain as inclusive as possible of the "slice" accelerators
we know, hence adding "per-frame" decoding, since we know the
"per-slice" decoding is compatible. We also know that this does not add
more work to existing userspace code that supports similar accelerators.

In fact, the first thing we kept in mind in our work is that it's very
difficult to implement this in userspace, so keeping compatibility with
existing VAAPI/VDPAU userspace (like the accelerators in FFMPEG and
GStreamer) in mind felt essential to lead toward a fully Open Source
solution.
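
As an illustration of what that userspace looks like against the
proposed interface, here is a rough sketch of submitting one slice
through the Request API (error handling trimmed, single-planar for
brevity; the slice parameters control ID is passed in by the caller
because the control names are still being discussed in this thread):

#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

static int example_submit_slice(int media_fd, int video_fd, __u32 buf_index,
				__u32 bytesused, __u32 slice_params_cid,
				void *slice_params, __u32 slice_params_size)
{
	struct v4l2_ext_control ctrl = {
		.id = slice_params_cid,
		.size = slice_params_size,
		.ptr = slice_params,
	};
	struct v4l2_ext_controls ctrls = {
		.which = V4L2_CTRL_WHICH_REQUEST_VAL,
		.count = 1,
		.controls = &ctrl,
	};
	struct v4l2_buffer buf = {
		.index = buf_index,
		.type = V4L2_BUF_TYPE_VIDEO_OUTPUT,
		.bytesused = bytesused,
		.memory = V4L2_MEMORY_MMAP,
		.flags = V4L2_BUF_FLAG_REQUEST_FD,
	};
	int req_fd;

	/* Allocate a request from the media device. */
	if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd))
		return -1;

	/* Attach the parsed slice parameters to the request. */
	ctrls.request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls))
		return -1;

	/* Queue the bitstream (OUTPUT) buffer against the same request. */
	buf.request_fd = req_fd;
	if (ioctl(video_fd, VIDIOC_QBUF, &buf))
		return -1;

	/* Submit; the driver picks it up once the request is queued. */
	return ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}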

> 
> Also, all bitstream parsing should be done in userspace, we don't want
> to require the kernel to have to deal with this. There's nothing in the
> bitstream that would be hardware-specific, so can all be done perfectly
> fine in userspace.
> 
> As for an interface on what to pass along, most people suggested that we
> pass both the raw bitstream along with a descriptor of what's contained
> in that bitstream. That descriptor would contain the number of slices
> contained in the bitstream chunk as well as per-slice data (such as the
> offset in the bitstream chunk for that slice and the number/ID of the
> slice). This is in addition to the extra meta data that we already pass
> for the codecs (PPS, SPS, ...). The slice information would allow
> drivers to point the hardware directly at the slice data if that's all
> it needs, but it can also be used for error concealment if corrupted
> slices are encountered. This would obviously require that controls can
> be passed on a per-buffer basis. I'm not sure if that's possible since
> the request API was already introduced to allow controls to be passed in
> a more fine-grained manner than setting them globally. I'm not sure how
> to pass per-buffer data in a nice way otherwise, but perhaps this is not
> even a real problem?
> 
> The above is pretty close to the Mesa pipe_video infrastructure as well
> as VAAPI and VDPAU, so I would expect most userspace to be able to deal
> well with such an ABI.
> 
> Userspace applications would decide what the appropriate way is to pass
> bitstream data. Network streaming applications may decide to send slices
> individually, so number of slices = 1 and the descriptor would contain a
> single entry for only that slice. For file-based playback userspace may
> decide to forward the complete bitstream for a full frame, in which case
> the descriptor would contain entries for all slices that make up the
> frame.
> 
> On the driver side, the above interface gives drivers the flexibility to
> implement what works best for them. Decoders that support scatter-gather
> can be programmed for each buffer that they receive. IOMMU-capable
> decoders can use the IOMMU to achieve something similar. And for very
> simple decoders, the driver can always decide to concatenate in the
> kernel and pass a single buffer with the complete bitstream for a full
> frame if that's what the hardware needs.
> 
> The kernel can obviously assist userspace with making smart decisions by
> advertising the capabilities of the kernel driver, but I think the above
> is flexible enough to at least work in all cases, even if perhaps not at
> the absolute best efficiency.
> 
> Allowing userspace to be flexible about how to pass this data will help
> avoid situations like where data is contiguous in memory (such as would
> be common for file-based playback) and then userspace having to break
> this up into individual slices and the kernel having to concatenate all
> of the slices, where it would be much easier to just use the contiguous
> data directly (via something like userptr).



> 
> > - Dropping the DPB concept in H.264/H.265
> > 
> > As far as I could understand, the decoded picture buffer (DPB) is a
> > concept that only makes sense relative to a decoder implementation. The
> > spec mentions how to manage it with the Hypothetical reference decoder
> > (Annex C), but that's about it.
> > 
> > What's really in the bitstream is the list of modified short-term and
> > long-term references, which is enough for every decoder.
> > 
> > For this reason, I strongly believe we should stop talking about DPB in
> > the controls and just pass these lists agremented with relevant
> > information for userspace.
> > 
> > I think it should be up to the driver to maintain a DPB and we could
> > have helpers for common cases. For instance, the rockchip decoder needs
> > to keep unused entries around[2] and cedrus has the same requirement
> > for H.264. However for cedrus/H.265, we don't need to do any book-
> > keeping in particular and can manage with the lists from the bitstream
> > directly.
> 
> There was a bit of concern regarding this. Given that DPB maintenance is
> purely a software construct, this doesn't really belong in the kernel. A
> DPB will be the same no matter what hardware operates on the bitstream.
> Depending on the hardware it may use the DPB differently (or maybe not
> at all), but that's beside the point, really. This is pretty much the
> same rationale as discussed above for meta data.
> 
> Again, VAAPI and VDPAU don't require drivers to deal with this. Instead
> they just get the final list of reference pictures, ready to use.

I think we need a bit of clarification from Boris, as what I read here
is a bit contradictory (or at least I am a bit confused). When I first
read this, I understood that this was just about renaming the DPB to
the reference list and only requiring the active references to be
there.

So what I'm not clear on is where exactly this "active reference list"
comes from. In VAAPI it is described "per-frame" ....

> 
> > - Using flags
> > 
> > The current MPEG-2 controls have lots of u8 values that can be
> > represented as flags. Using flags also helps with padding.
> > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > default for that sounds fine (we definitely do want to keep some room
> > available and I don't think using 32 bits as a default is good enough).
> > 
> > I think H.264/HEVC per-control flags should also be moved to u64.
> 
> There was also some concensus on this, that u64 should be good enough
> for anything out there, though we obviously don't know what the future
> will hold, so perhaps adding some way for possible extending this in the
> future might be good. I guess we'll get new controls for new codecs
> anyway, so we can punt on this until then.
> 
> > - Clear split of controls and terminology
> > 
> > Some codecs have explicit NAL units that are good fits to match as
> > controls: e.g. slice header, pps, sps. I think we should stick to the
> > bitstream element names for those.
> > 
> > For H.264, that would suggest the following changes:
> > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > where they belong, which seems to be slice_header;

But now here it's being described per slice. When I look at the slice
header, I only see a list of modifications, and when I look at
userspace, that list is simply built from the DPB; the modification
list found in the slice header only seems to be used to craft the
L0/L1 lists.

There is one thing that comes up though: if we enable per-frame decoding
on top of a per-slice decoder (like Cedrus), won't we force userspace to
always compute L0/L1 even though the HW might be handling that? Shall
we instead pass the modification list and implement the non-parsing
bits of applying the modifications in the kernel?

> > 
> > I'm up for preparing and submitting these control changes and updating
> > cedrus if they seem agreeable.
> > 
> > What do you think?
> 
> One other thing that came up was with regards to frame-level vs. slice-
> level decoding support. Turns out that we indeed have support for slice-
> level decoding on old Tegra devices. Very recent ones also support it,
> though only for HEVC but no other codecs. Everything between old and
> very new chips (same goes for GPUs) doesn't have slice-level decoding.
> 
> Very old GPU decoders apparently also have an additional quirk in that
> they provide two types of output frames. Generally what comes out of the
> decoder as reference frames can't be used directly for display, so these
> "shadow" frames have to be internally converted into something that can
> be displayed. This restriction only applies to reference frames. I think
> that's generally something that can be implemented in the driver, we
> only need to make sure that whatever we hand back to userspace is data
> that can be displayed. The shadow frame itself can be kept in internal
> data structures and used as appropriate.

This looks similar to what the CODA driver does; maybe that could be
used as inspiration. Though in their case it's simpler: the CODA tiling
is not supported by anything else in the SoC it was integrated on, so
they use some i.MX-specific feature to output a linear layout.

That gives me the impression that this can be done by the driver as you
said.

> 
> One final remark is that it might be interesting to reach out to people
> that work on Vulkan video processing specification. I don't think much
> of this is public yet, but I'm going to try to find some people
> internally to compare notes. It'd be nice if V4L2 could serve as a
> backend to the Vulkan Video API.
> 
> Thierry

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-03 18:52   ` Nicolas Dufresne
@ 2019-06-03 19:41     ` Boris Brezillon
  2019-06-04  8:31       ` Thierry Reding
  2019-06-04  8:50     ` Thierry Reding
  2019-06-04  8:55     ` Thierry Reding
  2 siblings, 1 reply; 55+ messages in thread
From: Boris Brezillon @ 2019-06-03 19:41 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Thierry Reding, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Mon, 03 Jun 2019 14:52:44 -0400
Nicolas Dufresne <nicolas@ndufresne.ca> wrote:

> > > - Dropping the DPB concept in H.264/H.265
> > > 
> > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > concept that only makes sense relative to a decoder implementation. The
> > > spec mentions how to manage it with the Hypothetical reference decoder
> > > (Annex C), but that's about it.
> > > 
> > > What's really in the bitstream is the list of modified short-term and
> > > long-term references, which is enough for every decoder.
> > > 
> > > For this reason, I strongly believe we should stop talking about DPB in
> > > the controls and just pass these lists agremented with relevant
> > > information for userspace.
> > > 
> > > I think it should be up to the driver to maintain a DPB and we could
> > > have helpers for common cases. For instance, the rockchip decoder needs
> > > to keep unused entries around[2] and cedrus has the same requirement
> > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > keeping in particular and can manage with the lists from the bitstream
> > > directly.  
> > 
> > There was a bit of concern regarding this. Given that DPB maintenance is
> > purely a software construct, this doesn't really belong in the kernel. A
> > DPB will be the same no matter what hardware operates on the bitstream.
> > Depending on the hardware it may use the DPB differently (or maybe not
> > at all), but that's beside the point, really. This is pretty much the
> > same rationale as discussed above for meta data.
> > 
> > Again, VAAPI and VDPAU don't require drivers to deal with this. Instead
> > they just get the final list of reference pictures, ready to use.  
> 
> I think we need a bit of clarification from Boris, as what I read here
> is a bit contradictory (or at least I am a bit confused). When I first
> read this, I understood that this was just about renaming the dpb as
> being the references list and only require the active references to be
> there.

It's really just about renaming the field, it would contain exactly the
same data.

> 
> So what I'm not clear is where exactly this "active reference list"
> comes from. In VAAPI it is describe "per-frame" ....

That's my understanding as well.

> 
> >   
> > > - Using flags
> > > 
> > > The current MPEG-2 controls have lots of u8 values that can be
> > > represented as flags. Using flags also helps with padding.
> > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > default for that sounds fine (we definitely do want to keep some room
> > > available and I don't think using 32 bits as a default is good enough).
> > > 
> > > I think H.264/HEVC per-control flags should also be moved to u64.  
> > 
> > There was also some concensus on this, that u64 should be good enough
> > for anything out there, though we obviously don't know what the future
> > will hold, so perhaps adding some way for possible extending this in the
> > future might be good. I guess we'll get new controls for new codecs
> > anyway, so we can punt on this until then.
> >   
> > > - Clear split of controls and terminology
> > > 
> > > Some codecs have explicit NAL units that are good fits to match as
> > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > bitstream element names for those.
> > > 
> > > For H.264, that would suggest the following changes:
> > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > where they belong, which seems to be slice_header;  
> 
> But now here it's being described per slice. When I look at the slice
> header, I only see list of modifications and when I look at userspace,
> That list is simply built from DPB, the modifications list found in the
> slice header seems to be only used to craft the l0/l1 list.

Yes, I think there was a misunderstanding which was then clarified
(unfortunately it happened on IRC, so we don't have a trace of this
discussion). The reference list should definitely be per-frame, and the
L0/L1 slice reflists refer to the per-frame reference list (it's
just a subset of the per-frame reflist re-ordered differently).
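
To illustrate the kind of re-ordering involved, here is a sketch of the
default RefPicList0 construction for a P slice from such a per-frame
list: short-term references by descending PicNum, then long-term
references by ascending LongTermPicNum, per section 8.2.4.2.1 of the
spec. The struct is made up for the example, and the slice header's
ref_pic_list_modification would still be applied on top:

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-frame reference entry, not an existing V4L2 structure. */
struct example_ref_pic {
	unsigned int buf_index;	/* capture buffer holding the reference */
	int pic_num;		/* PicNum for short-term references */
	int long_term_pic_num;	/* LongTermPicNum, or -1 if short-term */
};

static int cmp_short_term(const void *a, const void *b)
{
	const struct example_ref_pic *ra = a, *rb = b;

	/* descending PicNum */
	return (rb->pic_num > ra->pic_num) - (rb->pic_num < ra->pic_num);
}

static int cmp_long_term(const void *a, const void *b)
{
	const struct example_ref_pic *ra = a, *rb = b;

	/* ascending LongTermPicNum */
	return (ra->long_term_pic_num > rb->long_term_pic_num) -
	       (ra->long_term_pic_num < rb->long_term_pic_num);
}

/* Build the initial RefPicList0 of a P slice from the per-frame list. */
static size_t example_build_p_list0(const struct example_ref_pic *refs,
				    size_t count, struct example_ref_pic *l0)
{
	size_t i, n_short = 0, n_long = 0;

	for (i = 0; i < count; i++)
		if (refs[i].long_term_pic_num < 0)
			l0[n_short++] = refs[i];
	for (i = 0; i < count; i++)
		if (refs[i].long_term_pic_num >= 0)
			l0[n_short + n_long++] = refs[i];

	qsort(l0, n_short, sizeof(*l0), cmp_short_term);
	qsort(l0 + n_short, n_long, sizeof(*l0), cmp_long_term);

	return n_short + n_long;
}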

> 
> There is one thing that come up though, if we enable per-frame decoding
> on top of per-slice decoder (like Cedrus), won't we force userspace to
> always compute l0/l1 even though the HW might be handling that ?

That's true, the question is, what's the cost of this extra re-ordering?

> Shall
> we instead pass the modification list and implement the non-parsing
> bits of applying the modifications in the kernel ?

I'd be fine with that option too.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-03 19:41     ` Boris Brezillon
@ 2019-06-04  8:31       ` Thierry Reding
  2019-06-04  8:49         ` Boris Brezillon
  0 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-06-04  8:31 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Dufresne, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 5977 bytes --]

On Mon, Jun 03, 2019 at 09:41:17PM +0200, Boris Brezillon wrote:
> On Mon, 03 Jun 2019 14:52:44 -0400
> Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> 
> > > > - Dropping the DPB concept in H.264/H.265
> > > > 
> > > > As far as I could understand, the decoded picture buffer (DPB) is a
> > > > concept that only makes sense relative to a decoder implementation. The
> > > > spec mentions how to manage it with the Hypothetical reference decoder
> > > > (Annex C), but that's about it.
> > > > 
> > > > What's really in the bitstream is the list of modified short-term and
> > > > long-term references, which is enough for every decoder.
> > > > 
> > > > For this reason, I strongly believe we should stop talking about DPB in
> > > > the controls and just pass these lists agremented with relevant
> > > > information for userspace.
> > > > 
> > > > I think it should be up to the driver to maintain a DPB and we could
> > > > have helpers for common cases. For instance, the rockchip decoder needs
> > > > to keep unused entries around[2] and cedrus has the same requirement
> > > > for H.264. However for cedrus/H.265, we don't need to do any book-
> > > > keeping in particular and can manage with the lists from the bitstream
> > > > directly.  
> > > 
> > > There was a bit of concern regarding this. Given that DPB maintenance is
> > > purely a software construct, this doesn't really belong in the kernel. A
> > > DPB will be the same no matter what hardware operates on the bitstream.
> > > Depending on the hardware it may use the DPB differently (or maybe not
> > > at all), but that's beside the point, really. This is pretty much the
> > > same rationale as discussed above for meta data.
> > > 
> > > Again, VAAPI and VDPAU don't require drivers to deal with this. Instead
> > > they just get the final list of reference pictures, ready to use.  
> > 
> > I think we need a bit of clarification from Boris, as what I read here
> > is a bit contradictory (or at least I am a bit confused). When I first
> > read this, I understood that this was just about renaming the dpb as
> > being the references list and only require the active references to be
> > there.
> 
> It's really just about renaming the field, it would contain exactly the
> same data.
> 
> > 
> > So what I'm not clear is where exactly this "active reference list"
> > comes from. In VAAPI it is describe "per-frame" ....
> 
> That's my understanding as well.
> 
> > 
> > >   
> > > > - Using flags
> > > > 
> > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > represented as flags. Using flags also helps with padding.
> > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > default for that sounds fine (we definitely do want to keep some room
> > > > available and I don't think using 32 bits as a default is good enough).
> > > > 
> > > > I think H.264/HEVC per-control flags should also be moved to u64.  
> > > 
> > > There was also some concensus on this, that u64 should be good enough
> > > for anything out there, though we obviously don't know what the future
> > > will hold, so perhaps adding some way for possible extending this in the
> > > future might be good. I guess we'll get new controls for new codecs
> > > anyway, so we can punt on this until then.
> > >   
> > > > - Clear split of controls and terminology
> > > > 
> > > > Some codecs have explicit NAL units that are good fits to match as
> > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > bitstream element names for those.
> > > > 
> > > > For H.264, that would suggest the following changes:
> > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > where they belong, which seems to be slice_header;  
> > 
> > But now here it's being described per slice. When I look at the slice
> > header, I only see list of modifications and when I look at userspace,
> > That list is simply built from DPB, the modifications list found in the
> > slice header seems to be only used to craft the l0/l1 list.
> 
> Yes, I think there was a misunderstanding which was then clarified
> (unfortunately it happened on IRC, so we don't have a trace of this
> discussion). The reference list should definitely be per-frame, and the
> L0/L1 slice reflists are referring to the per-frame reference list (it's
> just a sub-set of the per-frame reflist re-ordered differently).
> 
> > 
> > There is one thing that come up though, if we enable per-frame decoding
> > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > always compute l0/l1 even though the HW might be handling that ?
> 
> That's true, the question is, what's the cost of this extra re-ordering?

I think ultimately userspace is already forced to compute these lists
even if some hardware may be able to do it in hardware. There's going to
be other hardware that userspace wants to support that can't do it by
itself, so userspace has to at least have the code anyway. What it could
do on top of that is decide not to run that code if it somehow detects that
hardware can do it already. On the other hand this means that we have to
expose a whole lot of capabilities to userspace and userspace has to go
and detect all of them in order to parameterize all of the code.

Ultimately I suspect many applications will just choose to pass the data
all the time out of simplicity. I mean drivers that don't need it will
already ignore it (i.e. they must not break if they get the extra data)
so other than the potential runtime savings on some hardware, there are
no advantages.

The fact that other APIs don't bother exposing this level of control to
applications makes me think that it's just not worth it from a
performance point of view.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  8:31       ` Thierry Reding
@ 2019-06-04  8:49         ` Boris Brezillon
  2019-06-04  9:06           ` Thierry Reding
  0 siblings, 1 reply; 55+ messages in thread
From: Boris Brezillon @ 2019-06-04  8:49 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Nicolas Dufresne, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Tue, 4 Jun 2019 10:31:57 +0200
Thierry Reding <thierry.reding@gmail.com> wrote:

> > > > > - Using flags
> > > > > 
> > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > represented as flags. Using flags also helps with padding.
> > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > 
> > > > > I think H.264/HEVC per-control flags should also be moved to u64.    
> > > > 
> > > > There was also some concensus on this, that u64 should be good enough
> > > > for anything out there, though we obviously don't know what the future
> > > > will hold, so perhaps adding some way for possible extending this in the
> > > > future might be good. I guess we'll get new controls for new codecs
> > > > anyway, so we can punt on this until then.
> > > >     
> > > > > - Clear split of controls and terminology
> > > > > 
> > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > bitstream element names for those.
> > > > > 
> > > > > For H.264, that would suggest the following changes:
> > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > where they belong, which seems to be slice_header;    
> > > 
> > > But now here it's being described per slice. When I look at the slice
> > > header, I only see list of modifications and when I look at userspace,
> > > That list is simply built from DPB, the modifications list found in the
> > > slice header seems to be only used to craft the l0/l1 list.  
> > 
> > Yes, I think there was a misunderstanding which was then clarified
> > (unfortunately it happened on IRC, so we don't have a trace of this
> > discussion). The reference list should definitely be per-frame, and the
> > L0/L1 slice reflists are referring to the per-frame reference list (it's
> > just a sub-set of the per-frame reflist re-ordered differently).
> >   
> > > 
> > > There is one thing that come up though, if we enable per-frame decoding
> > > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > > always compute l0/l1 even though the HW might be handling that ?  
> > 
> > That's true, the question is, what's the cost of this extra re-ordering?  
> 
> I think ultimately userspace is already forced to compute these lists
> even if some hardware may be able to do it in hardware. There's going to
> be other hardware that userspace wants to support that can't do it by
> itself, so userspace has to at least have the code anyway. What it could
> do on top of that decide not to run that code if it somehow detects that
> hardware can do it already. On the other hand this means that we have to
> expose a whole lot of capabilities to userspace and userspace has to go
> and detect all of them in order to parameterize all of the code.
> 
> Ultimately I suspect many applications will just choose to pass the data
> all the time out of simplicity. I mean drivers that don't need it will
> already ignore it (i.e. they must not break if they get the extra data)
> so other than the potential runtime savings on some hardware, there are
> no advantages.
> 
> Given that other APIs don't bother exposing this level of control to
> applications makes me think that it's just not worth it from a
> performance point of view.

That's not exactly what Nicolas proposed. He was suggesting that we
build those reflists kernel-side: V4L would provide a helper and
drivers that need those lists would use it, others won't. This way no
useless computation is done, and userspace doesn't even have to
bother checking the device caps to avoid this extra step.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-03 18:52   ` Nicolas Dufresne
  2019-06-03 19:41     ` Boris Brezillon
@ 2019-06-04  8:50     ` Thierry Reding
  2019-06-04  8:55     ` Thierry Reding
  2 siblings, 0 replies; 55+ messages in thread
From: Thierry Reding @ 2019-06-04  8:50 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 6738 bytes --]

On Mon, Jun 03, 2019 at 02:52:44PM -0400, Nicolas Dufresne wrote:
> Le lundi 03 juin 2019 à 13:24 +0200, Thierry Reding a écrit :
> > On Wed, May 15, 2019 at 12:09:45PM +0200, Paul Kocialkowski wrote:
> > > Hi,
> > > 
> > > With the Rockchip stateless VPU driver in the works, we now have a
> > > better idea of what the situation is like on platforms other than
> > > Allwinner. This email shares my conclusions about the situation and how
> > > we should update the MPEG-2, H.264 and H.265 controls accordingly.
> > > 
> > > - Per-slice decoding
> > > 
> > > We've discussed this one already[0] and Hans has submitted a patch[1]
> > > to implement the required core bits. When we agree it looks good, we
> > > should lift the restriction that all slices must be concatenated and
> > > have them submitted as individual requests.
> > > 
> > > One question is what to do about other controls. I feel like it would
> > > make sense to always pass all the required controls for decoding the
> > > slice, including the ones that don't change across slices. But there
> > > may be no particular advantage to this and only downsides. Not doing it
> > > and relying on the "control cache" can work, but we need to specify
> > > that only a single stream can be decoded per opened instance of the
> > > v4l2 device. This is the assumption we're going with for handling
> > > multi-slice anyway, so it shouldn't be an issue.
> > > 
> > > - Annex-B formats
> > > 
> > > I don't think we have really reached a conclusion on the pixel formats
> > > we want to expose. The main issue is how to deal with codecs that need
> > > the full slice NALU with start code, where the slice_header is
> > > duplicated in raw bitstream, when others are fine with just the encoded
> > > slice data and the parsed slice header control.
> > > 
> > > My initial thinking was that we'd need 3 formats:
> > > - One that only takes only the slice compressed data (without raw slice
> > > header and start code);
> > > - One that takes both the NALU data (including start code, raw header
> > > and compressed data) and slice header controls;
> > > - One that takes the NALU data but no slice header.
> > > 
> > > But I no longer think the latter really makes sense in the context of
> > > stateless video decoding.
> > > 
> > > A side-note: I think we should definitely have data offsets in every
> > > case, so that implementations can just push the whole NALU regardless
> > > of the format if they're lazy.
> > 
> > I started an NVIDIA internal discussion about this to get some thoughts
> > from our local experts and to fill in my gaps in understanding of NVIDIA
> > hardware that we might want to support.
> > 
> > As far as input format goes, there was pretty broad consensus that in
> > order for the ABI to be most broadly useful we need to settle on the
> > lowest common denominator, while drawing some inspiration from existing
> > APIs because they've already gone through a lot of these discussions and
> > came up with standard interfaces to deal with the differences between
> > decoders.
> 
> Note that we are making a statement with the stateless/stateful split.
> The userspace overhead is non-negligible if you start passing all this
> useless data to a stateful HW. As for other implementations, that's
> what we went through in order to reach the state we are at now.
> 
> It's interesting that you are having this discussion with NVIDIA
> specialists; that being said, I think it would be better to provide the
> actual data (how the different generations of HW work) before providing
> conclusions made by your team. Right now, we have deeply studied the
> Cedrus, Hantro and Rockchip IPs, and that's how we managed to reach
> this low-overhead compromise. What we really want to see is whether
> there exists NVIDIA HW that does not fit either of the two interfaces,
> and why.

Sorry if I was being condescending; that was not my intention. I was
trying to share what I was able to learn in the short time while the
discussion was happening.

If I understand correctly, I think NVIDIA hardware falls in the category
covered by the second interface, that is: NALU data (start code, raw
header, compressed data) and slice header controls.

I'm trying to get some other things out of the way first, but then I
hope to have time to go back to porting the VDE driver to V4L2 so that I
have something more concrete to contribute.
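
To make that a bit more concrete, here is a rough userspace sketch of
that second interface: slice bitstream in the OUTPUT buffer, parsed
headers passed as controls attached to a media request. The control IDs
and payload structs follow the h264-ctrls.h proposal being discussed
here and are assumptions rather than a settled API; error handling is
omitted.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>
/* v4l2_ctrl_h264_* structs and the control IDs below come from the
 * proposed h264-ctrls.h and may still change. */

static void decode_one_slice(int media_fd, int video_fd, unsigned int index,
                             size_t slice_len,
                             struct v4l2_ctrl_h264_sps *sps,
                             struct v4l2_ctrl_h264_pps *pps,
                             struct v4l2_ctrl_h264_slice_params *slice)
{
        struct v4l2_ext_control ctrls[3];
        struct v4l2_ext_controls ext;
        struct v4l2_buffer buf;
        int req_fd;

        /* One request per slice (or per frame, if that mode is added). */
        ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd);

        memset(ctrls, 0, sizeof(ctrls));
        ctrls[0].id = V4L2_CID_MPEG_VIDEO_H264_SPS;          /* assumed ID */
        ctrls[0].ptr = sps;
        ctrls[0].size = sizeof(*sps);
        ctrls[1].id = V4L2_CID_MPEG_VIDEO_H264_PPS;          /* assumed ID */
        ctrls[1].ptr = pps;
        ctrls[1].size = sizeof(*pps);
        ctrls[2].id = V4L2_CID_MPEG_VIDEO_H264_SLICE_PARAMS; /* assumed ID */
        ctrls[2].ptr = slice;
        ctrls[2].size = sizeof(*slice);

        memset(&ext, 0, sizeof(ext));
        ext.which = V4L2_CTRL_WHICH_REQUEST_VAL;
        ext.request_fd = req_fd;
        ext.count = 3;
        ext.controls = ctrls;
        ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ext);

        /* The OUTPUT buffer has already been filled with the slice data. */
        memset(&buf, 0, sizeof(buf));
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = index;
        buf.bytesused = slice_len;
        buf.flags = V4L2_BUF_FLAG_REQUEST_FD;
        buf.request_fd = req_fd;
        ioctl(video_fd, VIDIOC_QBUF, &buf);

        ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE);
}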

> > In more concrete terms this means that we'll want to provide as much
> > data to the kernel as possible. On one hand that means that we need to
> > do all of the header parsing etc. in userspace and pass it to the kernel
> > to support hardware that can't parse this data by itself. At the same
> > time we want to provide the full bitstream to the kernel to make sure
> > that hardware that does some (or all) of the parsing itself has access
> > to this. We absolutely want to avoid having to reconstruct some of the
> > bitstream that userspace may not have passed in order to optimize for
> > some usecases.
> 
> Passing the entire bitstream without reconstruction is near impossible
> for a VDPAU or VAAPI driver. Even for FFMPEG, it makes everything much
> more complex. I think at some point we need to draw a line on what this
> new API should cover.

I think that's totally reasonable. I'm just trying to make sure that
this is something that will work for Tegra. It'd be very unfortunate
if we had to do something else entirely because V4L2 didn't cover what
we need.

> As an example, we have decided to support a new format, H264_SLICE,
> and this format has been defined as a "slice only" stream where pps,
> sps, etc. would be described in C structures. There is nothing that
> prevents adding other formats in the future. What we would like is for
> this to remain as inclusive as possible of the "slice" accelerators we
> know when adding "per-frame" decoding, since we know the "per-slice"
> decoding is compatible. We also know that this does not add more work
> to existing userspace code that supports similar accelerators.
> 
> In fact, the first thing we kept in mind in our work is that it's very
> difficult to implement this in userspace, so keeping compatibility
> with the existing VAAPI/VDPAU userspace (like the accelerators in
> FFMPEG and GStreamer) in mind felt essential to lead toward a fully
> Open Source solution.

Okay, thanks for clarifying that. Sounds like I was misinterpreting
where the discussion was headed.

We'll most likely need something other than the H264_SLICE format for
Tegra, so as long as that's something you guys will remain open to, that
sounds good to me.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-03 18:52   ` Nicolas Dufresne
  2019-06-03 19:41     ` Boris Brezillon
  2019-06-04  8:50     ` Thierry Reding
@ 2019-06-04  8:55     ` Thierry Reding
  2019-06-04  9:05       ` Boris Brezillon
  2 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-06-04  8:55 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Paul Kocialkowski, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Boris Brezillon, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On Mon, Jun 03, 2019 at 02:52:44PM -0400, Nicolas Dufresne wrote:
[...]
> There is one thing that come up though, if we enable per-frame decoding
> on top of per-slice decoder (like Cedrus), won't we force userspace to
> always compute l0/l1 even though the HW might be handling that ? Shall
> we instead pass the modification list and implement the non-parsing
> bits of applying the modifications in the kernel ?

Applying the modifications is a standard procedure, right? If it's
completely driver-agnostic, it sounds to me like the right place to
perform the operation is in userspace.

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  8:55     ` Thierry Reding
@ 2019-06-04  9:05       ` Boris Brezillon
  2019-06-04  9:09         ` Paul Kocialkowski
  0 siblings, 1 reply; 55+ messages in thread
From: Boris Brezillon @ 2019-06-04  9:05 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Nicolas Dufresne, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

On Tue, 4 Jun 2019 10:55:03 +0200
Thierry Reding <thierry.reding@gmail.com> wrote:

> On Mon, Jun 03, 2019 at 02:52:44PM -0400, Nicolas Dufresne wrote:
> [...]
> > There is one thing that come up though, if we enable per-frame decoding
> > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > always compute l0/l1 even though the HW might be handling that ? Shall
> > we instead pass the modification list and implement the non-parsing
> > bits of applying the modifications in the kernel ?  
> 
> Applying the modifications is a standard procedure, right? If it's
> completely driver-agnostic, it sounds to me like the right place to
> perform the operation is in userspace.

Well, the counter argument to that is "drivers know better what's
needed by the HW", and if we want to avoid doing useless work without
having complex caps checking done in userspace, doing this task
kernel-side makes sense.
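
To illustrate what I mean (everything below is a made-up sketch, not a
proposal: names and fields are assumptions, and PicNum/FrameNumWrap
handling is ignored for brevity), such a core helper could derive the
default P reference list from the per-frame reference array, roughly
following 8.2.4.2.1, and only the drivers that need explicit lists would
call it:

#include <linux/types.h>

/* Per-reference data the core would already have from the per-frame
 * controls; field names are assumptions. */
struct hyp_h264_reference {
        u32  frame_num;         /* long_term_frame_idx when longterm is set */
        bool longterm;
        bool used;
};

/* Ordering for the default P reference list: short-term refs first, by
 * descending frame_num, then long-term refs by ascending index. */
static bool hyp_p_ref_before(const struct hyp_h264_reference *a,
                             const struct hyp_h264_reference *b)
{
        if (a->longterm != b->longterm)
                return !a->longterm;
        if (!a->longterm)
                return a->frame_num > b->frame_num;
        return a->frame_num < b->frame_num;
}

/* Fill @reflist with indices into @refs, sorted as above; returns the
 * number of entries. Reference lists are tiny, so insertion sort is fine. */
static unsigned int hyp_build_p_ref_list(const struct hyp_h264_reference *refs,
                                         unsigned int num_refs, u8 *reflist)
{
        unsigned int i, j, n = 0;

        for (i = 0; i < num_refs; i++) {
                if (!refs[i].used)
                        continue;

                for (j = n; j > 0 &&
                     hyp_p_ref_before(&refs[i], &refs[reflist[j - 1]]); j--)
                        reflist[j] = reflist[j - 1];

                reflist[j] = i;
                n++;
        }

        return n;
}

The B0/B1 lists would be built the same way with POC-based ordering, and
the per-slice modifications (8.2.4.3) applied on top only where the HW
wants them pre-applied.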

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  8:49         ` Boris Brezillon
@ 2019-06-04  9:06           ` Thierry Reding
  2019-06-04  9:15             ` Jonas Karlman
  0 siblings, 1 reply; 55+ messages in thread
From: Thierry Reding @ 2019-06-04  9:06 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Nicolas Dufresne, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia, Jonas Karlman

[-- Attachment #1: Type: text/plain, Size: 4756 bytes --]

On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:
> On Tue, 4 Jun 2019 10:31:57 +0200
> Thierry Reding <thierry.reding@gmail.com> wrote:
> 
> > > > > > - Using flags
> > > > > > 
> > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > represented as flags. Using flags also helps with padding.
> > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > 
> > > > > > I think H.264/HEVC per-control flags should also be moved to u64.    
> > > > > 
> > > > > There was also some concensus on this, that u64 should be good enough
> > > > > for anything out there, though we obviously don't know what the future
> > > > > will hold, so perhaps adding some way for possible extending this in the
> > > > > future might be good. I guess we'll get new controls for new codecs
> > > > > anyway, so we can punt on this until then.
> > > > >     
> > > > > > - Clear split of controls and terminology
> > > > > > 
> > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > bitstream element names for those.
> > > > > > 
> > > > > > For H.264, that would suggest the following changes:
> > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > where they belong, which seems to be slice_header;    
> > > > 
> > > > But now here it's being described per slice. When I look at the slice
> > > > header, I only see list of modifications and when I look at userspace,
> > > > That list is simply built from DPB, the modifications list found in the
> > > > slice header seems to be only used to craft the l0/l1 list.  
> > > 
> > > Yes, I think there was a misunderstanding which was then clarified
> > > (unfortunately it happened on IRC, so we don't have a trace of this
> > > discussion). The reference list should definitely be per-frame, and the
> > > L0/L1 slice reflists are referring to the per-frame reference list (it's
> > > just a sub-set of the per-frame reflist re-ordered differently).
> > >   
> > > > 
> > > > There is one thing that come up though, if we enable per-frame decoding
> > > > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > > > always compute l0/l1 even though the HW might be handling that ?  
> > > 
> > > That's true, the question is, what's the cost of this extra re-ordering?  
> > 
> > I think ultimately userspace is already forced to compute these lists
> > even if some hardware may be able to do it in hardware. There's going to
> > be other hardware that userspace wants to support that can't do it by
> > itself, so userspace has to at least have the code anyway. What it could
> > do on top of that decide not to run that code if it somehow detects that
> > hardware can do it already. On the other hand this means that we have to
> > expose a whole lot of capabilities to userspace and userspace has to go
> > and detect all of them in order to parameterize all of the code.
> > 
> > Ultimately I suspect many applications will just choose to pass the data
> > all the time out of simplicity. I mean drivers that don't need it will
> > already ignore it (i.e. they must not break if they get the extra data)
> > so other than the potential runtime savings on some hardware, there are
> > no advantages.
> > 
> > Given that other APIs don't bother exposing this level of control to
> > applications makes me think that it's just not worth it from a
> > performance point of view.
> 
> That's not exactly what Nicolas proposed. He was suggesting that we
> build those reflists kernel-side: V4L would provide an helper and
> drivers that need those lists would use it, others won't. This way we
> have no useless computation done, and userspace doesn't even have to
> bother checking the device caps to avoid this extra step.

Oh yeah, that sounds much better. I suppose one notable difference to
other APIs is that they have to pass in buffers for all the frames in
the DPB, so they basically have to build the lists in userspace. Since
we'll end up looking up the frames in the kernel, it sounds reasonable
to also build the lists in the kernel.

On that note, it would probably be useful to have some sort of helper
to get at all the buffers that make up the DPB in the kernel. That's got
to be something that everybody wants.
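
Purely as a sketch of what I have in mind (the surrounding structure and
names are made up for illustration): the stateless controls identify
references by capture buffer timestamp, and vb2_find_timestamp() already
exists, so the core could resolve the whole DPB in one place.

#include <linux/errno.h>
#include <media/videobuf2-v4l2.h>

/* Hypothetical helper: resolve the reference timestamps from the
 * per-frame controls into the capture buffers backing the DPB. */
static int hyp_get_dpb_buffers(struct vb2_queue *cap_q,
                               const u64 *reference_ts, unsigned int num_refs,
                               struct vb2_buffer **bufs)
{
        unsigned int i;

        for (i = 0; i < num_refs; i++) {
                int idx = vb2_find_timestamp(cap_q, reference_ts[i], 0);

                if (idx < 0)
                        return -ENOENT; /* reference no longer in the queue */

                bufs[i] = cap_q->bufs[idx];
        }

        return 0;
}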

Thierry

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  9:05       ` Boris Brezillon
@ 2019-06-04  9:09         ` Paul Kocialkowski
  0 siblings, 0 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-06-04  9:09 UTC (permalink / raw)
  To: Boris Brezillon, Thierry Reding
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia, Jonas Karlman

Hi,

On Tue, 2019-06-04 at 11:05 +0200, Boris Brezillon wrote:
> On Tue, 4 Jun 2019 10:55:03 +0200
> Thierry Reding <thierry.reding@gmail.com> wrote:
> 
> > On Mon, Jun 03, 2019 at 02:52:44PM -0400, Nicolas Dufresne wrote:
> > [...]
> > > There is one thing that come up though, if we enable per-frame decoding
> > > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > > always compute l0/l1 even though the HW might be handling that ? Shall
> > > we instead pass the modification list and implement the non-parsing
> > > bits of applying the modifications in the kernel ?  
> > 
> > Applying the modifications is a standard procedure, right? If it's
> > completely driver-agnostic, it sounds to me like the right place to
> > perform the operation is in userspace.
> 
> Well, the counter argument to that is "drivers know better what's
> needed by the HW", and if we want to avoid doing useless work without
> having complex caps checking done in userspace, doing this task
> kernel-side makes sense.

I believe we should also try and alleviate the pain on the user-space
side by having these decoder-specific details handled by the kernel.

It also brings us closer to the bitstream format (where the modifications
are coded) and leaves DPB management to the decoder/driver, which IMO
makes a lot of sense.

Cheers,

Paul

-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  9:06           ` Thierry Reding
@ 2019-06-04  9:15             ` Jonas Karlman
  2019-06-04  9:28               ` Paul Kocialkowski
  2019-06-04  9:38               ` Boris Brezillon
  0 siblings, 2 replies; 55+ messages in thread
From: Jonas Karlman @ 2019-06-04  9:15 UTC (permalink / raw)
  To: Thierry Reding, Boris Brezillon
  Cc: Nicolas Dufresne, Paul Kocialkowski, Linux Media Mailing List,
	Hans Verkuil, Tomasz Figa, Alexandre Courbot, Maxime Ripard,
	Jernej Skrabec, Ezequiel Garcia

On 2019-06-04 11:06, Thierry Reding wrote:
> On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:
>> On Tue, 4 Jun 2019 10:31:57 +0200
>> Thierry Reding <thierry.reding@gmail.com> wrote:
>>
>>>>>>> - Using flags
>>>>>>>
>>>>>>> The current MPEG-2 controls have lots of u8 values that can be
>>>>>>> represented as flags. Using flags also helps with padding.
>>>>>>> It's unlikely that we'll get more than 64 flags, so using a u64 by
>>>>>>> default for that sounds fine (we definitely do want to keep some room
>>>>>>> available and I don't think using 32 bits as a default is good enough).
>>>>>>>
>>>>>>> I think H.264/HEVC per-control flags should also be moved to u64.    
>>>>>> There was also some concensus on this, that u64 should be good enough
>>>>>> for anything out there, though we obviously don't know what the future
>>>>>> will hold, so perhaps adding some way for possible extending this in the
>>>>>> future might be good. I guess we'll get new controls for new codecs
>>>>>> anyway, so we can punt on this until then.
>>>>>>     
>>>>>>> - Clear split of controls and terminology
>>>>>>>
>>>>>>> Some codecs have explicit NAL units that are good fits to match as
>>>>>>> controls: e.g. slice header, pps, sps. I think we should stick to the
>>>>>>> bitstream element names for those.
>>>>>>>
>>>>>>> For H.264, that would suggest the following changes:
>>>>>>> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
>>>>>>> - killing v4l2_ctrl_h264_decode_param and having the reference lists
>>>>>>> where they belong, which seems to be slice_header;    
>>>>> But now here it's being described per slice. When I look at the slice
>>>>> header, I only see list of modifications and when I look at userspace,
>>>>> That list is simply built from DPB, the modifications list found in the
>>>>> slice header seems to be only used to craft the l0/l1 list.  
>>>> Yes, I think there was a misunderstanding which was then clarified
>>>> (unfortunately it happened on IRC, so we don't have a trace of this
>>>> discussion). The reference list should definitely be per-frame, and the
>>>> L0/L1 slice reflists are referring to the per-frame reference list (it's
>>>> just a sub-set of the per-frame reflist re-ordered differently).
>>>>   
>>>>> There is one thing that come up though, if we enable per-frame decoding
>>>>> on top of per-slice decoder (like Cedrus), won't we force userspace to
>>>>> always compute l0/l1 even though the HW might be handling that ?  
>>>> That's true, the question is, what's the cost of this extra re-ordering?  
>>> I think ultimately userspace is already forced to compute these lists
>>> even if some hardware may be able to do it in hardware. There's going to
>>> be other hardware that userspace wants to support that can't do it by
>>> itself, so userspace has to at least have the code anyway. What it could
>>> do on top of that decide not to run that code if it somehow detects that
>>> hardware can do it already. On the other hand this means that we have to
>>> expose a whole lot of capabilities to userspace and userspace has to go
>>> and detect all of them in order to parameterize all of the code.
>>>
>>> Ultimately I suspect many applications will just choose to pass the data
>>> all the time out of simplicity. I mean drivers that don't need it will
>>> already ignore it (i.e. they must not break if they get the extra data)
>>> so other than the potential runtime savings on some hardware, there are
>>> no advantages.
>>>
>>> Given that other APIs don't bother exposing this level of control to
>>> applications makes me think that it's just not worth it from a
>>> performance point of view.
>> That's not exactly what Nicolas proposed. He was suggesting that we
>> build those reflists kernel-side: V4L would provide an helper and
>> drivers that need those lists would use it, others won't. This way we
>> have no useless computation done, and userspace doesn't even have to
>> bother checking the device caps to avoid this extra step.
> Oh yeah, that sounds much better. I suppose one notable difference to
> other APIs is that they have to pass in buffers for all the frames in
> the DPB, so they basically have to build the lists in userspace. Since
> we'll end up looking up the frames in the kernel, it sounds reasonable
> to also build the lists in the kernel.

Userspace must already process the modification list or it won't have correct DPB for next frame.
If you move this processing into kernel side you also introduce state into the stateless driver.

Regards,
Jonas
>
> On that note, it would probably be useful to have some sort of helper
> to get at all the buffers that make up the DPB in the kernel. That's got
> to be something that everybody wants.
>
> Thierry


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  9:15             ` Jonas Karlman
@ 2019-06-04  9:28               ` Paul Kocialkowski
  2019-06-04  9:38               ` Boris Brezillon
  1 sibling, 0 replies; 55+ messages in thread
From: Paul Kocialkowski @ 2019-06-04  9:28 UTC (permalink / raw)
  To: Jonas Karlman, Thierry Reding, Boris Brezillon
  Cc: Nicolas Dufresne, Linux Media Mailing List, Hans Verkuil,
	Tomasz Figa, Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia

Hi,

On Tue, 2019-06-04 at 09:15 +0000, Jonas Karlman wrote:
> On 2019-06-04 11:06, Thierry Reding wrote:
> > On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:
> > > On Tue, 4 Jun 2019 10:31:57 +0200
> > > Thierry Reding <thierry.reding@gmail.com> wrote:
> > > 
> > > > > > > > - Using flags
> > > > > > > > 
> > > > > > > > The current MPEG-2 controls have lots of u8 values that can be
> > > > > > > > represented as flags. Using flags also helps with padding.
> > > > > > > > It's unlikely that we'll get more than 64 flags, so using a u64 by
> > > > > > > > default for that sounds fine (we definitely do want to keep some room
> > > > > > > > available and I don't think using 32 bits as a default is good enough).
> > > > > > > > 
> > > > > > > > I think H.264/HEVC per-control flags should also be moved to u64.    
> > > > > > > There was also some concensus on this, that u64 should be good enough
> > > > > > > for anything out there, though we obviously don't know what the future
> > > > > > > will hold, so perhaps adding some way for possible extending this in the
> > > > > > > future might be good. I guess we'll get new controls for new codecs
> > > > > > > anyway, so we can punt on this until then.
> > > > > > >     
> > > > > > > > - Clear split of controls and terminology
> > > > > > > > 
> > > > > > > > Some codecs have explicit NAL units that are good fits to match as
> > > > > > > > controls: e.g. slice header, pps, sps. I think we should stick to the
> > > > > > > > bitstream element names for those.
> > > > > > > > 
> > > > > > > > For H.264, that would suggest the following changes:
> > > > > > > > - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> > > > > > > > - killing v4l2_ctrl_h264_decode_param and having the reference lists
> > > > > > > > where they belong, which seems to be slice_header;    
> > > > > > But now here it's being described per slice. When I look at the slice
> > > > > > header, I only see list of modifications and when I look at userspace,
> > > > > > That list is simply built from DPB, the modifications list found in the
> > > > > > slice header seems to be only used to craft the l0/l1 list.  
> > > > > Yes, I think there was a misunderstanding which was then clarified
> > > > > (unfortunately it happened on IRC, so we don't have a trace of this
> > > > > discussion). The reference list should definitely be per-frame, and the
> > > > > L0/L1 slice reflists are referring to the per-frame reference list (it's
> > > > > just a sub-set of the per-frame reflist re-ordered differently).
> > > > >   
> > > > > > There is one thing that come up though, if we enable per-frame decoding
> > > > > > on top of per-slice decoder (like Cedrus), won't we force userspace to
> > > > > > always compute l0/l1 even though the HW might be handling that ?  
> > > > > That's true, the question is, what's the cost of this extra re-ordering?  
> > > > I think ultimately userspace is already forced to compute these lists
> > > > even if some hardware may be able to do it in hardware. There's going to
> > > > be other hardware that userspace wants to support that can't do it by
> > > > itself, so userspace has to at least have the code anyway. What it could
> > > > do on top of that decide not to run that code if it somehow detects that
> > > > hardware can do it already. On the other hand this means that we have to
> > > > expose a whole lot of capabilities to userspace and userspace has to go
> > > > and detect all of them in order to parameterize all of the code.
> > > > 
> > > > Ultimately I suspect many applications will just choose to pass the data
> > > > all the time out of simplicity. I mean drivers that don't need it will
> > > > already ignore it (i.e. they must not break if they get the extra data)
> > > > so other than the potential runtime savings on some hardware, there are
> > > > no advantages.
> > > > 
> > > > Given that other APIs don't bother exposing this level of control to
> > > > applications makes me think that it's just not worth it from a
> > > > performance point of view.
> > > That's not exactly what Nicolas proposed. He was suggesting that we
> > > build those reflists kernel-side: V4L would provide an helper and
> > > drivers that need those lists would use it, others won't. This way we
> > > have no useless computation done, and userspace doesn't even have to
> > > bother checking the device caps to avoid this extra step.
> > Oh yeah, that sounds much better. I suppose one notable difference to
> > other APIs is that they have to pass in buffers for all the frames in
> > the DPB, so they basically have to build the lists in userspace. Since
> > we'll end up looking up the frames in the kernel, it sounds reasonable
> > to also build the lists in the kernel.
> 
> Userspace must already process the modification list or it won't have
> correct DPB for next frame.
> If you move this processing into kernel side you also introduce state
> into the stateless driver.

There is state in the form of the m2m context anyway, so I don't think
that's a concern in particular. We've been using the "stateless"
terminology all around, but it's really about programming the decoding
registers directly versus passing raw bitstream through a mailbox
interface, rather than a strict stateless/stateful distinction.

Cheers,

Paul

> Regards,
> Jonas
> > On that note, it would probably be useful to have some sort of helper
> > to get at all the buffers that make up the DPB in the kernel. That's got
> > to be something that everybody wants.
> > 
> > Thierry
-- 
Paul Kocialkowski, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  9:15             ` Jonas Karlman
  2019-06-04  9:28               ` Paul Kocialkowski
@ 2019-06-04  9:38               ` Boris Brezillon
  2019-06-04 10:49                 ` Jonas Karlman
  1 sibling, 1 reply; 55+ messages in thread
From: Boris Brezillon @ 2019-06-04  9:38 UTC (permalink / raw)
  To: Jonas Karlman
  Cc: Thierry Reding, Nicolas Dufresne, Paul Kocialkowski,
	Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia

On Tue, 4 Jun 2019 09:15:28 +0000
Jonas Karlman <jonas@kwiboo.se> wrote:

> On 2019-06-04 11:06, Thierry Reding wrote:
> > On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:  
> >> On Tue, 4 Jun 2019 10:31:57 +0200
> >> Thierry Reding <thierry.reding@gmail.com> wrote:
> >>  
> >>>>>>> - Using flags
> >>>>>>>
> >>>>>>> The current MPEG-2 controls have lots of u8 values that can be
> >>>>>>> represented as flags. Using flags also helps with padding.
> >>>>>>> It's unlikely that we'll get more than 64 flags, so using a u64 by
> >>>>>>> default for that sounds fine (we definitely do want to keep some room
> >>>>>>> available and I don't think using 32 bits as a default is good enough).
> >>>>>>>
> >>>>>>> I think H.264/HEVC per-control flags should also be moved to u64.      
> >>>>>> There was also some concensus on this, that u64 should be good enough
> >>>>>> for anything out there, though we obviously don't know what the future
> >>>>>> will hold, so perhaps adding some way for possible extending this in the
> >>>>>> future might be good. I guess we'll get new controls for new codecs
> >>>>>> anyway, so we can punt on this until then.
> >>>>>>       
> >>>>>>> - Clear split of controls and terminology
> >>>>>>>
> >>>>>>> Some codecs have explicit NAL units that are good fits to match as
> >>>>>>> controls: e.g. slice header, pps, sps. I think we should stick to the
> >>>>>>> bitstream element names for those.
> >>>>>>>
> >>>>>>> For H.264, that would suggest the following changes:
> >>>>>>> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
> >>>>>>> - killing v4l2_ctrl_h264_decode_param and having the reference lists
> >>>>>>> where they belong, which seems to be slice_header;      
> >>>>> But now here it's being described per slice. When I look at the slice
> >>>>> header, I only see list of modifications and when I look at userspace,
> >>>>> That list is simply built from DPB, the modifications list found in the
> >>>>> slice header seems to be only used to craft the l0/l1 list.    
> >>>> Yes, I think there was a misunderstanding which was then clarified
> >>>> (unfortunately it happened on IRC, so we don't have a trace of this
> >>>> discussion). The reference list should definitely be per-frame, and the
> >>>> L0/L1 slice reflists are referring to the per-frame reference list (it's
> >>>> just a sub-set of the per-frame reflist re-ordered differently).
> >>>>     
> >>>>> There is one thing that come up though, if we enable per-frame decoding
> >>>>> on top of per-slice decoder (like Cedrus), won't we force userspace to
> >>>>> always compute l0/l1 even though the HW might be handling that ?    
> >>>> That's true, the question is, what's the cost of this extra re-ordering?    
> >>> I think ultimately userspace is already forced to compute these lists
> >>> even if some hardware may be able to do it in hardware. There's going to
> >>> be other hardware that userspace wants to support that can't do it by
> >>> itself, so userspace has to at least have the code anyway. What it could
> >>> do on top of that decide not to run that code if it somehow detects that
> >>> hardware can do it already. On the other hand this means that we have to
> >>> expose a whole lot of capabilities to userspace and userspace has to go
> >>> and detect all of them in order to parameterize all of the code.
> >>>
> >>> Ultimately I suspect many applications will just choose to pass the data
> >>> all the time out of simplicity. I mean drivers that don't need it will
> >>> already ignore it (i.e. they must not break if they get the extra data)
> >>> so other than the potential runtime savings on some hardware, there are
> >>> no advantages.
> >>>
> >>> Given that other APIs don't bother exposing this level of control to
> >>> applications makes me think that it's just not worth it from a
> >>> performance point of view.  
> >> That's not exactly what Nicolas proposed. He was suggesting that we
> >> build those reflists kernel-side: V4L would provide an helper and
> >> drivers that need those lists would use it, others won't. This way we
> >> have no useless computation done, and userspace doesn't even have to
> >> bother checking the device caps to avoid this extra step.  
> > Oh yeah, that sounds much better. I suppose one notable difference to
> > other APIs is that they have to pass in buffers for all the frames in
> > the DPB, so they basically have to build the lists in userspace. Since
> > we'll end up looking up the frames in the kernel, it sounds reasonable
> > to also build the lists in the kernel.  
> 
> Userspace must already process the modification list or it won't have correct DPB for next frame.

Can you point us to the code or the section in the spec that
mentions/proves this dependency? I might have missed something, but my
understanding was that the slice ref lists (or the list of
modifications to apply to the list of long/short refs attached to a
frame) had no impact on the list of long/short refs attached to the
following frame.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-04  9:38               ` Boris Brezillon
@ 2019-06-04 10:49                 ` Jonas Karlman
  0 siblings, 0 replies; 55+ messages in thread
From: Jonas Karlman @ 2019-06-04 10:49 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Thierry Reding, Nicolas Dufresne, Paul Kocialkowski,
	Linux Media Mailing List, Hans Verkuil, Tomasz Figa,
	Alexandre Courbot, Maxime Ripard, Jernej Skrabec,
	Ezequiel Garcia

On 2019-06-04 11:38, Boris Brezillon wrote:
> On Tue, 4 Jun 2019 09:15:28 +0000
> Jonas Karlman <jonas@kwiboo.se> wrote:
>
>> On 2019-06-04 11:06, Thierry Reding wrote:
>>> On Tue, Jun 04, 2019 at 10:49:21AM +0200, Boris Brezillon wrote:  
>>>> On Tue, 4 Jun 2019 10:31:57 +0200
>>>> Thierry Reding <thierry.reding@gmail.com> wrote:
>>>>  
>>>>>>>>> - Using flags
>>>>>>>>>
>>>>>>>>> The current MPEG-2 controls have lots of u8 values that can be
>>>>>>>>> represented as flags. Using flags also helps with padding.
>>>>>>>>> It's unlikely that we'll get more than 64 flags, so using a u64 by
>>>>>>>>> default for that sounds fine (we definitely do want to keep some room
>>>>>>>>> available and I don't think using 32 bits as a default is good enough).
>>>>>>>>>
>>>>>>>>> I think H.264/HEVC per-control flags should also be moved to u64.      
>>>>>>>> There was also some concensus on this, that u64 should be good enough
>>>>>>>> for anything out there, though we obviously don't know what the future
>>>>>>>> will hold, so perhaps adding some way for possible extending this in the
>>>>>>>> future might be good. I guess we'll get new controls for new codecs
>>>>>>>> anyway, so we can punt on this until then.
>>>>>>>>       
>>>>>>>>> - Clear split of controls and terminology
>>>>>>>>>
>>>>>>>>> Some codecs have explicit NAL units that are good fits to match as
>>>>>>>>> controls: e.g. slice header, pps, sps. I think we should stick to the
>>>>>>>>> bitstream element names for those.
>>>>>>>>>
>>>>>>>>> For H.264, that would suggest the following changes:
>>>>>>>>> - renaming v4l2_ctrl_h264_decode_param to v4l2_ctrl_h264_slice_header;
>>>>>>>>> - killing v4l2_ctrl_h264_decode_param and having the reference lists
>>>>>>>>> where they belong, which seems to be slice_header;      
>>>>>>> But now here it's being described per slice. When I look at the slice
>>>>>>> header, I only see list of modifications and when I look at userspace,
>>>>>>> That list is simply built from DPB, the modifications list found in the
>>>>>>> slice header seems to be only used to craft the l0/l1 list.    
>>>>>> Yes, I think there was a misunderstanding which was then clarified
>>>>>> (unfortunately it happened on IRC, so we don't have a trace of this
>>>>>> discussion). The reference list should definitely be per-frame, and the
>>>>>> L0/L1 slice reflists are referring to the per-frame reference list (it's
>>>>>> just a sub-set of the per-frame reflist re-ordered differently).
>>>>>>     
>>>>>>> There is one thing that come up though, if we enable per-frame decoding
>>>>>>> on top of per-slice decoder (like Cedrus), won't we force userspace to
>>>>>>> always compute l0/l1 even though the HW might be handling that ?    
>>>>>> That's true, the question is, what's the cost of this extra re-ordering?    
>>>>> I think ultimately userspace is already forced to compute these lists
>>>>> even if some hardware may be able to do it in hardware. There's going to
>>>>> be other hardware that userspace wants to support that can't do it by
>>>>> itself, so userspace has to at least have the code anyway. What it could
>>>>> do on top of that decide not to run that code if it somehow detects that
>>>>> hardware can do it already. On the other hand this means that we have to
>>>>> expose a whole lot of capabilities to userspace and userspace has to go
>>>>> and detect all of them in order to parameterize all of the code.
>>>>>
>>>>> Ultimately I suspect many applications will just choose to pass the data
>>>>> all the time out of simplicity. I mean drivers that don't need it will
>>>>> already ignore it (i.e. they must not break if they get the extra data)
>>>>> so other than the potential runtime savings on some hardware, there are
>>>>> no advantages.
>>>>>
>>>>> Given that other APIs don't bother exposing this level of control to
>>>>> applications makes me think that it's just not worth it from a
>>>>> performance point of view.  
>>>> That's not exactly what Nicolas proposed. He was suggesting that we
>>>> build those reflists kernel-side: V4L would provide an helper and
>>>> drivers that need those lists would use it, others won't. This way we
>>>> have no useless computation done, and userspace doesn't even have to
>>>> bother checking the device caps to avoid this extra step.  
>>> Oh yeah, that sounds much better. I suppose one notable difference to
>>> other APIs is that they have to pass in buffers for all the frames in
>>> the DPB, so they basically have to build the lists in userspace. Since
>>> we'll end up looking up the frames in the kernel, it sounds reasonable
>>> to also build the lists in the kernel.  
>> Userspace must already process the modification list or it won't have correct DPB for next frame.
> Can you point us to the code or the section in the spec that
> mentions/proves this dependency? I might have missed something, but my
> understanding was that the slice ref lists (or the list of
> modifications to apply to the list of long/short refs attached to a
> frame) had no impact on the list of long/short refs attached to the
> following frame.

I must have mixed up the marking process with the modification process.
You seem to be correct: the modification process should not affect the
short/long-term reference marking, if I understand the spec and code
correctly.

Regards,
Jonas

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-05-22 10:55                             ` Hans Verkuil
  2019-05-22 11:55                               ` Thierry Reding
@ 2019-06-07  6:11                               ` Tomasz Figa
  2019-06-07  6:45                                 ` Hans Verkuil
  1 sibling, 1 reply; 55+ messages in thread
From: Tomasz Figa @ 2019-06-07  6:11 UTC (permalink / raw)
  To: Hans Verkuil
  Cc: Thierry Reding, Paul Kocialkowski, Nicolas Dufresne,
	Jernej Škrabec, Linux Media Mailing List, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Ezequiel Garcia, Jonas Karlman

On Wed, May 22, 2019 at 7:56 PM Hans Verkuil <hverkuil-cisco@xs4all.nl> wrote:
>
> On 5/22/19 12:42 PM, Thierry Reding wrote:
> > On Wed, May 22, 2019 at 10:26:28AM +0200, Paul Kocialkowski wrote:
> >> Hi,
> >>
> >> Le mercredi 22 mai 2019 à 15:48 +0900, Tomasz Figa a écrit :
> >>> On Sat, May 18, 2019 at 11:09 PM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> >>>> Le samedi 18 mai 2019 à 12:29 +0200, Paul Kocialkowski a écrit :
> >>>>> Hi,
> >>>>>
> >>>>> Le samedi 18 mai 2019 à 12:04 +0200, Jernej Škrabec a écrit :
> >>>>>> Dne sobota, 18. maj 2019 ob 11:50:37 CEST je Paul Kocialkowski napisal(a):
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On Fri, 2019-05-17 at 16:43 -0400, Nicolas Dufresne wrote:
> >>>>>>>> Le jeudi 16 mai 2019 à 20:45 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Le jeudi 16 mai 2019 à 14:24 -0400, Nicolas Dufresne a écrit :
> >>>>>>>>>> Le mercredi 15 mai 2019 à 22:59 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Le mercredi 15 mai 2019 à 14:54 -0400, Nicolas Dufresne a écrit :
> >>>>>>>>>>>> Le mercredi 15 mai 2019 à 19:42 +0200, Paul Kocialkowski a écrit :
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Le mercredi 15 mai 2019 à 10:42 -0400, Nicolas Dufresne a écrit
> >>>>>> :
> >>>>>>>>>>>>>> Le mercredi 15 mai 2019 à 12:09 +0200, Paul Kocialkowski a
> >>>>>> écrit :
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> With the Rockchip stateless VPU driver in the works, we now
> >>>>>>>>>>>>>>> have a
> >>>>>>>>>>>>>>> better idea of what the situation is like on platforms other
> >>>>>>>>>>>>>>> than
> >>>>>>>>>>>>>>> Allwinner. This email shares my conclusions about the
> >>>>>>>>>>>>>>> situation and how
> >>>>>>>>>>>>>>> we should update the MPEG-2, H.264 and H.265 controls
> >>>>>>>>>>>>>>> accordingly.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Per-slice decoding
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We've discussed this one already[0] and Hans has submitted a
> >>>>>>>>>>>>>>> patch[1]
> >>>>>>>>>>>>>>> to implement the required core bits. When we agree it looks
> >>>>>>>>>>>>>>> good, we
> >>>>>>>>>>>>>>> should lift the restriction that all slices must be
> >>>>>>>>>>>>>>> concatenated and
> >>>>>>>>>>>>>>> have them submitted as individual requests.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One question is what to do about other controls. I feel like
> >>>>>>>>>>>>>>> it would
> >>>>>>>>>>>>>>> make sense to always pass all the required controls for
> >>>>>>>>>>>>>>> decoding the
> >>>>>>>>>>>>>>> slice, including the ones that don't change across slices.
> >>>>>>>>>>>>>>> But there
> >>>>>>>>>>>>>>> may be no particular advantage to this and only downsides.
> >>>>>>>>>>>>>>> Not doing it
> >>>>>>>>>>>>>>> and relying on the "control cache" can work, but we need to
> >>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>> that only a single stream can be decoded per opened instance
> >>>>>>>>>>>>>>> of the
> >>>>>>>>>>>>>>> v4l2 device. This is the assumption we're going with for
> >>>>>>>>>>>>>>> handling
> >>>>>>>>>>>>>>> multi-slice anyway, so it shouldn't be an issue.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My opinion on this is that the m2m instance is a state, and
> >>>>>>>>>>>>>> the driver
> >>>>>>>>>>>>>> should be responsible of doing time-division multiplexing
> >>>>>>>>>>>>>> across
> >>>>>>>>>>>>>> multiple m2m instance jobs. Doing the time-division
> >>>>>>>>>>>>>> multiplexing in
> >>>>>>>>>>>>>> userspace would require some sort of daemon to work properly
> >>>>>>>>>>>>>> across
> >>>>>>>>>>>>>> processes. I also think the kernel is better place for doing
> >>>>>>>>>>>>>> resource
> >>>>>>>>>>>>>> access scheduling in general.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree with that yes. We always have a single m2m context and
> >>>>>>>>>>>>> specific
> >>>>>>>>>>>>> controls per opened device so keeping cached values works out
> >>>>>>>>>>>>> well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> So maybe we shall explicitly require that the request with the
> >>>>>>>>>>>>> first
> >>>>>>>>>>>>> slice for a frame also contains the per-frame controls.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> - Annex-B formats
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think we have really reached a conclusion on the
> >>>>>>>>>>>>>>> pixel formats
> >>>>>>>>>>>>>>> we want to expose. The main issue is how to deal with codecs
> >>>>>>>>>>>>>>> that need
> >>>>>>>>>>>>>>> the full slice NALU with start code, where the slice_header
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>> duplicated in raw bitstream, when others are fine with just
> >>>>>>>>>>>>>>> the encoded
> >>>>>>>>>>>>>>> slice data and the parsed slice header control.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> My initial thinking was that we'd need 3 formats:
> >>>>>>>>>>>>>>> - One that only takes only the slice compressed data
> >>>>>>>>>>>>>>> (without raw slice
> >>>>>>>>>>>>>>> header and start code);
> >>>>>>>>>>>>>>> - One that takes both the NALU data (including start code,
> >>>>>>>>>>>>>>> raw header
> >>>>>>>>>>>>>>> and compressed data) and slice header controls;
> >>>>>>>>>>>>>>> - One that takes the NALU data but no slice header.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> But I no longer think the latter really makes sense in the
> >>>>>>>>>>>>>>> context of
> >>>>>>>>>>>>>>> stateless video decoding.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> A side-note: I think we should definitely have data offsets
> >>>>>>>>>>>>>>> in every
> >>>>>>>>>>>>>>> case, so that implementations can just push the whole NALU
> >>>>>>>>>>>>>>> regardless
> >>>>>>>>>>>>>>> of the format if they're lazy.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I realize that I didn't share our latest research on the
> >>>>>>>>>>>>>> subject. So a
> >>>>>>>>>>>>>> slice in the original bitstream is formed of the following
> >>>>>>>>>>>>>> blocks
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> (simplified):
> >>>>>>>>>>>>>>   [nal_header][nal_type][slice_header][slice]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the details!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> nal_header:
> >>>>>>>>>>>>>> This one is a header used to locate the start and the end of
> >>>>>>>>>>>>>> the of a
> >>>>>>>>>>>>>> NAL. There is two standard forms, the ANNEX B / start code, a
> >>>>>>>>>>>>>> sequence
> >>>>>>>>>>>>>> of 3 bytes 0x00 0x00 0x01, you'll often see 4 bytes, the first
> >>>>>>>>>>>>>> byte
> >>>>>>>>>>>>>> would be a leading 0 from the previous NAL padding, but this
> >>>>>>>>>>>>>> is also
> >>>>>>>>>>>>>> totally valid start code. The second form is the AVC form,
> >>>>>>>>>>>>>> notably used
> >>>>>>>>>>>>>> in ISOMP4 container. It simply is the size of the NAL. You
> >>>>>>>>>>>>>> must keep
> >>>>>>>>>>>>>> your buffer aligned to NALs in this case as you cannot scan
> >>>>>>>>>>>>>> from random
> >>>>>>>>>>>>>> location.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> nal_type:
> >>>>>>>>>>>>>> It's a bit more then just the type, but it contains at least
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> information of the nal type. This has different size on H.264
> >>>>>>>>>>>>>> and HEVC
> >>>>>>>>>>>>>> but I know it's size is in bytes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> slice_header:
> >>>>>>>>>>>>>> This contains per slice parameters, like the modification
> >>>>>>>>>>>>>> lists to
> >>>>>>>>>>>>>> apply on the references. This one has a size in bits, not in
> >>>>>>>>>>>>>> bytes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> slice:
> >>>>>>>>>>>>>> I don't really know what is in it exactly, but this is the
> >>>>>>>>>>>>>> data used to
> >>>>>>>>>>>>>> decode. This bit has a special coding called the
> >>>>>>>>>>>>>> anti-emulation, which
> >>>>>>>>>>>>>> prevents a start-code from appearing in it. This coding is
> >>>>>>>>>>>>>> present in
> >>>>>>>>>>>>>> both forms, ANNEX-B or AVC (in GStreamer and some reference
> >>>>>>>>>>>>>> manual they
> >>>>>>>>>>>>>> call ANNEX-B the bytestream format).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So, what we notice is that what is currently passed through
> >>>>>>>>>>>>>> Cedrus
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> driver:
> >>>>>>>>>>>>>>   [nal_type][slice_header][slice]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This matches what is being passed through VA-API. We can
> >>>>>>>>>>>>>> understand
> >>>>>>>>>>>>>> that stripping off the slice_header would be hard, since it's
> >>>>>>>>>>>>>> size is
> >>>>>>>>>>>>>> in bits. Instead we pass size and header_bit_size in
> >>>>>>>>>>>>>> slice_params.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> True, there is that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> About Rockchip. RK3288 is a Hantro G1 and has a bit called
> >>>>>>>>>>>>>> start_code_e, when you turn this off, you don't need start
> >>>>>>>>>>>>>> code. As a
> >>>>>>>>>>>>>> side effect, the bitstream becomes identical. We do now know
> >>>>>>>>>>>>>> that it
> >>>>>>>>>>>>>> works with the ffmpeg branch implement for cedrus.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Oh great, that makes life easier in the short term, but I guess
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>> issue could arise on another decoder sooner or later.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Now what's special about Hantro G1 (also found on IMX8M) is
> >>>>>>>>>>>>>> that it
> >>>>>>>>>>>>>> take care for us of reading and executing the modification
> >>>>>>>>>>>>>> lists found
> >>>>>>>>>>>>>> in the slice header. Mostly because I very disliked having to
> >>>>>>>>>>>>>> pass the
> >>>>>>>>>>>>>> p/b0/b1 parameters, is that Boris implemented in the driver
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> transformation from the DPB entries into this p/b0/b1 list.
> >>>>>>>>>>>>>> These list
> >>>>>>>>>>>>>> a standard, it's basically implementing 8.2.4.1 and 8.2.4.2.
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> following section is the execution of the modification list.
> >>>>>>>>>>>>>> As this
> >>>>>>>>>>>>>> list is not modified, it only need to be calculated per frame.
> >>>>>>>>>>>>>> As a
> >>>>>>>>>>>>>> result, we don't need these new lists, and we can work with
> >>>>>>>>>>>>>> the same
> >>>>>>>>>>>>>> H264_SLICE format as Cedrus is using.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yes but I definitely think it makes more sense to pass the list
> >>>>>>>>>>>>> modifications rather than reconstructing those in the driver
> >>>>>>>>>>>>> from a
> >>>>>>>>>>>>> full list. IMO controls should stick to the bitstream as close
> >>>>>>>>>>>>> as
> >>>>>>>>>>>>> possible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For Hantro and RKVDEC, the list of modification is parsed by the
> >>>>>>>>>>>> IP
> >>>>>>>>>>>> from the slice header bits. Just to make sure, because I myself
> >>>>>>>>>>>> was
> >>>>>>>>>>>> confused on this before, the slice header does not contain a list
> >>>>>>>>>>>> of
> >>>>>>>>>>>> references, instead it contains a list modification to be applied
> >>>>>>>>>>>> to
> >>>>>>>>>>>> the reference list. I need to check again, but to execute these
> >>>>>>>>>>>> modification, you need to filter and sort the references in a
> >>>>>>>>>>>> specific
> >>>>>>>>>>>> order. This should be what is defined in the spec as 8.2.4.1 and
> >>>>>>>>>>>> 8.2.4.2. Then 8.2.4.3 is the process that creates the l0/l1.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The list of references is deduced from the DPB. The DPB, which I
> >>>>>>>>>>>> thinks
> >>>>>>>>>>>> should be rename as "references", seems more useful then p/b0/b1,
> >>>>>>>>>>>> since
> >>>>>>>>>>>> this is the data that gives use the ability to implementing glue
> >>>>>>>>>>>> in the
> >>>>>>>>>>>> driver to compensate some HW differences.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In the case of Hantro / RKVDEC, we think it's natural to build the
> >>>>>>>>>>>> HW
> >>>>>>>>>>>> specific lists (p/b0/b1) from the references rather then adding HW
> >>>>>>>>>>>> specific list in the decode_params structure. The fact these lists
> >>>>>>>>>>>> are
> >>>>>>>>>>>> standard intermediate step of the standard is not that important.
> >>>>>>>>>>>
> >>>>>>>>>>> Sorry I got confused (once more) about it. Boris just explained the
> >>>>>>>>>>> same thing to me over IRC :) Anyway my point is that we want to pass
> >>>>>>>>>>> what's in ffmpeg's short and long term ref lists, and name them that
> >>>>>>>>>>> instead of dpb.
> >>>>>>>>>>>
> >>>>>>>>>>>>>> Now, this is just a start. For RK3399, we have a different
> >>>>>>>>>>>>>> CODEC
> >>>>>>>>>>>>>> design. This one does not have the start_code_e bit. What the
> >>>>>>>>>>>>>> IP does,
> >>>>>>>>>>>>>> is that you give it one or more slice per buffer, setup the
> >>>>>>>>>>>>>> params,
> >>>>>>>>>>>>>> start decoding, but the decoder then return the location of
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> following NAL. So basically you could offload the scanning of
> >>>>>>>>>>>>>> start
> >>>>>>>>>>>>>> code to the HW. That being said, with the driver layer in
> >>>>>>>>>>>>>> between, that
> >>>>>>>>>>>>>> would be amazingly inconvenient to use, and with Boyer-more
> >>>>>>>>>>>>>> algorithm,
> >>>>>>>>>>>>>> it is pretty cheap to scan this type of start-code on CPU. But
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> feature that this allows is to operate in frame mode. In this
> >>>>>>>>>>>>>> mode, you
> >>>>>>>>>>>>>> have 1 interrupt per frame.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm not sure there is any interest in exposing that from
> >>>>>>>>>>>>> userspace and
> >>>>>>>>>>>>> my current feeling is that we should just ditch support for
> >>>>>>>>>>>>> per-frame
> >>>>>>>>>>>>> decoding altogether. I think it mixes decoding with notions that
> >>>>>>>>>>>>> are
> >>>>>>>>>>>>> higher-level than decoding, but I agree it's a blurry line.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not worried about this either. We can already support that by
> >>>>>>>>>>>> copying the bitstream internally to the driver, though zero-copy
> >>>>>>>>>>>> with
> >>>>>>>>>>>> this would require a new format, the one we talked about,
> >>>>>>>>>>>> SLICE_ANNEX_B.
> >>>>>>>>>>>
> >>>>>>>>>>> Right, but what I'm thinking about is making that the one and only
> >>>>>>>>>>> format. The rationale is that it's always easier to just append a
> >>>>>>>>>>> start
> >>>>>>>>>>> code from userspace if needed. And we need a bit offset to the slice
> >>>>>>>>>>> data part anyway, so it doesn't hurt to require a few extra bits to
> >>>>>>>>>>> have the whole thing that will work in every situation.
> >>>>>>>>>>
> >>>>>>>>>> What I'd like is to eventually allow zero-copy (aka userptr) into the
> >>>>>>>>>> driver. If you make the start code mandatory, any decoding from ISOMP4
> >>>>>>>>>> (.mp4, .mov) will require a full bitstream copy in userspace to add
> >>>>>>>>>> the
> >>>>>>>>>> start code (unless you hack your allocation in your demuxer, but it's
> >>>>>>>>>> a
> >>>>>>>>>> bit complicated since this code might come from two libraries). In
> >>>>>>>>>> ISOMP4, you have an AVC header, which is just the size of the NAL that
> >>>>>>>>>> follows.
> >>>>>>>>>
> >>>>>>>>> Well, I think we have to do a copy from system memory to the buffer
> >>>>>>>>> allocated by v4l2 anyway. Our hardware pipelines can reasonably be
> >>>>>>>>> expected not to have any MMU unit and not allow sg import anyway.
> >>>>>>>>
> >>>>>>>> The Rockchip has an mmu. You need one copy at least indeed,
> >>>>>>>
> >>>>>>> Is the MMU in use currently? That can make things troublesome if we run
> >>>>>>> into a case where the VPU has MMU and deals with scatter-gather while
> >>>>>>> the display part doesn't. As far as I know, there's no way for
> >>>>>>> userspace to know whether a dma-buf-exported buffer is backed by CMA or
> >>>>>>> by scatter-gather memory. This feels like a major issue for using dma-
> >>>>>>> buf, since userspace can't predict whether a buffer exported on one
> >>>>>>> device can be imported on another when building its pipeline.
> >>>>>>
> >>>>>> FYI, Allwinner H6 also has IOMMU, it's just that there is no mainline driver
> >>>>>> for it yet. It is supported for display, both VPUs and some other devices. I
> >>>>>> think no sane SoC designer would left out one or another unit without IOMMU
> >>>>>> support, that just calls for troubles, as you pointed out.
> >>>>>
> >>>>> Right right, I've been following that from a distance :)
> >>>>>
> >>>>> Indeed I think it's realistic to expect that for now, but it may not
> >>>>> play out so well in the long term. For instance, maybe connecting a USB
> >>>>> display would require CMA when the rest of the system can do with sg.
> >>>>>
> >>>>> I think it would really be useful for userspace to have a way to test
> >>>>> whether a buffer can be imported from one device to another. It feels
> >>>>> better than indicating where the memory lives, since there are
> >>>>> countless cases where additional restrictions apply too.
> >>>>
> >>>> I don't know for the integration on the Rockchip, but I did notice the
> >>>> register documentation for it.
> >>>
> >>> All the important components in the SoC have their IOMMUs as well -
> >>> display controller, GPU.
> >>>
> >>> There is a blitter called RGA that is not behind an IOMMU, but has
> >>> some scatter-gather capability (with a need for the hardware sg table
> >>> to be physically contiguous).
> >>
> >> That's definitely good to know and justfies the need to introduce a way
> >> for userspace to check if a buffer can be imported from one device to
> >> another.
> >
> > There's been a lot of discussion about this before. You may be aware of
> > James Jones' attempt to create an allocator library for this:
> >
> >       https://github.com/cubanismo/allocator
> >
> > I haven't heard an update on this for quite some time and I think it's
> > stagnated due to a lack of interest. However, I think the lack of
> > interest could be an indicator that the issue might not be pressing
> > enough. Luckily most SoCs are reasonably integrated, so there's usually
> > no issue sharing buffers between different hardware blocks.
> >
> > Technically it's already possible to check for compatibility of buffers
> > at import time.
> >
> > In the tegra-vde driver we do something along the lines of:
> >
> >       sgt = dma_buf_map_attachment(...);
> >       ...
> >       if (sgt->nents != 1)
> >               return -EINVAL;
> >
> > because we don't support an IOMMU currently. Of course its still up to
> > userspace to react to that in a sensible way and it may not be obvious
> > what to do when the import fails.
> >
> >>> That said, significance of such blitters
> >>> nowadays is rather low, as most of the time you need a compositor on
> >>> the GPU anyway, which can do any transformation in the same pass as
> >>> the composition.
> >>
> >> I think that is a crucial mistake and the way I see things, this will
> >> have to change eventually. We cannot keep under-using the fast and
> >> efficient hardware components and going with the war machine that is
> >> the GPU in all situations. This has caused enough trouble in the
> >> GNU/Linux userspace display stack already and I strongly believe it has
> >> to stop.
> >
> > Unfortunately there's really no good API to develop drivers against. All
> > of the 2D APIs that exist are not really efficient when implemented via
> > hardware-accelerated drivers. And none of the attempts at defining an
> > API for hardware-accelerated 2D has really gained any momentum.
> >
> > I had looked a bit at ways to make use of some compositing hardware that
> > we have on Tegra (which is like a blender/blitter of a sort) and the
> > best thing I could find would've been to accelerate some paths in Mesa.
> > However that would require quite a bit of infrastructure work because it
> > currently completely relies on GPU shaders to accelerate those paths.
> >
> > Daniel has written a very interesting bit about this, in case you
> > haven't seen it yet:
> >
> >       https://blog.ffwll.ch/2018/08/no-2d-in-drm.html
> >
> >>>> In general, the most significant gain
> >>>> with having an IOMMU for CODECs is that it makes start-up (and re-init)
> >>>> time much shorter, and also much more predictable in duration. I do
> >>>> believe that the Venus driver (Qualcomm) is one with solid support for
> >>>> this, and it's quite noticeably snappier than the others.
> >>>
> >>> Obviously you also get support for USERPTR if you have an IOMMU, but
> >>> that also has some costs - you need to pin the user pages and map to
> >>> the IOMMU before each frame and unmap and unpin after each frame,
> >>> which sometimes is more costly than actually having the userspace copy
> >>> to a preallocated and premapped buffer, especially for relatively
> >>> small contents, such as compressed bitstream.
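
To put that cost in concrete terms, each USERPTR buffer ends up doing
roughly the following dance (generic sketch, not any particular driver's
code; uaddr, nr_pages, offset and size are placeholders, error handling
omitted):

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	pinned = get_user_pages_fast(uaddr, nr_pages, 0, pages);

	sg_alloc_table_from_pages(&sgt, pages, pinned, offset, size,
				  GFP_KERNEL);
	dma_map_sg(dev, sgt.sgl, sgt.nents, DMA_TO_DEVICE);

	/* ... hardware consumes the bitstream ... */

	dma_unmap_sg(dev, sgt.sgl, sgt.nents, DMA_TO_DEVICE);
	sg_free_table(&sgt);
	release_pages(pages, pinned);
	kvfree(pages);
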
> >>
> >> Heh, interesting point!
> >
> > I share the same experience. Bitstream buffers are usually so small that
> > you can always find a physically contiguous memory region for them and a
> > memcpy() will be faster than the overhead of getting an IOMMU involved.
> > This obviously depends on the specific hardware, but there's always some
> > threshold before which mapping through an IOMMU just doesn't make sense
> > from a fragmentation and/or performance point of view.
> >
> > I wonder, though, if it's not possible to keep userptr buffers around
> > and avoid the constant mapping/unmapping. If we only performed cache
> > maintenance on them as necessary, perhaps that could provide a viable,
> > maybe even good, zero-copy mechanism.
>
> The vb2 framework will keep the mapping for a userptr as long as userspace
> uses the same userptr for every buffer.
>
> I.e. the first time a buffer with index I is queued the userptr is mapped.
> If that buffer is later dequeued and then requeued again with the same
> userptr the vb2 core will reuse the old mapping. Otherwise it will unmap
> and map again with the new userptr.

That's a good point. I forgot that we've been seeing random memory
corruptions (fortunately of the userptr memory only, not random system
memory) because of this behavior and carrying a patch in all
downstream branches to remove this caching.

I can see that we keep references on the pages that corresponded to
the user VMA at the time the buffer was queued, but are we guaranteed
that the list of pages backing that VMA hasn't changed over time?
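
As far as I can tell, the reuse decision is based purely on the virtual
address and length of each plane, something along the lines of
(simplified sketch, not the exact vb2 code):

	if (vb->planes[plane].m.userptr == userptr &&
	    vb->planes[plane].length == length) {
		/* keep the previously pinned pages and mapping */
	} else {
		/* release the old pages, pin and map the new userptr */
	}

so nothing revalidates that the same pages still back that address when
the buffer is requeued.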

>
> The same is done for dmabuf, BTW. So if userspace keeps changing dmabuf
> fds for each buffer, then that is not optimal.

We could possibly try to search through the other buffers and reuse
the mapping if there is a match?
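
Something like this at DMABUF QBUF time, perhaps (illustrative sketch
only, error handling omitted, plane handling simplified):

	struct dma_buf *dbuf = dma_buf_get(fd);
	unsigned int i;

	for (i = 0; i < q->num_buffers; i++) {
		struct vb2_buffer *other = q->bufs[i];

		/*
		 * Same underlying dma-buf already attached to another
		 * buffer: reuse its attachment/mapping instead of
		 * creating a new one.
		 */
		if (other->planes[plane].dbuf == dbuf) {
			...
		}
	}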

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-07  6:11                               ` Tomasz Figa
@ 2019-06-07  6:45                                 ` Hans Verkuil
  2019-06-07  8:23                                   ` Hans Verkuil
  0 siblings, 1 reply; 55+ messages in thread
From: Hans Verkuil @ 2019-06-07  6:45 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Thierry Reding, Paul Kocialkowski, Nicolas Dufresne,
	Jernej Škrabec, Linux Media Mailing List, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Ezequiel Garcia, Jonas Karlman

On 6/7/19 8:11 AM, Tomasz Figa wrote:
> On Wed, May 22, 2019 at 7:56 PM Hans Verkuil <hverkuil-cisco@xs4all.nl> wrote:
>>> I share the same experience. Bitstream buffers are usually so small that
>>> you can always find a physically contiguous memory region for them and a
>>> memcpy() will be faster than the overhead of getting an IOMMU involved.
>>> This obviously depends on the specific hardware, but there's always some
>>> threshold before which mapping through an IOMMU just doesn't make sense
>>> from a fragmentation and/or performance point of view.
>>>
>>> I wonder, though, if it's not possible to keep userptr buffers around
>>> and avoid the constant mapping/unmapping. If we only performed cache
>>> maintenance on them as necessary, perhaps that could provide a viable,
>>> maybe even good, zero-copy mechanism.
>>
>> The vb2 framework will keep the mapping for a userptr as long as userspace
>> uses the same userptr for every buffer.
>>
>> I.e. the first time a buffer with index I is queued the userptr is mapped.
>> If that buffer is later dequeued and then requeued again with the same
>> userptr the vb2 core will reuse the old mapping. Otherwise it will unmap
>> and map again with the new userptr.
> 
> That's a good point. I forgot that we've been seeing random memory
> corruptions (fortunately of the userptr memory only, not random system
> memory) because of this behavior and carrying a patch in all
> downstream branches to remove this caching.
> 
> I can see that we keep references on the pages that corresponded to
> the user VMA at the time the buffer was queued, but are we guaranteed
> that the list of pages backing that VMA hasn't changed over time?

Since you are seeing memory corruptions, the answer to this is perhaps 'no'?

I think the (quite possibly faulty) reasoning was that while memory is mapped,
userspace can't do a free()/malloc() pair and end up with the same address.

I suspect this might be a wrong assumption, and in that case we're better off
removing this check.
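
For large allocations malloc() is typically backed by mmap()/munmap(), so
a sequence like this could hand the application the same virtual address
with different pages behind it (illustrative user-space sketch):

	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* VIDIOC_QBUF with USERPTR = buf, VIDIOC_DQBUF, ... */
	munmap(buf, size);

	/* A later allocation may land at the same virtual address. */
	void *buf2 = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/*
	 * If buf2 == buf and it is queued to the same buffer index, the
	 * cached mapping points at pages that no longer back this VMA.
	 */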

But I'd like to have some confirmation that it is really wrong.

USERPTR isn't used very often, so it wouldn't surprise me if it is buggy.

Regards,

	Hans

> 
>>
>> The same is done for dmabuf, BTW. So if userspace keeps changing dmabuf
>> fds for each buffer, then that is not optimal.
> 
> We could possibly try to search through the other buffers and reuse
> the mapping if there is a match?
> 
> Best regards,
> Tomasz
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support
  2019-06-07  6:45                                 ` Hans Verkuil
@ 2019-06-07  8:23                                   ` Hans Verkuil
  0 siblings, 0 replies; 55+ messages in thread
From: Hans Verkuil @ 2019-06-07  8:23 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Thierry Reding, Paul Kocialkowski, Nicolas Dufresne,
	Jernej Škrabec, Linux Media Mailing List, Alexandre Courbot,
	Boris Brezillon, Maxime Ripard, Ezequiel Garcia, Jonas Karlman

On 6/7/19 8:45 AM, Hans Verkuil wrote:
> On 6/7/19 8:11 AM, Tomasz Figa wrote:
>> On Wed, May 22, 2019 at 7:56 PM Hans Verkuil <hverkuil-cisco@xs4all.nl> wrote:
>>>> I share the same experience. Bitstream buffers are usually so small that
>>>> you can always find a physically contiguous memory region for them and a
>>>> memcpy() will be faster than the overhead of getting an IOMMU involved.
>>>> This obviously depends on the specific hardware, but there's always some
>>>> threshold before which mapping through an IOMMU just doesn't make sense
>>>> from a fragmentation and/or performance point of view.
>>>>
>>>> I wonder, though, if it's not possible to keep userptr buffers around
>>>> and avoid the constant mapping/unmapping. If we only performed cache
>>>> maintenance on them as necessary, perhaps that could provide a viable,
>>>> maybe even good, zero-copy mechanism.
>>>
>>> The vb2 framework will keep the mapping for a userptr as long as userspace
>>> uses the same userptr for every buffer.
>>>
>>> I.e. the first time a buffer with index I is queued the userptr is mapped.
>>> If that buffer is later dequeued and then requeued again with the same
>>> userptr the vb2 core will reuse the old mapping. Otherwise it will unmap
>>> and map again with the new userptr.
>>
>> That's a good point. I forgot that we've been seeing random memory
>> corruptions (fortunately of the userptr memory only, not random system
>> memory) because of this behavior and carrying a patch in all
>> downstream branches to remove this caching.
>>
>> I can see that we keep references on the pages that corresponded to
>> the user VMA at the time the buffer was queued, but are we guaranteed
>> that the list of pages backing that VMA hasn't changed over time?
> 
> Since you are seeing memory corruptions, the answer to this is perhaps 'no'?
> 
> I think the (quite possibly faulty) reasoning was that while memory is mapped,
> userspace can't do a free()/malloc() pair and end up with the same address.
> 
> I suspect this might be a wrong assumption, and in that case we're better off
> removing this check.
> 
> But I'd like to have some confirmation that it is really wrong.

I did some testing, and indeed, this doesn't work.

A patch fixing this will be posted soon.

Regards,

	Hans

> 
> USERPTR isn't used very often, so it wouldn't surprise me if it is buggy.
> 
> Regards,
> 
> 	Hans
> 
>>
>>>
>>> The same is done for dmabuf, BTW. So if userspace keeps changing dmabuf
>>> fds for each buffer, then that is not optimal.
>>
>> We could possibly try to search through the other buffers and reuse
>> the mapping if there is a match?
>>
>> Best regards,
>> Tomasz
>>
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread

Thread overview: 55+ messages
2019-05-15 10:09 Proposed updates and guidelines for MPEG-2, H.264 and H.265 stateless support Paul Kocialkowski
2019-05-15 14:42 ` Nicolas Dufresne
2019-05-15 17:42   ` Paul Kocialkowski
2019-05-15 18:54     ` Nicolas Dufresne
2019-05-15 20:59       ` Paul Kocialkowski
2019-05-16 18:24         ` Nicolas Dufresne
2019-05-16 18:45           ` Paul Kocialkowski
2019-05-17 20:43             ` Nicolas Dufresne
2019-05-18  9:50               ` Paul Kocialkowski
2019-05-18 10:04                 ` Jernej Škrabec
2019-05-18 10:29                   ` Paul Kocialkowski
2019-05-18 14:09                     ` Nicolas Dufresne
2019-05-22  6:48                       ` Tomasz Figa
2019-05-22  8:26                         ` Paul Kocialkowski
2019-05-22 10:42                           ` Thierry Reding
2019-05-22 10:55                             ` Hans Verkuil
2019-05-22 11:55                               ` Thierry Reding
2019-06-07  6:11                               ` Tomasz Figa
2019-06-07  6:45                                 ` Hans Verkuil
2019-06-07  8:23                                   ` Hans Verkuil
2019-05-21 10:27     ` Tomasz Figa
2019-05-21 11:44       ` Paul Kocialkowski
2019-05-21 15:09         ` Thierry Reding
2019-05-21 16:07           ` Nicolas Dufresne
2019-05-22  8:08             ` Thierry Reding
2019-05-22  6:01         ` Tomasz Figa
2019-05-22 18:15           ` Nicolas Dufresne
2019-05-21 15:43     ` Thierry Reding
2019-05-21 16:23       ` Nicolas Dufresne
2019-05-22  6:39         ` Tomasz Figa
2019-05-22  7:29           ` Boris Brezillon
2019-05-22  8:20             ` Boris Brezillon
2019-05-22 18:18               ` Nicolas Dufresne
2019-05-22  8:32             ` Thierry Reding
2019-05-22  9:29               ` Paul Kocialkowski
2019-05-22 11:39                 ` Thierry Reding
2019-05-22 18:31                   ` Nicolas Dufresne
2019-05-22 18:26                 ` Nicolas Dufresne
2019-05-22 10:08         ` Thierry Reding
2019-05-22 18:37           ` Nicolas Dufresne
2019-05-23 21:04 ` Jonas Karlman
2019-06-03 11:24 ` Thierry Reding
2019-06-03 18:52   ` Nicolas Dufresne
2019-06-03 19:41     ` Boris Brezillon
2019-06-04  8:31       ` Thierry Reding
2019-06-04  8:49         ` Boris Brezillon
2019-06-04  9:06           ` Thierry Reding
2019-06-04  9:15             ` Jonas Karlman
2019-06-04  9:28               ` Paul Kocialkowski
2019-06-04  9:38               ` Boris Brezillon
2019-06-04 10:49                 ` Jonas Karlman
2019-06-04  8:50     ` Thierry Reding
2019-06-04  8:55     ` Thierry Reding
2019-06-04  9:05       ` Boris Brezillon
2019-06-04  9:09         ` Paul Kocialkowski
