On 05/05, Joshua Ashton wrote:
> Some corrections and replies inline.
> 
> On Fri, 5 May 2023 at 12:42, Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 04 May 2023 15:22:59 +0000
> > Simon Ser <contact@emersion.fr> wrote:
> >
> > > Hi all,
> > >
> > > The goal of this RFC is to expose a generic KMS uAPI to configure the color
> > > pipeline before blending, i.e. after a pixel is tapped from a plane's
> > > framebuffer and before it's blended with other planes. With this new uAPI we
> > > aim to reduce the battery life impact of color management and HDR on mobile
> > > devices, to improve performance and to decrease latency by skipping
> > > composition on the 3D engine. This proposal is the result of discussions at
> > > the Red Hat HDR hackfest [1] which took place a few days ago. Engineers
> > > familiar with the AMD, Intel and NVIDIA hardware have participated in the
> > > discussion.
> >
> > Hi Simon,
> >
> > this is an excellent write-up, thank you!
> >
> > Harry's question about what constitutes UAPI is a good one for danvet.
> >
> > I don't really have much to add here, a couple inline comments. I think
> > this could work.
> >
> > >
> > > This proposal takes a prescriptive approach instead of a descriptive approach.
> > > Drivers describe the available hardware blocks in terms of low-level
> > > mathematical operations, then user-space configures each block. We decided
> > > against a descriptive approach where user-space would provide a high-level
> > > description of the colorspace and other parameters: we want to give more
> > > control and flexibility to user-space, e.g. to be able to replicate exactly the
> > > color pipeline with shaders and switch between shaders and KMS pipelines
> > > seamlessly, and to avoid forcing user-space into a particular color management
> > > policy.
> > >
> > > We've decided against mirroring the existing CRTC properties
> > > DEGAMMA_LUT/CTM/GAMMA_LUT onto KMS planes. Indeed, the color management
> > > pipeline can significantly differ between vendors and this approach cannot
> > > accurately abstract all hardware. In particular, the availability, ordering and
> > > capabilities of hardware blocks are different on each display engine. So, we've
> > > decided to go for a highly detailed hardware capability discovery.
> > >
> > > This new uAPI should not be in conflict with existing standard KMS properties,
> > > since there are none which control the pre-blending color pipeline at the
> > > moment. It does conflict with any vendor-specific properties like
> > > NV_INPUT_COLORSPACE or the patches on the mailing list adding AMD-specific
> > > properties. Drivers will need to either reject atomic commits configuring both
> > > uAPIs, or alternatively we could add a DRM client cap which hides the vendor
> > > properties and shows the new generic properties when enabled.
> > >
> > > To use this uAPI, first user-space needs to discover hardware capabilities via
> > > KMS objects and properties, then user-space can configure the hardware via an
> > > atomic commit. This works similarly to the existing KMS uAPI, e.g. planes.
> > >
> > > Our proposal introduces a new "color_pipeline" plane property, and a new KMS
> > > object type, "COLOROP" (short for color operation). The "color_pipeline" plane
> > > property is an enum, each enum entry represents a color pipeline supported by
> > > the hardware. The special zero entry indicates that the pipeline is in
> > > "bypass"/"no-op" mode. For instance, the following plane properties describe a
> > > primary plane with 2 supported pipelines but currently configured in bypass
> > > mode:
> > >
> > >     Plane 10
> > >     ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary
> > >     ├─ …
> > >     └─ "color_pipeline": enum {0, 42, 52} = 0
> > >
> > > The non-zero entries describe color pipelines as a linked list of COLOROP KMS
> > > objects. The entry value is an object ID pointing to the head of the linked
> > > list (the first operation in the color pipeline).
> > >
> > > The new COLOROP objects also expose a number of KMS properties. Each has a
> > > type, a reference to the next COLOROP object in the linked list, and other
> > > type-specific properties. Here is an example for a 1D LUT operation:
> > >
> > >     Color operation 42
> > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > >     ├─ "1d_curve_type": enum {LUT, sRGB, PQ, BT.709, HLG, …} = LUT
> > >     ├─ "lut_size": immutable range = 4096
> > >     ├─ "lut_data": blob
> > >     └─ "next": immutable color operation ID = 43
> > >
> > > To configure this hardware block, user-space can fill a KMS blob with 4096 u32
> > > entries, then set "lut_data" to the blob ID. Other color operation types might
> > > have different properties.
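
To check my reading of the linked-list layout, here is a rough user-space
sketch in Python. It is not libdrm code: the dict layout, the object IDs
after 43 and the function name are made up for illustration, and a real
client would read these values from KMS object properties. It also assumes
that a "next" value of 0 terminates the list, as in the AMD example further
down.

```python
# Hypothetical in-memory model of COLOROP objects; a real client would
# query these as KMS object properties instead.
COLOROPS = {
    42: {"type": "1D curve", "next": 43},
    43: {"type": "Scaling", "next": 44},
    44: {"type": "Matrix", "next": 0},
}


def walk_pipeline(head_id, colorops):
    """Follow "next" links from the head until the 0 terminator."""
    ops = []
    seen = set()
    op_id = head_id
    while op_id != 0:
        if op_id in seen:
            # Defensive check: a well-formed pipeline should never loop.
            raise ValueError(f"cycle at color operation {op_id}")
        seen.add(op_id)
        ops.append((op_id, colorops[op_id]["type"]))
        op_id = colorops[op_id]["next"]
    return ops
```

Calling walk_pipeline() with the value of "color_pipeline" would then give
the ordered list of operations, and an empty list for the bypass entry 0.
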
> > >
> > > Here is another example with a 3D LUT:
> > >
> > >     Color operation 42
> > >     ├─ "type": enum {Bypass, 3D LUT} = 3D LUT
> > >     ├─ "lut_size": immutable range = 33
> > >     ├─ "lut_data": blob
> > >     └─ "next": immutable color operation ID = 43
> > >
> > > And one last example with a matrix:
> > >
> > >     Color operation 42
> > >     ├─ "type": enum {Bypass, Matrix} = Matrix
> > >     ├─ "matrix_data": blob
> > >     └─ "next": immutable color operation ID = 43
> > >
> > > [Simon note: having "Bypass" in the "type" enum, and making "type" mutable is
> > > a bit weird. Maybe we can just add an "active"/"bypass" boolean property on
> > > blocks which can be bypassed instead.]
> > >
> > > [Jonas note: perhaps a single "data" property for both LUTs and matrices
> > > would make more sense. And a "size" prop for both 1D and 3D LUTs.]
> > >
> > > If some hardware supports re-ordering operations in the color pipeline, the
> > > driver can expose multiple pipelines with different operation ordering, and
> > > user-space can pick the ordering it prefers by selecting the right pipeline.
> > > The same scheme can be used to expose hardware blocks supporting multiple
> > > precision levels.
> > >
> > > That's pretty much all there is to it, but as always the devil is in the
> > > details.
> > >
> > > First, we realized that we need a way to indicate where the scaling operation
> > > is happening. The contents of the framebuffer attached to the plane might be
> > > scaled up or down depending on the CRTC_W and CRTC_H properties. Depending on
> > > the colorspace scaling is applied in, the result will be different, so we need
> > > a way for the kernel to indicate which hardware blocks are pre-scaling, and
> > > which ones are post-scaling. We introduce a special "scaling" operation type,
> > > which is part of the pipeline like other operations but serves an informational
> > > role only (effectively, the operation cannot be configured by user-space, all
> > > of its properties are immutable). For example:
> > >
> > >     Color operation 43
> > >     ├─ "type": immutable enum {Scaling} = Scaling
> > >     └─ "next": immutable color operation ID = 44
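
If I understand the informational "Scaling" operation correctly,
user-space could use it to split a pipeline into its pre-scaling and
post-scaling halves, roughly like the Python sketch below. The function
name and the plain list-of-strings representation are assumptions for
illustration only.

```python
def split_at_scaling(op_types):
    """Split op types (in pipeline order) around the "Scaling" marker.

    Returns (pre_scaling, post_scaling), or None when the pipeline has no
    "Scaling" entry and the position of scaling is therefore unknown.
    """
    if "Scaling" not in op_types:
        return None
    idx = op_types.index("Scaling")
    return list(op_types[:idx]), list(op_types[idx + 1:])
```

A compositor that wants to scale in a particular colorspace could then
check that the operations it needs before scaling all land in the first
half.
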
> >
> > I like this.
> >
> > >
> > > [Simon note: an alternative would be to split the color pipeline into two, by
> > > having two plane properties ("color_pipeline_pre_scale" and
> > > "color_pipeline_post_scale") instead of a single one. This would be similar to
> > > the way we want to split pre-blending and post-blending. This could be less
> > > expressive for drivers, there may be hardware where there are dependencies
> > > between the pre- and post-scaling pipeline?]
> > >
> > > Then, Alex from NVIDIA described how their hardware works. NVIDIA hardware
> > > contains some fixed-function blocks which convert from LMS to ICtCp and cannot
> > > be disabled/bypassed. NVIDIA hardware has been designed for descriptive APIs
> > > where user-space provides a high-level description of the colorspace
> > > conversions it needs to perform, and this is at odds with our KMS uAPI
> > > proposal. To address this issue, we suggest adding a special block type which
> > > describes a fixed conversion from one colorspace to another and cannot be
> > > configured by user-space. Then user-space will need to accommodate its pipeline
> > > for these special blocks. Such fixed hardware blocks need to be well enough
> > > documented so that they can be implemented via shaders.
> > >
> > > We also noted that it should always be possible for user-space to completely
> > > disable the color pipeline and switch back to bypass/identity without a
> > > modeset. Some drivers will need to fail atomic commits for some color
> > > pipelines, in particular for some specific LUT payloads. For instance, AMD
> > > doesn't support curves which are too steep, and Intel doesn't support curves
> > > which decrease. This isn't something which routinely happens, but there might
> > > be more cases where the hardware needs to reject the pipeline. Thus, when
> > > user-space has a running KMS color pipeline, then hits a case where the
> > > pipeline cannot keep running (gets rejected by the driver), user-space needs to
> > > be able to immediately fall back to shaders without any glitch. This doesn't
> > > seem to be an issue for AMD, Intel and NVIDIA.
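
As a rough illustration of the fallback logic described here, a compositor
could pre-screen a LUT for the "curves which decrease" case before even
trying KMS. The Python sketch below uses hypothetical helper names. It is
only a user-space guess: the driver stays the authority, so a real client
would still need an atomic commit with DRM_MODE_ATOMIC_TEST_ONLY and would
fall back to shaders if that fails.

```python
def lut_is_non_decreasing(lut):
    """True if no entry is smaller than the one before it."""
    return all(a <= b for a, b in zip(lut, lut[1:]))


def choose_path(lut):
    """Pick "kms" or "shaders" for this LUT payload.

    This only screens out one known rejection reason (decreasing curves);
    the driver may still reject the pipeline for other reasons.
    """
    return "kms" if lut_is_non_decreasing(lut) else "shaders"
```
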
> > >
> > > This uAPI is extensible: we can add more color operations, and we can add more
> > > properties for each color operation type. For instance, we might want to add
> > > support for Intel piece-wise linear (PWL) 1D curves, or might want to advertise
> > > the effective precision of the LUTs. The uAPI is deliberately somewhat minimal
> > > to keep the scope of the proposal manageable.
> > >
> > > Later on, we plan to re-use the same machinery for post-blending color
> > > pipelines. There are some more details about post-blending which have been
> > > separately debated at the hackfest, but we believe it's a viable plan. This
> > > solution would supersede the existing DEGAMMA_LUT/CTM/GAMMA_LUT properties, so
> > > we'd like to introduce a client cap to hide the old properties and show the new
> > > post-blending color pipeline properties.
> > >
> > > We envision a future user-space library to translate a high-level descriptive
> > > color pipeline into low-level prescriptive KMS color pipeline ("libliftoff but
> > > for color pipelines"). The library could also offer a translation into shaders.
> > > This should help share more infrastructure between compositors and ease KMS
> > > offloading. This should also help deal with the NVIDIA case.
> > >
> > > To wrap things up, let's take a real-world example: how would gamescope [2]
> > > configure the AMD DCN 3.0 hardware for its color pipeline? The gamescope color
> > > pipeline is described in [3]. The AMD DCN 3.0 hardware is described in [4].
> > >
> > > AMD would expose the following objects and properties:
> > >
> > >     Plane 10
> > >     ├─ "type": immutable enum {Overlay, Primary, Cursor} = Primary
> > >     └─ "color_pipeline": enum {0, 42} = 0
> > >     Color operation 42 (input CSC)
> > >     ├─ "type": enum {Bypass, Matrix} = Matrix
> > >     ├─ "matrix_data": blob
> > >     └─ "next": immutable color operation ID = 43
> > >     Color operation 43
> > >     ├─ "type": enum {Scaling} = Scaling
> > >     └─ "next": immutable color operation ID = 44
> > >     Color operation 44 (DeGamma)
> > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > >     ├─ "1d_curve_type": enum {sRGB, PQ, …} = sRGB
> > >     └─ "next": immutable color operation ID = 45
> 
> Some vendors have per-tap degamma and some have a degamma after the sample.
> How do we distinguish that behaviour?
> It is important to know.
> 
> > >     Color operation 45 (gamut remap)
> > >     ├─ "type": enum {Bypass, Matrix} = Matrix
> > >     ├─ "matrix_data": blob
> > >     └─ "next": immutable color operation ID = 46
> > >     Color operation 46 (shaper LUT RAM)
> > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > >     ├─ "1d_curve_type": enum {LUT} = LUT
> > >     ├─ "lut_size": immutable range = 4096
> > >     ├─ "lut_data": blob
> > >     └─ "next": immutable color operation ID = 47
> > >     Color operation 47 (3D LUT RAM)
> > >     ├─ "type": enum {Bypass, 3D LUT} = 3D LUT
> > >     ├─ "lut_size": immutable range = 17
> > >     ├─ "lut_data": blob
> > >     └─ "next": immutable color operation ID = 48
> > >     Color operation 48 (blend gamma)
> > >     ├─ "type": enum {Bypass, 1D curve} = 1D curve
> > >     ├─ "1d_curve_type": enum {LUT, sRGB, PQ, …} = LUT
> > >     ├─ "lut_size": immutable range = 4096
> > >     ├─ "lut_data": blob
> > >     └─ "next": immutable color operation ID = 0
> > >
> > > To configure the pipeline for an HDR10 PQ plane (path at the top) and an HDR
> > > display, gamescope would perform an atomic commit with the following property
> > > values:
> > >
> > >     Plane 10
> > >     └─ "color_pipeline" = 42
> > >     Color operation 42 (input CSC)
> > >     └─ "matrix_data" = PQ → scRGB (TF)
> 
> ^
> Not sure what this is.
> We don't use an input CSC before degamma.
> 
> > >     Color operation 44 (DeGamma)
> > >     └─ "type" = Bypass
> 
> ^
> If we did PQ, this would be PQ -> Linear / 80
> If this was sRGB, it'd be sRGB -> Linear
> If this was scRGB this would be just treating it as it is. So... Linear / 80.
> 
> > >     Color operation 45 (gamut remap)
> > >     └─ "matrix_data" = scRGB (TF) → PQ
> 
> ^
> This is wrong, we just use this to do scRGB primaries (709) to 2020.
> 
> We then go from scRGB -> PQ to go into our shaper + 3D LUT.
> 
> > >     Color operation 46 (shaper LUT RAM)
> > >     └─ "lut_data" = PQ → Display native
> 
> ^
> "Display native" is just the response curve of the display.
> In HDR10, this would just be PQ -> PQ
> If we were doing HDR10 on SDR, this would be PQ -> Gamma 2.2 (mapped
> from 0 to display native luminance) [with a potential bit of headroom
> for tonemapping in the 3D LUT]
> For SDR on HDR10 this would be Gamma 2.2 -> PQ (Not intending to start
> an sRGB vs G2.2 argument here! :P)
> 
> > >     Color operation 47 (3D LUT RAM)
> > >     └─ "lut_data" = Gamut mapping + tone mapping + night mode
> > >     Color operation 48 (blend gamma)
> > >     └─ "1d_curve_type" = PQ
> 
> ^
> This is wrong, this should be Display Native -> Linearized Display Referred

This is a good point to discuss. As I understand it, for the HDR10 case
we are just setting an enumerated TF (PQ in this case, correct me if I
got it wrong). But, unlike with a user LUT, the API does not tell us
whether an enumerated TF with an empty LUT is applied in the
linearizing/degamma direction or in the inverse one. Perhaps the TF and
its direction could come as a pair? Any ideas?
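
To show why the direction matters, here is a small Python sketch of the PQ
curve in both directions, using the SMPTE ST 2084 constants. The
"decode"/"encode" labels and the (curve, direction) lookup are purely
hypothetical naming on my side, not something the RFC defines.

```python
# SMPTE ST 2084 (PQ) constants.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32


def pq_eotf(e):
    """PQ code value in [0, 1] -> luminance in [0, 1] (1.0 = 10000 cd/m2)."""
    p = e ** (1 / M2)
    return (max(p - C1, 0.0) / (C2 - C3 * p)) ** (1 / M1)


def pq_inv_eotf(y):
    """Luminance in [0, 1] (1.0 = 10000 cd/m2) -> PQ code value in [0, 1]."""
    p = y ** M1
    return ((C1 + C2 * p) / (1 + C3 * p)) ** M2


# Hypothetical pairing of an enumerated curve with its direction.
CURVES = {
    ("PQ", "decode"): pq_eotf,
    ("PQ", "encode"): pq_inv_eotf,
}


def apply_enumerated_tf(curve, direction, x):
    """Apply the enumerated curve in the requested direction."""
    return CURVES[(curve, direction)](x)
```

The same enumerated value "PQ" maps an input of 0.5 to very different
outputs depending on the direction, which is the ambiguity I am worried
about.
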

> 
> >
> > You cannot do a TF with a matrix, and a gamut remap with a matrix on
> > electrical values is certainly surprising, so the example here is a
> > bit odd, but I don't think that hurts the intention of demonstration.
> 
> I have done some corrections inline.
> 
> You can see our fully correct color pipeline here:
> https://raw.githubusercontent.com/ValveSoftware/gamescope/master/src/docs/Steam%20Deck%20Display%20Pipeline.png
> 
> Please let me know if you have any more questions about our color pipeline.
> 
> >
> > Btw. ISTR that if you want to do scaling properly with alpha channel,
> > you need optical values multiplied by alpha. Alpha vs. scaling is just
> > yet another thing to look into, and TF operations do not work with
> > pre-mult.
> 
> What are your concerns here?
> 
> Having pre-multiplied alpha is fine with a TF: the alpha was
> premultiplied in linear, then encoded with the TF by the client.
> If you think of a TF as something something relative to a bunch of
> reference state or whatever then you might think "oh you can't do
> that!", but you really can.
> It's really best to just think of it as a mathematical encoding of a
> value in all instances that we touch.
> 
> The only issue is that you lose precision from having pre-multiplied
> alpha as it's quantized to fit into the DRM format rather than using
> the full range then getting divided by the alpha at blend time.
> In my experience, however, it never ends up being a visible issue at 8bpc.
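
For reference, one way to put a number on that precision loss is to
quantize the pre-multiplied value and see how far the recovered color
drifts. This is only a back-of-the-envelope Python sketch with a made-up
helper name, and it assumes simple round-to-nearest quantization of a
normalized channel.

```python
def premult_roundtrip_error(color, alpha, bits=8):
    """Quantize color * alpha to `bits` bits, divide by alpha again and
    return the absolute error against the original color.

    `color` is in [0, 1] and `alpha` in (0, 1].
    """
    levels = (1 << bits) - 1
    stored = round(color * alpha * levels) / levels
    return abs(stored / alpha - color)
```

The worst-case error grows as 0.5 / (levels * alpha), so it only becomes
large for small alpha values, where the pixel contributes little to the
blended result anyway.
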
> 
> Thanks
>  - Joshie 🐸✨
> 
> >
> >
> > Thanks,
> > pq
> >
> > >
> > > I hope comparing these properties to the diagrams linked above can help
> > > understand how the uAPI would be used and give an idea of its viability.
> > >
> > > Please feel free to provide feedback! It would be especially useful to have
> > > someone familiar with Arm SoCs look at this, to confirm that this proposal
> > > would work there.
> > >
> > > Unless there is a show-stopper, we plan to follow up this RFC with
> > > implementations for AMD, Intel, NVIDIA, gamescope, and IGT.
> > >
> > > Many thanks to everybody who contributed to the hackfest, on-site or remotely!
> > > Let's work together to make this happen!
> > >
> > > Simon, on behalf of the hackfest participants
> > >
> > > [1]: https://wiki.gnome.org/Hackfests/ShellDisplayNext2023
> > > [2]: https://github.com/ValveSoftware/gamescope
> > > [3]: https://github.com/ValveSoftware/gamescope/blob/5af321724c8b8a29cef5ae9e31293fd5d560c4ec/src/docs/Steam%20Deck%20Display%20Pipeline.png
> > > [4]: https://kernel.org/doc/html/latest/_images/dcn3_cm_drm_current.svg
> >