Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color pipeline is needed

From: Alex Goins <agoins@nvidia.com>
To: Pekka Paalanen <ppaalanen@gmail.com>
Cc: "Sasha McIntosh" <sashamcintosh@google.com>,
	"Liviu Dudau" <Liviu.Dudau@arm.com>,
	"Victoria Brekenfeld" <victoria@system76.com>,
	dri-devel@lists.freedesktop.org,
	"Michel Dänzer" <mdaenzer@redhat.com>,
	"Arthur Grillo" <arthurgrillo@riseup.net>,
	"Christopher Braga" <quic_cbraga@quicinc.com>,
	"Sebastian Wick" <sebastian.wick@redhat.com>,
	"Shashank Sharma" <shashank.sharma@amd.com>,
	wayland-devel@lists.freedesktop.org,
	"Jonas Ådahl" <jadahl@redhat.com>,
	"Uma Shankar" <uma.shankar@intel.com>,
	"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
	"Naseer Ahmed" <quic_naseer@quicinc.com>,
	"Melissa Wen" <mwen@igalia.com>, "Aleix Pol" <aleixpol@kde.org>,
	"Hector Martin" <marcan@marcan.st>,
	"Xaver Hugl" <xaver.hugl@gmail.com>,
	"Joshua Ashton" <joshua@froggi.es>
Subject: Re: [RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color pipeline is needed
Date: Wed, 25 Oct 2023 15:16:08 -0500 (CDT)	[thread overview]
Message-ID: <bf69b740-dce7-94f1-293d-a4b274b52f55@nvidia.com> (raw)
In-Reply-To: <20231023111219.6e287958@eldfell>

[-- Attachment #1: Type: text/plain, Size: 11920 bytes --]

Thank you Harry and all other contributors for your work on this. Responses
inline -

On Mon, 23 Oct 2023, Pekka Paalanen wrote:

> On Fri, 20 Oct 2023 11:23:28 -0400
> Harry Wentland <harry.wentland@amd.com> wrote:
> 
> > On 2023-10-20 10:57, Pekka Paalanen wrote:
> > > On Fri, 20 Oct 2023 16:22:56 +0200
> > > Sebastian Wick <sebastian.wick@redhat.com> wrote:
> > >   
> > >> Thanks for continuing to work on this!
> > >>
> > >> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote:  
> > >>> v2:
> > >>>  - Update colorop visualizations to match reality (Sebastian, Alex Hung)
> > >>>  - Updated wording (Pekka)
> > >>>  - Change BYPASS wording to make it non-mandatory (Sebastian)
> > >>>  - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property
> > >>>    section (Pekka)
> > >>>  - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa)
> > >>>  - Add "Driver Implementer's Guide" section (Pekka)
> > >>>  - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka)
> > >
> > > ...
> > >
> > >>> +An example of a drm_colorop object might look like one of these::
> > >>> +
> > >>> +    /* 1D enumerated curve */
> > >>> +    Color operation 42
> > >>> +    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve
> > >>> +    ├─ "BYPASS": bool {true, false}
> > >>> +    ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …}
> > >>> +    └─ "NEXT": immutable color operation ID = 43

I know these are just examples, but I would also like to suggest the possibility
of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results
compared to setting an identity in some cases depending on the hardware. See
below for more on this, RE: implicit format conversions.

Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in
offline discussions that it would nonetheless be helpful to expose enumerated
curves in order to hide the vendor-specific complexities of programming
segmented LUTs from clients. In that case, we would simply refer to the
enumerated curve when calculating/choosing segmented LUT entries.

Another thing that came up in offline discussions is that we could use multiple
color operations to program a single operation in hardware. As I understand it,
AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an
"HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but
we could combine them into a singular LUT in software, such that you can combine
e.g. segmented PQ EOTF with night light. One caveat is that you will lose
precision from the custom LUT where it overlaps with the linear section of the
enumerated curve, but that is unavoidable and shouldn't be an issue in most
use-cases.

Actually, the current examples in the proposal don't include a multiplier color
op, which might be useful. For AMD as above, but also for NVIDIA as the
following issue arises:

As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed
point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps
to in floating point varies depending on the source content. If it's SDR
content, we want the max value in FP16 to be 1.0 (80 nits), subject to a
potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ
content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption
is that this is also what AMD's "HDR Multiplier" stage is used for, is that
correct?

From the given enumerated curves, it's not clear how they would map to the
above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16
value of 125.0? That may work, but it tends towards the "descriptive" notion of
assuming the source content, which may not be accurate in all cases. This is
also an issue for the custom 1D LUT, as the blob will need to be converted to
FP16 in order to populate our "degamma" LUT. What should the resulting max FP16
value be, given that we no longer have any hint as to the source content?

I think a multiplier color op solves all of these issues. Named curves and
custom 1D LUTs would by default assume a max FP16 value of 1.0, which would then
be adjusted by the multiplier. For 80 nit SDR content, set it to 1, for 400
nit SDR content, set it to 5, for HDR PQ content, set it to 125, etc. 

> > >>> +
> > >>> +    /* custom 4k entry 1D LUT */
> > >>> +    Color operation 52
> > >>> +    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D LUT
> > >>> +    ├─ "BYPASS": bool {true, false}
> > >>> +    ├─ "LUT_1D_SIZE": immutable range = 4096
> > >>> +    ├─ "LUT_1D": blob
> > >>> +    └─ "NEXT": immutable color operation ID = 0
> > > 
> > > ...
> > >   
> > >>> +Driver Forward/Backward Compatibility
> > >>> +=====================================
> > >>> +
> > >>> +As this is uAPI drivers can't regress color pipelines that have been
> > >>> +introduced for a given HW generation. New HW generations are free to
> > >>> +abandon color pipelines advertised for previous generations.
> > >>> +Nevertheless, it can be beneficial to carry support for existing color
> > >>> +pipelines forward as those will likely already have support in DRM
> > >>> +clients.
> > >>> +
> > >>> +Introducing new colorops to a pipeline is fine, as long as they can be
> > >>> +disabled or are purely informational. DRM clients implementing support
> > >>> +for the pipeline can always skip unknown properties as long as they can
> > >>> +be confident that doing so will not cause unexpected results.
> > >>> +
> > >>> +If a new colorop doesn't fall into one of the above categories
> > >>> +(bypassable or informational) the modified pipeline would be unusable
> > >>> +for user space. In this case a new pipeline should be defined.    
> > >>
> > >> How can user space detect an informational element? Should we just add a
> > >> BYPASS property to informational elements, make it read only and set to
> > >> true maybe? Or something more descriptive?  
> > > 
> > > Read-only BYPASS set to true would be fine by me, I guess.
> > >   
> > 
> > Don't you mean set to false? An informational element will always do
> > something, so it can't be bypassed.
> 
> Yeah, this is why we need a definition. I understand "informational" to
> not change pixel values in any way. Previously I had some weird idea
> that scaling doesn't alter color, but of course it may.

On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do
implicit fixed-point to FP16 conversions, and vice versa.

For example, the "degamma" LUT towards the beginning of the pipeline implicitly
converts from fixed point to FP16, and some of the following operations expect
to operate in FP16. As such, if you have a fixed point input and don't bypass
those following operations, you *must not* bypass the LUT, even if you are
otherwise just programming it with the identity. Conversely, if you have a
floating point input, you *must* bypass the LUT.

Could informational elements and allowing the exclusion of the BYPASS property
be used to convey this information to the client?  For example, we could expose
one pipeline with the LUT exposed with read-only BYPASS set to false, and
sandwich it with informational "Fixed Point" and "FP16" elements to accommodate
fixed point input. Then, expose another pipeline with the LUT missing, and an
informational "FP16" element in its place to accommodate floating point input.

That's just an example; we also have other operations in the pipeline that do
similar implicit conversions. In these cases we don't want the operations to be
bypassed individually, so instead we would expose them as mandatory in some
pipelines and missing in others, with informational elements to help inform the
client of which to choose. Is that acceptable under the current proposal?

Note that in this case, the information just has to do with what format the
pixels should be in, it doesn't correspond to any specific operation. So, I'm
not sure that BYPASS has any meaning for informational elements in this context.

> > > I think we also need a definition of "informational".
> > > 
> > > Counter-example 1: a colorop that represents a non-configurable  
> > 
> > Not sure what's "counter" for these examples?
> > 
> > > YUV<->RGB conversion. Maybe it determines its operation from FB pixel
> > > format. It cannot be set to bypass, it cannot be configured, and it
> > > will alter color values.

Would it be reasonable to expose this is a 3x4 matrix with a read-only blob and
no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop
based on the principle that read-only blobs could be used to express some static
pipeline elements without the need to define a new type, but got mixed opinions.
I think this demonstrates the principle further, as clients could detect this
programmatically instead of having to special-case the informational element.

> > > 
> > > Counter-example 2: image size scaling colorop. It might not be
> > > configurable, it is controlled by the plane CRTC_* and SRC_*
> > > properties. You still need to understand what it does, so you can
> > > arrange the scaling to work correctly. (Do not want to scale an image
> > > with PQ-encoded values as Josh demonstrated in XDC.)
> > >   
> > 
> > IMO the position of the scaling operation is the thing that's important
> > here as the color pipeline won't define scaling properties.

I agree that blending should ideally be done in linear space, and I remember
that from Josh's presentation at XDC, but I don't recall the same being said for
scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the
pipeline that is meant to be in PQ space (more on this below), and that was
found to achieve better results at HDR/SDR boundaries. Of course, this only
bolsters the argument that it would be helpful to have an informational "scaler"
element to understand at which stage scaling takes place.

> > > Counter-example 3: image sampling colorop. Averages FB originated color
> > > values to produce a color sample. Again do not want to do this with
> > > PQ-encoded values.
> > >   
> > 
> > Wouldn't this only happen during a scaling op?
> 
> There is certainly some overlap between examples 2 and 3. IIRC SRC_X/Y
> coordinates can be fractional, which makes nearest vs. bilinear
> sampling have a difference even if there is no scaling.
> 
> There is also the question of chroma siting with sub-sampled YUV. I
> don't know how that actually works, or how it theoretically should work.

We have some operations in our pipeline that are intended to be static, i.e. a
static matrix that converts from RGB to LMS, and later another that converts
from LMS to ICtCp. There are even LUTs that are intended to be static,
converting from linear to PQ and vice versa. All of this is because the
pre-blending scaler and tone mapping operator are intended to operate in ICtCp
PQ space. Although the stated LUTs and matrices are intended to be static, they
are actually programmable. In offline discussions, it was indicated that it
would be helpful to actually expose the programmability, as opposed to exposing
them as non-bypassable blocks, as some compositors may have novel uses for them.

Despite being programmable, the LUTs are updated in a manner that is less
efficient as compared to e.g. the non-static "degamma" LUT. Would it be helpful
if there was some way to tag operations according to their performance,
for example so that clients can prefer a high performance one when they
intend to do an animated transition? I recall from the XDC HDR workshop
that this is also an issue with AMD's 3DLUT, where updates can be too
slow to animate.

Thanks,
Alex Goins
NVIDIA Linux Driver Team

> Thanks,
> pq
>