dri-devel.lists.freedesktop.org archive mirror
* [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
@ 2023-10-20 15:58 Aravind Iddamsetty
  2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

Our hardware supports RAS (Reliability, Availability, Serviceability) by
reporting errors to the host. The KMD processes them and exposes a set of
error counters that observability tools can use to take corrective
actions or trigger repairs. Traditionally these were exposed via PMU
(for relative counters) and a sysfs interface (for absolute values) in
our internal branch. Because that approach needs two interfaces and
offers neither event-based reporting nor configurability, the community
suggested trying netlink as a drm subsystem wide UAPI for RAS and
telemetry, as discussed in [1].

That discussion [1] is the inspiration for this series. It uses the
generic netlink (genl) family subsystem and exposes a set of commands
that every drm driver can use; the framework also provides a means to
add custom commands. Each drm driver instance (in this example, an xe
driver instance) registers a family and operations with the genl
subsystem, through which it enumerates and reports the error counters.
Event-based notification is also supported: userspace can subscribe and
be notified when an error occurs, then read the error counter, which
avoids continuous polling of the counters. This can also be extended to
threshold-based notification.

[1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
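
As an illustration of the userspace side: since each device registers
its own genl family named after the drm primary node (e.g. "card1"), a
client first has to resolve that name to a numeric family id through the
generic netlink controller. The sketch below only builds that
CTRL_CMD_GETFAMILY request with the standard uapi headers; it is not
taken from the drm_ras tool, and build_getfamily_request() is a
hypothetical helper name. Sending the buffer over a NETLINK_GENERIC
socket and parsing CTRL_ATTR_FAMILY_ID from the reply are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Hypothetical helper: fill `buf` with a CTRL_CMD_GETFAMILY request for
 * the genl family called `family` (for this series, the drm primary node
 * name such as "card1"). `buf` must be 4-byte aligned.
 * Returns the message length in bytes, or 0 if it does not fit.
 */
size_t build_getfamily_request(void *buf, size_t buf_len,
			       const char *family, __u32 seq)
{
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gnlh;
	struct nlattr *nla;
	size_t name_len, attr_len, total;

	assert(buf && family);

	name_len = strlen(family) + 1;		/* include the trailing NUL */
	attr_len = NLA_HDRLEN + name_len;	/* attribute header + payload */
	total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);

	if (name_len > GENL_NAMSIZ || total > buf_len)
		return 0;

	memset(buf, 0, total);

	/* netlink header: addressed to the genl controller family */
	nlh->nlmsg_len = total;
	nlh->nlmsg_type = GENL_ID_CTRL;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	/* generic netlink header */
	gnlh = NLMSG_DATA(nlh);
	gnlh->cmd = CTRL_CMD_GETFAMILY;
	gnlh->version = 1;	/* version value is an assumption here */

	/* single attribute: the family name to look up */
	nla = (struct nlattr *)((char *)gnlh + GENL_HDRLEN);
	nla->nla_type = CTRL_ATTR_FAMILY_NAME;
	nla->nla_len = attr_len;
	memcpy((char *)nla + NLA_HDRLEN, family, name_len);

	return total;
}
```

The family id returned by the controller is then used as nlmsg_type in
the per-device requests, which matches how the kernel side looks up the
drm_device from info->nlhdr->nlmsg_type.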

This series is on top of https://patchwork.freedesktop.org/series/125373/.

v4:
1. Rebase
2. rename drm_genl_send to drm_genl_reply
3. catch error from xa_store and handle appropriately
4. presently xe_list_errors fills blank data for IGFX, prevent it by
having an early check of IS_DGFX (Michael J. Ruhl)

v3:
1. Rebase on latest RAS series for XE
2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
register to netlink subsystem

v2: define common interfaces to genl netlink subsystem that all drm drivers
can leverage.

Below is an example tool drm_ras which demonstrates the use of the
supported commands. The tool will be sent to ML with the subject
"[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
https://patchwork.freedesktop.org/series/118437/#rev2

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name                                                    config-id               counter

error-gt0-correctable-guc                               0x0000000000000001      0
error-gt0-correctable-slm                               0x0000000000000003      0
error-gt0-correctable-eu-ic                             0x0000000000000004      0
error-gt0-correctable-eu-grf                            0x0000000000000005      0
error-gt0-fatal-guc                                     0x0000000000000009      0
error-gt0-fatal-slm                                     0x000000000000000d      0
error-gt0-fatal-eu-grf                                  0x000000000000000f      0
error-gt0-fatal-fpu                                     0x0000000000000010      0
error-gt0-fatal-tlb                                     0x0000000000000011      0
error-gt0-fatal-l3-fabric                               0x0000000000000012      0
error-gt0-correctable-subslice                          0x0000000000000013      0
error-gt0-correctable-l3bank                            0x0000000000000014      0
error-gt0-fatal-subslice                                0x0000000000000015      0
error-gt0-fatal-l3bank                                  0x0000000000000016      0
error-gt0-sgunit-correctable                            0x0000000000000017      0
error-gt0-sgunit-nonfatal                               0x0000000000000018      0
error-gt0-sgunit-fatal                                  0x0000000000000019      0
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
error-gt0-soc-fatal-punit                               0x000000000000001d      0
error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
error-gt1-correctable-guc                               0x1000000000000001      0
error-gt1-correctable-slm                               0x1000000000000003      0
error-gt1-correctable-eu-ic                             0x1000000000000004      0
error-gt1-correctable-eu-grf                            0x1000000000000005      0
error-gt1-fatal-guc                                     0x1000000000000009      0
error-gt1-fatal-slm                                     0x100000000000000d      0
error-gt1-fatal-eu-grf                                  0x100000000000000f      0
error-gt1-fatal-fpu                                     0x1000000000000010      0
error-gt1-fatal-tlb                                     0x1000000000000011      0
error-gt1-fatal-l3-fabric                               0x1000000000000012      0
error-gt1-correctable-subslice                          0x1000000000000013      0
error-gt1-correctable-l3bank                            0x1000000000000014      0
error-gt1-fatal-subslice                                0x1000000000000015      0
error-gt1-fatal-l3bank                                  0x1000000000000016      0
error-gt1-sgunit-correctable                            0x1000000000000017      0
error-gt1-sgunit-nonfatal                               0x1000000000000018      0
error-gt1-sgunit-fatal                                  0x1000000000000019      0
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
error-gt1-soc-fatal-punit                               0x100000000000001d      0
error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
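
A note on reading the config-id column: in the listing above, gt1 ids
differ from the matching gt0 ids only in the top hex digit. The sketch
below decodes an id under the assumption that the GT index sits in the
upper 4 bits (shift of 60) and the error index in the remaining bits.
That layout is inferred from this output alone, which only shows gt0 and
gt1; the authoritative definition is in the uapi header added by the
series. The CFG_GT_SHIFT macro and the cfg_* helpers are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption inferred from the listing: GT index lives in bits 63:60. */
#define CFG_GT_SHIFT	60
#define CFG_ERROR_MASK	((UINT64_C(1) << CFG_GT_SHIFT) - 1)

/* Extract the GT index from a config-id. */
unsigned int cfg_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> CFG_GT_SHIFT);
}

/* Extract the per-GT error index from a config-id. */
uint64_t cfg_error(uint64_t config_id)
{
	return config_id & CFG_ERROR_MASK;
}

/* Compose a config-id from a GT index and an error index. */
uint64_t cfg_make(unsigned int gt, uint64_t error)
{
	assert(gt < 16 && error <= CFG_ERROR_MASK);
	return ((uint64_t)gt << CFG_GT_SHIFT) | error;
}
```

Under that assumption, cfg_make(1, 0x05) yields 0x1000000000000005, the
id listed for error-gt1-correctable-eu-grf, and the value passed as
--error_id to READ_ONE can be built the same way.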

wait on an error event:

$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0

list all errors:

$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name                                                    config-id

error-gt0-correctable-guc                               0x0000000000000001
error-gt0-correctable-slm                               0x0000000000000003
error-gt0-correctable-eu-ic                             0x0000000000000004
error-gt0-correctable-eu-grf                            0x0000000000000005
error-gt0-fatal-guc                                     0x0000000000000009
error-gt0-fatal-slm                                     0x000000000000000d
error-gt0-fatal-eu-grf                                  0x000000000000000f
error-gt0-fatal-fpu                                     0x0000000000000010
error-gt0-fatal-tlb                                     0x0000000000000011
error-gt0-fatal-l3-fabric                               0x0000000000000012
error-gt0-correctable-subslice                          0x0000000000000013
error-gt0-correctable-l3bank                            0x0000000000000014
error-gt0-fatal-subslice                                0x0000000000000015
error-gt0-fatal-l3bank                                  0x0000000000000016
error-gt0-sgunit-correctable                            0x0000000000000017
error-gt0-sgunit-nonfatal                               0x0000000000000018
error-gt0-sgunit-fatal                                  0x0000000000000019
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
error-gt0-soc-fatal-punit                               0x000000000000001d
error-gt0-soc-fatal-psf-0                               0x000000000000001e
error-gt0-soc-fatal-psf-1                               0x000000000000001f
error-gt0-soc-fatal-psf-2                               0x0000000000000020
error-gt0-soc-fatal-cd0                                 0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
error-gt1-correctable-guc                               0x1000000000000001
error-gt1-correctable-slm                               0x1000000000000003
error-gt1-correctable-eu-ic                             0x1000000000000004
error-gt1-correctable-eu-grf                            0x1000000000000005
error-gt1-fatal-guc                                     0x1000000000000009
error-gt1-fatal-slm                                     0x100000000000000d
error-gt1-fatal-eu-grf                                  0x100000000000000f
error-gt1-fatal-fpu                                     0x1000000000000010
error-gt1-fatal-tlb                                     0x1000000000000011
error-gt1-fatal-l3-fabric                               0x1000000000000012
error-gt1-correctable-subslice                          0x1000000000000013
error-gt1-correctable-l3bank                            0x1000000000000014
error-gt1-fatal-subslice                                0x1000000000000015
error-gt1-fatal-l3bank                                  0x1000000000000016
error-gt1-sgunit-correctable                            0x1000000000000017
error-gt1-sgunit-nonfatal                               0x1000000000000018
error-gt1-sgunit-fatal                                  0x1000000000000019
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
error-gt1-soc-fatal-punit                               0x100000000000001d
error-gt1-soc-fatal-psf-0                               0x100000000000001e
error-gt1-soc-fatal-psf-1                               0x100000000000001f
error-gt1-soc-fatal-psf-2                               0x1000000000000020
error-gt1-soc-fatal-cd0                                 0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Cc: Kuehling Felix <Felix.Kuehling@amd.com>
Cc: Tuikov Luben <Luben.Tuikov@amd.com>
Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>


Aravind Iddamsetty (5):
  drm/netlink: Add netlink infrastructure
  drm/xe/RAS: Register netlink capability
  drm/xe/RAS: Expose the error counters
  drm/netlink: Define multicast groups
  drm/xe/RAS: send multicast event on occurrence of an error

 drivers/gpu/drm/Makefile             |   1 +
 drivers/gpu/drm/drm_drv.c            |   7 +
 drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
 drivers/gpu/drm/xe/Makefile          |   1 +
 drivers/gpu/drm/xe/xe_device.c       |   4 +
 drivers/gpu/drm/xe/xe_device_types.h |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
 drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
 include/drm/drm_device.h             |   8 +
 include/drm/drm_drv.h                |   7 +
 include/drm/drm_netlink.h            |  35 ++
 include/uapi/drm/drm_netlink.h       |  87 +++++
 include/uapi/drm/xe_drm.h            |  81 +++++
 13 files changed, 977 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 include/drm/drm_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

-- 
2.25.1



* [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:36   ` Ruhl, Michael J
  2023-11-10 12:24   ` Tomer Tayar
  2023-10-20 15:58 ` [RFC v2 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

Define the netlink registration interface, along with commands and
attributes that can be used in common across drm drivers. This patch
uses the generic netlink family to expose various device statistics. At
present it defines commands for exposing RAS error counters.

v2:
define common interfaces to genl netlink subsystem that all drm drivers
can leverage.(Tomer Tayar)

v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
register to netlink subsystem (Daniel Vetter)

v4:(Michael J. Ruhl)
1. rename drm_genl_send to drm_genl_reply
2. catch error from xa_store and handle appropriately

Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Michael J. Ruhl <michael.j.ruhl@intel.com>

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/Makefile       |   1 +
 drivers/gpu/drm/drm_drv.c      |   7 ++
 drivers/gpu/drm/drm_netlink.c  | 188 +++++++++++++++++++++++++++++++++
 include/drm/drm_device.h       |   8 ++
 include/drm/drm_drv.h          |   7 ++
 include/drm/drm_netlink.h      |  30 ++++++
 include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
 7 files changed, 324 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_netlink.c
 create mode 100644 include/drm/drm_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index ee64c51274ad..60864369adaa 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -35,6 +35,7 @@ drm-y := \
 	drm_mode_object.o \
 	drm_modes.o \
 	drm_modeset_lock.o \
+	drm_netlink.o \
 	drm_plane.o \
 	drm_prime.o \
 	drm_print.o \
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 535f16e7882e..31f55c1c7524 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
 	if (ret)
 		goto err_minors;
 
+	if (driver->genl_ops) {
+		ret = drm_genl_register(dev);
+		if (ret)
+			goto err_minors;
+	}
+
 	ret = create_compat_control_link(dev);
 	if (ret)
 		goto err_minors;
@@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
 {
 	drm_privacy_screen_lookup_exit();
 	accel_core_exit();
+	drm_genl_exit();
 	unregister_chrdev(DRM_MAJOR, "drm");
 	debugfs_remove(drm_debugfs_root);
 	drm_sysfs_destroy();
diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
new file mode 100644
index 000000000000..8add249c1da3
--- /dev/null
+++ b/drivers/gpu/drm/drm_netlink.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <drm/drm_device.h>
+#include <drm/drm_drv.h>
+#include <drm/drm_file.h>
+#include <drm/drm_managed.h>
+#include <drm/drm_netlink.h>
+#include <drm/drm_print.h>
+
+DEFINE_XARRAY(drm_dev_xarray);
+
+/**
+ * drm_genl_reply - response to a request
+ * @msg: socket buffer
+ * @info: receiver information
+ * @usrhdr: pointer to user specific header in the message buffer
+ *
+ * RETURNS:
+ * 0 on success and negative error code on failure
+ */
+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
+{
+	int ret;
+
+	genlmsg_end(msg, usrhdr);
+
+	ret = genlmsg_reply(msg, info);
+	if (ret)
+		nlmsg_free(msg);
+
+	return ret;
+}
+EXPORT_SYMBOL(drm_genl_reply);
+
+/**
+ * drm_genl_alloc_msg - allocate genl message buffer
+ * @dev: drm_device for which the message is being allocated
+ * @info: receiver information
+ * @usrhdr: pointer to user specific header in the message buffer
+ *
+ * RETURNS:
+ * pointer to new allocated buffer on success, NULL on failure
+ */
+struct sk_buff *
+drm_genl_alloc_msg(struct drm_device *dev,
+		   struct genl_info *info,
+		   size_t msg_size, void **usrhdr)
+{
+	struct sk_buff *new_msg;
+
+	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!new_msg)
+		return new_msg;
+
+	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
+	if (!*usrhdr) {
+		nlmsg_free(new_msg);
+		new_msg = NULL;
+	}
+
+	return new_msg;
+}
+EXPORT_SYMBOL(drm_genl_alloc_msg);
+
+static struct drm_device *genl_to_dev(struct genl_info *info)
+{
+	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
+}
+
+static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
+{
+	struct drm_device *dev = genl_to_dev(info);
+
+	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
+		return -EINVAL;
+
+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
+		return -EOPNOTSUPP;
+
+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
+}
+
+static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
+{
+	struct drm_device *dev = genl_to_dev(info);
+
+	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
+		return -EINVAL;
+
+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
+		return -EOPNOTSUPP;
+
+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
+}
+
+/* attribute policies */
+static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
+	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
+	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
+};
+
+/* drm genl operations definition */
+const struct genl_ops drm_genl_ops[] = {
+	{
+		.cmd = DRM_RAS_CMD_QUERY,
+		.doit = drm_genl_list_errors,
+		.policy = drm_attr_policy_query,
+	},
+	{
+		.cmd = DRM_RAS_CMD_READ_ONE,
+		.doit = drm_genl_read_error,
+		.policy = drm_attr_policy_read_one,
+	},
+	{
+		.cmd = DRM_RAS_CMD_READ_ALL,
+		.doit = drm_genl_list_errors,
+		.policy = drm_attr_policy_query,
+	},
+};
+
+static void drm_genl_family_init(struct drm_device *dev)
+{
+	/* Use drm primary node name eg: card0 to name the genl family */
+	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);
+	dev->drm_genl_family.version = DRM_GENL_VERSION;
+	dev->drm_genl_family.parallel_ops = true;
+	dev->drm_genl_family.ops = drm_genl_ops;
+	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
+	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
+	dev->drm_genl_family.module = dev->dev->driver->owner;
+}
+
+static void drm_genl_deregister(struct drm_device *dev, void *arg)
+{
+	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
+
+	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
+
+	genl_unregister_family(&dev->drm_genl_family);
+}
+
+/**
+ * drm_genl_register - Register genl family
+ * @dev: drm_device for which genl family needs to be registered
+ *
+ * RETURNS:
+ * 0 on success and negative error code on failure
+ */
+int drm_genl_register(struct drm_device *dev)
+{
+	int ret;
+
+	drm_genl_family_init(dev);
+
+	ret = genl_register_family(&dev->drm_genl_family);
+	if (ret < 0) {
+		drm_warn(dev, "genl family registration failed\n");
+		return ret;
+	}
+
+	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
+
+	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
+	if (ret)
+		goto genl_unregister;
+
+	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
+
+	return ret;
+
+genl_unregister:
+	genl_unregister_family(&dev->drm_genl_family);
+	return ret;
+}
+
+/**
+ * drm_genl_exit - destroy drm_dev_xarray
+ */
+void drm_genl_exit(void)
+{
+	xa_destroy(&drm_dev_xarray);
+}
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c490977ee250..d3ae91b7714d 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -8,6 +8,7 @@
 
 #include <drm/drm_legacy.h>
 #include <drm/drm_mode_config.h>
+#include <drm/drm_netlink.h>
 
 struct drm_driver;
 struct drm_minor;
@@ -318,6 +319,13 @@ struct drm_device {
 	 */
 	struct dentry *debugfs_root;
 
+	/**
+	 * @drm_genl_family:
+	 *
+	 * Generic netlink family registration structure.
+	 */
+	struct genl_family drm_genl_family;
+
 	/* Everything below here is for legacy driver, never use! */
 	/* private: */
 #if IS_ENABLED(CONFIG_DRM_LEGACY)
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index e2640dc64e08..ebdb7850d235 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -434,6 +434,13 @@ struct drm_driver {
 	 */
 	const struct file_operations *fops;
 
+	/**
+	 * @genl_ops:
+	 *
+	 * Driver's private callbacks for genl commands
+	 */
+	const struct driver_genl_ops *genl_ops;
+
 #ifdef CONFIG_DRM_LEGACY
 	/* Everything below here is for legacy driver, never use! */
 	/* private: */
diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
new file mode 100644
index 000000000000..54527dae7847
--- /dev/null
+++ b/include/drm/drm_netlink.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#ifndef __DRM_NETLINK_H__
+#define __DRM_NETLINK_H__
+
+#include <linux/netdevice.h>
+#include <net/genetlink.h>
+#include <net/sock.h>
+#include <uapi/drm/drm_netlink.h>
+
+struct drm_device;
+
+struct driver_genl_ops {
+	int		       (*doit)(struct drm_device *dev,
+				       struct sk_buff *skb,
+				       struct genl_info *info);
+};
+
+int drm_genl_register(struct drm_device *dev);
+void drm_genl_exit(void);
+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
+struct sk_buff *
+drm_genl_alloc_msg(struct drm_device *dev,
+		   struct genl_info *info,
+		   size_t msg_size, void **usrhdr);
+#endif
+
diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
new file mode 100644
index 000000000000..aab42147a20e
--- /dev/null
+++ b/include/uapi/drm/drm_netlink.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright 2023 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _DRM_NETLINK_H_
+#define _DRM_NETLINK_H_
+
+#define DRM_GENL_VERSION 1
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/**
+ * enum drm_genl_error_cmds - Supported error commands
+ *
+ */
+enum drm_genl_error_cmds {
+	DRM_CMD_UNSPEC,
+	/** @DRM_RAS_CMD_QUERY: Command to list all error names with their config-ids */
+	DRM_RAS_CMD_QUERY,
+	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
+	DRM_RAS_CMD_READ_ONE,
+	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
+	DRM_RAS_CMD_READ_ALL,
+
+	__DRM_CMD_MAX,
+	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
+};
+
+/**
+ * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
+ *
+ */
+enum drm_error_attr {
+	DRM_ATTR_UNSPEC,
+	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
+	/**
+	 * @DRM_RAS_ATTR_REQUEST: Should be used with DRM_RAS_CMD_QUERY,
+	 * DRM_RAS_CMD_READ_ALL
+	 */
+	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
+	/**
+	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
+	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
+	 */
+	DRM_RAS_ATTR_QUERY_REPLY, /* NLA_NESTED */
+	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
+	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
+	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
+	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
+	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
+	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
+
+	__DRM_ATTR_MAX,
+	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
+};
+
+#if defined(__cplusplus)
+}
+#endif
+
+#endif
-- 
2.25.1



* [RFC v2 2/5] drm/xe/RAS: Register netlink capability
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:37   ` Ruhl, Michael J
  2023-10-20 15:58 ` [RFC v3 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

Register netlink capability with DRM and register the driver
callbacks for the DRM RAS netlink commands.
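
As an illustration of the callback registration described above, here is
a minimal userspace sketch of a command-indexed callback table and a
dispatcher that rejects unset entries. It is not the code from this
series: every demo_* name is hypothetical, the callback signature is
simplified (no sk_buff or genl_info), and the bounds and NULL checks are
an assumption about how a caller might guard the lookup.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DRM_RAS_CMD_* command ids. */
enum demo_cmd {
	DEMO_CMD_UNSPEC,
	DEMO_CMD_QUERY,
	DEMO_CMD_READ_ONE,
	DEMO_CMD_READ_ALL,
	__DEMO_CMD_MAX,
};

/* Records which callback ran, so the dispatch can be observed. */
enum demo_handler {
	DEMO_HANDLER_NONE,
	DEMO_HANDLER_LIST,
	DEMO_HANDLER_READ,
};

/* Hypothetical device type standing in for struct drm_device. */
struct demo_dev {
	enum demo_handler last_handler;
};

/* Simplified analogue of struct driver_genl_ops. */
struct demo_genl_ops {
	int (*doit)(struct demo_dev *dev);
};

static int demo_list_errors(struct demo_dev *dev)
{
	dev->last_handler = DEMO_HANDLER_LIST;
	return 0;
}

static int demo_read_error(struct demo_dev *dev)
{
	dev->last_handler = DEMO_HANDLER_READ;
	return 0;
}

/*
 * Table indexed by command id. Sizing it with __DEMO_CMD_MAX keeps every
 * valid command in bounds; entries not listed here stay zeroed (NULL doit).
 */
static const struct demo_genl_ops demo_ops[__DEMO_CMD_MAX] = {
	[DEMO_CMD_QUERY]    = { .doit = demo_list_errors },
	[DEMO_CMD_READ_ONE] = { .doit = demo_read_error },
	[DEMO_CMD_READ_ALL] = { .doit = demo_list_errors },
};

/* Look up the callback for cmd and run it, rejecting anything unset. */
static int demo_dispatch(struct demo_dev *dev,
			 const struct demo_genl_ops *ops, unsigned int cmd)
{
	if (!dev || !ops || cmd >= __DEMO_CMD_MAX || !ops[cmd].doit)
		return -EOPNOTSUPP;

	return ops[cmd].doit(dev);
}
```

Calling demo_dispatch(&dev, demo_ops, DEMO_CMD_QUERY) runs the list
callback, while DEMO_CMD_UNSPEC or an out-of-range id returns
-EOPNOTSUPP without touching the device.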

v2:
Move the netlink registration parts to DRM subsystem (Tomer Tayar)

Cc: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  1 +
 drivers/gpu/drm/xe/xe_device.c       |  4 ++++
 drivers/gpu/drm/xe/xe_device_types.h |  1 +
 drivers/gpu/drm/xe/xe_netlink.c      | 22 ++++++++++++++++++++++
 4 files changed, 28 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index ed772f440689..048f9a23e2f0 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -87,6 +87,7 @@ xe-y += xe_bb.o \
 	xe_mmio.o \
 	xe_mocs.o \
 	xe_module.o \
+	xe_netlink.o \
 	xe_pat.o \
 	xe_pci.o \
 	xe_pcode.o \
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 628cb46a2509..8c928719a537 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -151,6 +151,8 @@ static void xe_driver_release(struct drm_device *dev)
 	pci_set_drvdata(to_pci_dev(xe->drm.dev), NULL);
 }
 
+extern const struct driver_genl_ops xe_genl_ops[];
+
 static struct drm_driver driver = {
 	/* Don't use MTRRs here; the Xserver or userspace app should
 	 * deal with them for Intel hardware.
@@ -159,6 +161,7 @@ static struct drm_driver driver = {
 	    DRIVER_GEM |
 	    DRIVER_RENDER | DRIVER_SYNCOBJ |
 	    DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
+
 	.open = xe_file_open,
 	.postclose = xe_file_close,
 
@@ -170,6 +173,7 @@ static struct drm_driver driver = {
 	.show_fdinfo = xe_drm_client_fdinfo,
 #endif
 	.release = &xe_driver_release,
+	.genl_ops = xe_genl_ops,
 
 	.ioctls = xe_ioctls,
 	.num_ioctls = ARRAY_SIZE(xe_ioctls),
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index a1bacf820d37..8201f3644b86 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -10,6 +10,7 @@
 
 #include <drm/drm_device.h>
 #include <drm/drm_file.h>
+#include <drm/drm_netlink.h>
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_devcoredump_types.h"
diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
new file mode 100644
index 000000000000..81d785455632
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+#include "xe_device.h"
+
+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+/* driver callbacks to DRM netlink commands*/
+const struct driver_genl_ops xe_genl_ops[] = {
+	[DRM_RAS_CMD_QUERY] =		{ .doit = xe_genl_list_errors },
+	[DRM_RAS_CMD_READ_ONE] =	{ .doit = xe_genl_read_error },
+	[DRM_RAS_CMD_READ_ALL] =	{ .doit = xe_genl_list_errors, },
+};
-- 
2.25.1



* [RFC v3 3/5] drm/xe/RAS: Expose the error counters
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
  2023-10-20 15:58 ` [RFC v2 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:39   ` Ruhl, Michael J
  2023-11-10 12:27   ` Tomer Tayar
  2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

We expose the various error counters supported by the hardware to
userspace via the genl subsystem through the registered commands.
DRM_RAS_CMD_QUERY lists the error names with their config ids,
DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
id, and DRM_RAS_CMD_READ_ALL lists the counters for all errors along
with their names and config ids.
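
For readers unfamiliar with the config id, the sketch below shows one
way such an id can be split into a GT id (upper bits) and a counter
index (lower bits), mirroring the config_gt_id() and config_counter()
helpers in this patch. The shift value of 56 is an assumption made for
the example (the patch takes it from __XE_PMU_GT_SHIFT), and
demo_make_config() is a hypothetical helper that is not part of the
series.

```c
#include <stdint.h>

/* Assumed shift; the patch uses __XE_PMU_GT_SHIFT from xe_drm.h instead. */
#define DEMO_GT_SHIFT 56

/* Mask selecting the counter bits below the GT id field. */
#define DEMO_COUNTER_MASK (~(~0ULL << DEMO_GT_SHIFT))

/* Hypothetical helper: pack a GT id and a counter index into one id. */
static uint64_t demo_make_config(unsigned int gt_id, uint64_t counter)
{
	return ((uint64_t)gt_id << DEMO_GT_SHIFT) |
	       (counter & DEMO_COUNTER_MASK);
}

/* Same arithmetic as config_gt_id() in the patch: keep the upper bits. */
static unsigned int demo_config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> DEMO_GT_SHIFT);
}

/* Same arithmetic as config_counter() in the patch: keep the lower bits. */
static uint64_t demo_config_counter(uint64_t config)
{
	return config & DEMO_COUNTER_MASK;
}
```

With these assumptions, packing GT 1 and counter 5 and then decoding
gives back GT id 1 and counter 5, which is the round trip a READ_ONE
request relies on.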

v2: Rebase

v3:
1. Presently xe_list_errors fills blank data for IGFX; prevent this with
an early IS_DGFX check (Michael J. Ruhl)
2. Update errors from all sources

Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_netlink.c | 499 +++++++++++++++++++++++++++++++-
 include/uapi/drm/xe_drm.h       |  81 ++++++
 2 files changed, 578 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
index 81d785455632..3e4cdb5e4920 100644
--- a/drivers/gpu/drm/xe/xe_netlink.c
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -2,16 +2,511 @@
 /*
  * Copyright © 2023 Intel Corporation
  */
+#include <drm/xe_drm.h>
+
 #include "xe_device.h"
 
-static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+#define MAX_ERROR_NAME	100
+
+static const char * const xe_hw_error_events[] = {
+		[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
+		[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
+		[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
+		[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
+		[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
+		[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
+		[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
+		[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
+		[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
+		[XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
+		[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
+		[XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
+		[XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
+		[XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
+		[XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
+		[XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
+		[XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
+		[XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
+		[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
+		[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
+		[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
+		[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
+		[XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
+		[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
+		[XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
+		[XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
+		[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
+		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
+		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
+		[XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
+		[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
+		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
+		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
+		[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
+		[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
+		[XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
+		[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
+		[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
+		[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
+		[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
+		[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
+		[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
+		[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
+		[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
+		[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
+		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
+		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
+		[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
+		[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
+		[XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
+		[XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
+		[XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
+		[XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
+		[XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
+		[XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
+		[XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
+		[XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
+};
+
+static const unsigned long xe_hw_error_map[] = {
+	[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
+	[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
+	[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
+	[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
+	[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
+	[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
+	[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
+	[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
+	[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
+	[XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
+	[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
+	[XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
+	[XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
+	[XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
+	[XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
+	[XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
+	[XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
+	[XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
+	[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
+	[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
+	[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
+	[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
+	[XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
+	[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
+	[XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
+	[XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
+	[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
+	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
+	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
+	[XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
+	[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
+	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
+	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
+	[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
+	[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
+	[XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
+	[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
+	[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
+	[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
+	[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
+	[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
+	[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
+	[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
+	[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
+	[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
+	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
+	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
+	[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
+	[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
+	[XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
+	[XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
+	[XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
+	[XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
+	[XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
+	[XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
+	[XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
+	[XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
+};
+
+static unsigned int config_gt_id(const u64 config)
+{
+	return config >> __XE_PMU_GT_SHIFT;
+}
+
+static u64 config_counter(const u64 config)
 {
+	return config & ~(~0ULL << __XE_PMU_GT_SHIFT);
+}
+
+static bool is_gt_error(const u64 config)
+{
+	unsigned int error;
+
+	error = config_counter(config);
+	if (error <= XE_GENL_GT_ERROR_FATAL_FPU)
+		return true;
+
+	return false;
+}
+
+static bool is_gt_vector_error(const u64 config)
+{
+	unsigned int error;
+
+	error = config_counter(config);
+	if (error >= XE_GENL_GT_ERROR_FATAL_TLB &&
+	    error <= XE_GENL_GT_ERROR_FATAL_L3BANK)
+		return true;
+
+	return false;
+}
+
+static bool is_pvc_invalid_gt_errors(const u64 config)
+{
+	switch (config_counter(config)) {
+	case XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
+	case XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
+	case XE_GENL_GT_ERROR_FATAL_ARR_BIST:
+	case XE_GENL_GT_ERROR_FATAL_L3_DOUB:
+	case XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
+	case XE_GENL_GT_ERROR_FATAL_IDI_PAR:
+	case XE_GENL_GT_ERROR_FATAL_SQIDI:
+	case XE_GENL_GT_ERROR_FATAL_SAMPLER:
+	case XE_GENL_GT_ERROR_FATAL_EU_IC:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool is_gsc_hw_error(const u64 config)
+{
+	if (config_counter(config) >= XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
+	    config_counter(config) <= XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
+		return true;
+
+	return false;
+}
+
+static bool is_soc_error(const u64 config)
+{
+	if (config_counter(config) >= XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
+	    config_counter(config) <= XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
+		return true;
+
+	return false;
+}
+
+static int
+config_status(struct xe_device *xe, u64 config)
+{
+	unsigned int gt_id = config_gt_id(config);
+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
+
+	if (!IS_DGFX(xe))
+		return -ENODEV;
+
+	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
+		return -ENOENT;
+
+	/* GSC HW ERRORS are present on root tile of
+	 * platform supporting MEMORY SPARING only
+	 */
+	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
+		return -ENODEV;
+
+	/* GT vector errors are valid only on platforms supporting error vectors */
+	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
+		return -ENODEV;
+
+	/* Skip gt errors not supported on pvc */
+	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
+		return -ENODEV;
+
+	/* FATAL FPU error is valid on PVC only */
+	if (config_counter(config) == XE_GENL_GT_ERROR_FATAL_FPU &&
+	    !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	return (config_counter(config) >=
+			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
+}
+
+static u64 get_counter_value(struct xe_device *xe, u64 config)
+{
+	const unsigned int gt_id = config_gt_id(config);
+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
+	unsigned int id = config_counter(config);
+
+	if (is_gt_error(config) || is_gt_vector_error(config))
+		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
+
+	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
+}
+
+int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
+{
+	struct nlattr *entry_attr;
+	bool counter = false;
+	struct xe_gt *gt;
+	int i, j;
+
+	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
+		     ARRAY_SIZE(xe_hw_error_map));
+
+	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
+		counter = true;
+
+	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
+	if (!entry_attr)
+		return -EMSGSIZE;
+
+	for_each_gt(gt, xe, j) {
+		char str[MAX_ERROR_NAME];
+		u64 val;
+
+		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
+			u64 config = XE_HW_ERROR(j, i);
+
+			if (config_status(xe, config))
+				continue;
+
+			/* should this be cleared everytime */
+			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
+
+			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
+				goto err;
+			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
+				goto err;
+			if (counter) {
+				val = get_counter_value(xe, config);
+				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD))
+					goto err;
+			}
+		}
+	}
+
+	nla_nest_end(new_msg, entry_attr);
+
 	return 0;
+err:
+	drm_dbg_driver(&xe->drm, "msg buff is small\n");
+	nla_nest_cancel(new_msg, entry_attr);
+	nlmsg_free(new_msg);
+
+	return -EMSGSIZE;
+}
+
+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
+{
+	struct xe_device *xe = to_xe_device(drm);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	struct sk_buff *new_msg;
+	int retries = 2;
+	void *usrhdr;
+	int ret = 0;
+
+	if (!IS_DGFX(xe))
+		return -ENODEV;
+
+	do {
+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		ret = fill_error_details(xe, info, new_msg);
+		if (!ret)
+			break;
+
+		msg_size += NLMSG_DEFAULT_SIZE;
+	} while (retries--);
+
+	if (!ret)
+		ret = drm_genl_reply(new_msg, info, usrhdr);
+
+	return ret;
 }
 
 static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
 {
-	return 0;
+	struct xe_device *xe = to_xe_device(drm);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	struct sk_buff *new_msg;
+	void *usrhdr;
+	int ret = 0;
+	int retries = 2;
+	u64 config, val;
+
+	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
+	ret = config_status(xe, config);
+	if (ret)
+		return ret;
+	do {
+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		val = get_counter_value(xe, config);
+		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
+			msg_size += NLMSG_DEFAULT_SIZE;
+			continue;
+		}
+
+		break;
+	} while (retries--);
+
+	ret = drm_genl_reply(new_msg, info, usrhdr);
+
+	return ret;
 }
 
 /* driver callbacks to DRM netlink commands*/
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 60cc6418d9a7..dbb3f1afba5f 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -1087,6 +1087,87 @@ struct drm_xe_vm_madvise {
 #define XE_PMU_MEDIA_GROUP_BUSY(gt)		___XE_PMU_OTHER(gt, 3)
 #define XE_PMU_ANY_ENGINE_GROUP_BUSY(gt)	___XE_PMU_OTHER(gt, 4)
 
+/**
+ * DOC: XE GENL netlink event IDs
+ * TODO: Add more details
+ */
+#define XE_HW_ERROR(gt, id) \
+	((id) | ((__u64)(gt) << __XE_PMU_GT_SHIFT))
+
+#define XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
+#define XE_GENL_GT_ERROR_CORRECTABLE_GUC		(1)
+#define XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
+#define XE_GENL_GT_ERROR_CORRECTABLE_SLM		(3)
+#define XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
+#define XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
+#define XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
+#define XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
+#define XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
+#define XE_GENL_GT_ERROR_FATAL_GUC			(9)
+#define XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
+#define XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
+#define XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
+#define XE_GENL_GT_ERROR_FATAL_SLM			(13)
+#define XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
+#define XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
+#define XE_GENL_GT_ERROR_FATAL_FPU			(16)
+#define XE_GENL_GT_ERROR_FATAL_TLB			(17)
+#define XE_GENL_GT_ERROR_FATAL_L3_FABRIC		(18)
+#define XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
+#define XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
+#define XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
+#define XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
+#define XE_GENL_SGUNIT_ERROR_CORRECTABLE		(23)
+#define XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
+#define XE_GENL_SGUNIT_ERROR_FATAL			(25)
+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
+#define XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
+#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
+#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
+#define XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD			(36)
+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP			(37)
+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ			(38)
+#define XE_GENL_SOC_ERROR_FATAL_ANR_MDFI		(39)
+#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2T		(40)
+#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2C		(41)
+#define XE_GENL_SOC_ERROR_FATAL_PCIE_AER		(42)
+#define XE_GENL_SOC_ERROR_FATAL_PCIE_ERR		(43)
+#define XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
+#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
+
+#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
+		(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
+#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
+		(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
+
+/* 109 is the last ID used by SOC errors */
+#define XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
+#define XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
+#define XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
+#define XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
+#define XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
+#define XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
+#define XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
+#define XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
+#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
+#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
+#define XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
+#define XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY	(121)
+#define XE_GENL_SGGI_ERROR_NONFATAL			(122)
+#define XE_GENL_SGLI_ERROR_NONFATAL			(123)
+#define XE_GENL_SGCI_ERROR_NONFATAL			(124)
+#define XE_GENL_MERT_ERROR_NONFATAL			(125)
+#define XE_GENL_SGGI_ERROR_FATAL			(126)
+#define XE_GENL_SGLI_ERROR_FATAL			(127)
+#define XE_GENL_SGCI_ERROR_FATAL			(128)
+#define XE_GENL_MERT_ERROR_FATAL			(129)
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.25.1
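
For readers following the ID layout in this patch, here is a minimal
userspace sketch of how a config value packs a GT index and a counter ID,
and how the HBM ID formulas expand. The SKETCH_* names are hypothetical and
exist only for illustration. The shift value of 56 is an assumption, since
__XE_PMU_GT_SHIFT is defined outside this hunk; the HBM formulas and the
base value 45 are copied from the defines above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shift; the real __XE_PMU_GT_SHIFT is defined outside this hunk. */
#define SKETCH_GT_SHIFT 56

/* Same shape as XE_HW_ERROR(gt, id): counter ID in the low bits, GT above. */
#define SKETCH_HW_ERROR(gt, id) \
	((uint64_t)(id) | ((uint64_t)(gt) << SKETCH_GT_SHIFT))

/* HBM ID formulas as in the patch; 45 is XE_GENL_SOC_ERROR_FATAL_SERR_SRCS. */
#define SKETCH_SERR_SRCS 45
#define SKETCH_NONFATAL_HBM(ss, n) \
	(SKETCH_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define SKETCH_FATAL_HBM(ss, n) \
	(SKETCH_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))

/* Mirrors config_gt_id(): everything above the shift is the GT index. */
static unsigned int sketch_gt_id(uint64_t config)
{
	return (unsigned int)(config >> SKETCH_GT_SHIFT);
}

/* Mirrors config_counter(): mask off the GT bits to recover the ID. */
static uint64_t sketch_counter(uint64_t config)
{
	return config & ~(~0ULL << SKETCH_GT_SHIFT);
}

/* Round-trip check: packing then unpacking must return both inputs. */
static int sketch_round_trips(unsigned int gt, uint64_t id)
{
	uint64_t config = SKETCH_HW_ERROR(gt, id);

	assert(id < (1ULL << SKETCH_GT_SHIFT)); /* ID must fit below the shift */
	return sketch_gt_id(config) == gt && sketch_counter(config) == id;
}
```

With these formulas the non-fatal HBM IDs span 46 to 77 and the fatal HBM
IDs span 78 to 109, which agrees with the "109 is the last ID used by SOC
errors" comment in the header.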



* [RFC 4/5] drm/netlink: Define multicast groups
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (2 preceding siblings ...)
  2023-10-20 15:58 ` [RFC v3 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:39   ` Ruhl, Michael J
  2023-10-20 15:58 ` [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

The netlink subsystem supports event notifications to userspace. We define
two multicast groups, one for correctable and one for uncorrectable errors,
to which userspace can subscribe to be notified when such an error occurs.
The group names are local to the driver's genl netlink family.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/drm_netlink.c  | 7 +++++++
 include/drm/drm_netlink.h      | 5 +++++
 include/uapi/drm/drm_netlink.h | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
index 8add249c1da3..425a7355a573 100644
--- a/drivers/gpu/drm/drm_netlink.c
+++ b/drivers/gpu/drm/drm_netlink.c
@@ -12,6 +12,11 @@
 
 DEFINE_XARRAY(drm_dev_xarray);
 
+static const struct genl_multicast_group drm_event_mcgrps[] = {
+	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
+};
+
 /**
  * drm_genl_reply - response to a request
  * @msg: socket buffer
@@ -133,6 +138,8 @@ static void drm_genl_family_init(struct drm_device *dev)
 	dev->drm_genl_family.ops = drm_genl_ops;
 	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
 	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
+	dev->drm_genl_family.mcgrps = drm_event_mcgrps;
+	dev->drm_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
 	dev->drm_genl_family.module = dev->dev->driver->owner;
 }
 
diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
index 54527dae7847..758239643c17 100644
--- a/include/drm/drm_netlink.h
+++ b/include/drm/drm_netlink.h
@@ -13,6 +13,11 @@
 
 struct drm_device;
 
+enum mcgrps_events {
+	DRM_GENL_MCAST_CORR_ERR,
+	DRM_GENL_MCAST_UNCORR_ERR,
+};
+
 struct driver_genl_ops {
 	int		       (*doit)(struct drm_device *dev,
 				       struct sk_buff *skb,
diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
index aab42147a20e..c7a0ce5b4624 100644
--- a/include/uapi/drm/drm_netlink.h
+++ b/include/uapi/drm/drm_netlink.h
@@ -26,6 +26,8 @@
 #define _DRM_NETLINK_H_
 
 #define DRM_GENL_VERSION 1
+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -43,6 +45,8 @@ enum drm_genl_error_cmds {
 	DRM_RAS_CMD_READ_ONE,
 	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
 	DRM_RAS_CMD_READ_ALL,
+	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */
+	DRM_RAS_CMD_ERROR_EVENT,
 
 	__DRM_CMD_MAX,
 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
-- 
2.25.1
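
As an illustration of the group table in this patch, the sketch below
models it in plain userspace C. The sketch_* names and the struct are
hypothetical stand-ins, not kernel API; only the two group name strings and
the enum ordering are taken from the patch. It shows why the designated
initializers matter: the array index is what identifies a group inside the
family, while the name is what a subscriber would look up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Indices mirror enum mcgrps_events from the patch. */
enum sketch_mcgrp {
	SKETCH_MCAST_CORR_ERR,
	SKETCH_MCAST_UNCORR_ERR,
	SKETCH_MCAST_COUNT,
};

/* Hypothetical stand-in for struct genl_multicast_group. */
struct sketch_group {
	const char *name;
};

/* Designated initializers keep each name tied to its enum index. */
static const struct sketch_group sketch_groups[SKETCH_MCAST_COUNT] = {
	[SKETCH_MCAST_CORR_ERR] = { .name = "drm_corr_err" },
	[SKETCH_MCAST_UNCORR_ERR] = { .name = "drm_uncorr_err" },
};

/* Return the family-local index of a group name, or -1 if unknown. */
static int sketch_group_index(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < SKETCH_MCAST_COUNT; i++) {
		if (strcmp(sketch_groups[i].name, name) == 0)
			return (int)i;
	}
	return -1;
}
```

Because the names are scoped to one family, two devices (for example
families named after card0 and card1) can each expose "drm_corr_err"
without clashing.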



* [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (3 preceding siblings ...)
  2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:40   ` Ruhl, Michael J
  2023-11-10 12:27   ` Tomer Tayar
  2023-10-23 15:29 ` [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Alex Deucher
  2023-11-10 12:23 ` Tomer Tayar
  6 siblings, 2 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

Whenever a correctable or an uncorrectable error happens, an event is sent
to the listeners subscribed to the corresponding multicast group.

v2: Rebase

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/xe/xe_hw_error.c | 33 ++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
index bab6d4cf0b69..b0befb5e01cb 100644
--- a/drivers/gpu/drm/xe/xe_hw_error.c
+++ b/drivers/gpu/drm/xe/xe_hw_error.c
@@ -786,6 +786,37 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 				(HARDWARE_ERROR_MAX << 1) + 1);
 }
 
+static void
+generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
+{
+	struct sk_buff *msg;
+	void *hdr;
+
+	if (!xe->drm.drm_genl_family.module)
+		return;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+	if (!msg) {
+		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
+		return;
+	}
+
+	hdr = genlmsg_put(msg, 0, 0, &xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
+	if (!hdr) {
+		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	genlmsg_multicast(&xe->drm.drm_genl_family, msg, 0,
+			  hw_err ?
+			  DRM_GENL_MCAST_UNCORR_ERR
+			  : DRM_GENL_MCAST_CORR_ERR,
+			  GFP_ATOMIC);
+}
+
 static void
 xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
 {
@@ -849,6 +880,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
 	}
 
 	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err), errsrc);
+
+	generate_netlink_event(tile_to_xe(tile), hw_err);
 unlock:
 	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
 }
-- 
2.25.1
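
The group selection in generate_netlink_event() can be sketched in
isolation as follows. The enum values here are assumptions: the patch only
implies that the correctable severity is zero (through the "hw_err ?" test),
and the sketch_* names are hypothetical.

```c
#include <assert.h>

/* Assumed ordering; only "correctable is zero" is implied by the patch. */
enum sketch_hw_err {
	SKETCH_ERR_CORRECTABLE = 0,
	SKETCH_ERR_NONFATAL,
	SKETCH_ERR_FATAL,
	SKETCH_ERR_MAX,
};

/* Mirrors the two multicast groups defined in patch 4/5. */
enum sketch_event_group {
	SKETCH_GROUP_CORR,
	SKETCH_GROUP_UNCORR,
};

/* Any nonzero severity is reported on the uncorrectable group. */
static enum sketch_event_group sketch_pick_group(enum sketch_hw_err hw_err)
{
	assert(hw_err < SKETCH_ERR_MAX);
	return hw_err ? SKETCH_GROUP_UNCORR : SKETCH_GROUP_CORR;
}
```

Under that assumption, non-fatal and fatal errors land in the same
uncorrectable group, so a listener that needs to tell them apart would have
to read the counters after being notified.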



* RE: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2023-10-20 20:36   ` Ruhl, Michael J
  2023-10-21  1:10     ` Aravind Iddamsetty
  2023-11-10 12:24   ` Tomer Tayar
  1 sibling, 1 reply; 31+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:36 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
>
>Define the netlink registration interface and commands, attributes that
>can be commonly used across by drm drivers. This patch intends to use
>the generic netlink family to expose various stats of device. At present
>it defines some commands that shall be used to expose RAS error counters.
>
>v2:
>define common interfaces to genl netlink subsystem that all drm drivers
>can leverage.(Tomer Tayar)
>
>v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>register to netlink subsystem (Daniel Vetter)
>
>v4:(Michael J. Ruhl)
>1. rename drm_genl_send to drm_genl_reply
>2. catch error from xa_store and handle appropriately

Hi Aravind,

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M

>Cc: Tomer Tayar <ttayar@habana.ai>
>Cc: Daniel Vetter <daniel@ffwll.ch>
>Cc: Michael J. Ruhl <michael.j.ruhl@intel.com>
>
>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/Makefile       |   1 +
> drivers/gpu/drm/drm_drv.c      |   7 ++
> drivers/gpu/drm/drm_netlink.c  | 188
>+++++++++++++++++++++++++++++++++
> include/drm/drm_device.h       |   8 ++
> include/drm/drm_drv.h          |   7 ++
> include/drm/drm_netlink.h      |  30 ++++++
> include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
> 7 files changed, 324 insertions(+)
> create mode 100644 drivers/gpu/drm/drm_netlink.c
> create mode 100644 include/drm/drm_netlink.h
> create mode 100644 include/uapi/drm/drm_netlink.h
>
>diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>index ee64c51274ad..60864369adaa 100644
>--- a/drivers/gpu/drm/Makefile
>+++ b/drivers/gpu/drm/Makefile
>@@ -35,6 +35,7 @@ drm-y := \
> 	drm_mode_object.o \
> 	drm_modes.o \
> 	drm_modeset_lock.o \
>+	drm_netlink.o \
> 	drm_plane.o \
> 	drm_prime.o \
> 	drm_print.o \
>diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>index 535f16e7882e..31f55c1c7524 100644
>--- a/drivers/gpu/drm/drm_drv.c
>+++ b/drivers/gpu/drm/drm_drv.c
>@@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev,
>unsigned long flags)
> 	if (ret)
> 		goto err_minors;
>
>+	if (driver->genl_ops) {
>+		ret = drm_genl_register(dev);
>+		if (ret)
>+			goto err_minors;
>+	}
>+
> 	ret = create_compat_control_link(dev);
> 	if (ret)
> 		goto err_minors;
>@@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
> {
> 	drm_privacy_screen_lookup_exit();
> 	accel_core_exit();
>+	drm_genl_exit();
> 	unregister_chrdev(DRM_MAJOR, "drm");
> 	debugfs_remove(drm_debugfs_root);
> 	drm_sysfs_destroy();
>diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>new file mode 100644
>index 000000000000..8add249c1da3
>--- /dev/null
>+++ b/drivers/gpu/drm/drm_netlink.c
>@@ -0,0 +1,188 @@
>+// SPDX-License-Identifier: MIT
>+/*
>+ * Copyright © 2023 Intel Corporation
>+ */
>+
>+#include <drm/drm_device.h>
>+#include <drm/drm_drv.h>
>+#include <drm/drm_file.h>
>+#include <drm/drm_managed.h>
>+#include <drm/drm_netlink.h>
>+#include <drm/drm_print.h>
>+
>+DEFINE_XARRAY(drm_dev_xarray);
>+
>+/**
>+ * drm_genl_reply - response to a request
>+ * @msg: socket buffer
>+ * @info: receiver information
>+ * @usrhdr: pointer to user specific header in the message buffer
>+ *
>+ * RETURNS:
>+ * 0 on success and negative error code on failure
>+ */
>+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
>+{
>+	int ret;
>+
>+	genlmsg_end(msg, usrhdr);
>+
>+	ret = genlmsg_reply(msg, info);
>+	if (ret)
>+		nlmsg_free(msg);
>+
>+	return ret;
>+}
>+EXPORT_SYMBOL(drm_genl_reply);
>+
>+/**
>+ * drm_genl_alloc_msg - allocate genl message buffer
>+ * @dev: drm_device for which the message is being allocated
>+ * @info: receiver information
>+ * @usrhdr: pointer to user specific header in the message buffer
>+ *
>+ * RETURNS:
>+ * pointer to new allocated buffer on success, NULL on failure
>+ */
>+struct sk_buff *
>+drm_genl_alloc_msg(struct drm_device *dev,
>+		   struct genl_info *info,
>+		   size_t msg_size, void **usrhdr)
>+{
>+	struct sk_buff *new_msg;
>+	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
>+	if (!new_msg)
>+		return new_msg;
>+
>+	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
>+	if (!*usrhdr) {
>+		nlmsg_free(new_msg);
>+		new_msg = NULL;
>+	}
>+
>+	return new_msg;
>+}
>+EXPORT_SYMBOL(drm_genl_alloc_msg);
>+
>+static struct drm_device *genl_to_dev(struct genl_info *info)
>+{
>+	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
>+}
>+
>+static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>+{
>+	struct drm_device *dev = genl_to_dev(info);
>+
>+	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
>+		return -EINVAL;
>+
>+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>+		return -EOPNOTSUPP;
>+
>+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg,
>info);
>+}
>+
>+static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>+{
>+	struct drm_device *dev = genl_to_dev(info);
>+
>+	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
>+		return -EINVAL;
>+
>+	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>+		return -EOPNOTSUPP;
>+
>+	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg,
>info);
>+}
>+
>+/* attribute policies */
>+static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>+	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
>+};
>+
>+static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1]
>= {
>+	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
>+};
>+
>+/* drm genl operations definition */
>+const struct genl_ops drm_genl_ops[] = {
>+	{
>+		.cmd = DRM_RAS_CMD_QUERY,
>+		.doit = drm_genl_list_errors,
>+		.policy = drm_attr_policy_query,
>+	},
>+	{
>+		.cmd = DRM_RAS_CMD_READ_ONE,
>+		.doit = drm_genl_read_error,
>+		.policy = drm_attr_policy_read_one,
>+	},
>+	{
>+		.cmd = DRM_RAS_CMD_READ_ALL,
>+		.doit = drm_genl_list_errors,
>+		.policy = drm_attr_policy_query,
>+	},
>+};
>+
>+static void drm_genl_family_init(struct drm_device *dev)
>+{
>+	/* Use drm primary node name eg: card0 to name the genl family */
>+	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);
>+	dev->drm_genl_family.version = DRM_GENL_VERSION;
>+	dev->drm_genl_family.parallel_ops = true;
>+	dev->drm_genl_family.ops = drm_genl_ops;
>+	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
>+	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>+	dev->drm_genl_family.module = dev->dev->driver->owner;
>+}
>+
>+static void drm_genl_deregister(struct drm_device *dev,  void *arg)
>+{
>+	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
>+
>+	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
>+
>+	genl_unregister_family(&dev->drm_genl_family);
>+}
>+
>+/**
>+ * drm_genl_register - Register genl family
>+ * @dev: drm_device for which genl family needs to be registered
>+ *
>+ * RETURNS:
>+ * 0 on success and negative error code on failure
>+ */
>+int drm_genl_register(struct drm_device *dev)
>+{
>+	int ret;
>+
>+	drm_genl_family_init(dev);
>+
>+	ret = genl_register_family(&dev->drm_genl_family);
>+	if (ret < 0) {
>+		drm_warn(dev, "genl family registration failed\n");
>+		return ret;
>+	}
>+
>+	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
>+
>+	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
>+	if (ret)
>+		goto genl_unregister;
>+
>+	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
>+
>+	return ret;
>+
>+genl_unregister:
>+	genl_unregister_family(&dev->drm_genl_family);
>+	return ret;
>+}
>+
>+/**
>+ * drm_genl_exit: destroy drm_dev_xarray
>+ */
>+void drm_genl_exit(void)
>+{
>+	xa_destroy(&drm_dev_xarray);
>+}
>diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>index c490977ee250..d3ae91b7714d 100644
>--- a/include/drm/drm_device.h
>+++ b/include/drm/drm_device.h
>@@ -8,6 +8,7 @@
>
> #include <drm/drm_legacy.h>
> #include <drm/drm_mode_config.h>
>+#include <drm/drm_netlink.h>
>
> struct drm_driver;
> struct drm_minor;
>@@ -318,6 +319,13 @@ struct drm_device {
> 	 */
> 	struct dentry *debugfs_root;
>
>+	/**
>+	 * @drm_genl_family:
>+	 *
>+	 * Generic netlink family registration structure.
>+	 */
>+	struct genl_family drm_genl_family;
>+
> 	/* Everything below here is for legacy driver, never use! */
> 	/* private: */
> #if IS_ENABLED(CONFIG_DRM_LEGACY)
>diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>index e2640dc64e08..ebdb7850d235 100644
>--- a/include/drm/drm_drv.h
>+++ b/include/drm/drm_drv.h
>@@ -434,6 +434,13 @@ struct drm_driver {
> 	 */
> 	const struct file_operations *fops;
>
>+	/**
>+	 * @genl_ops:
>+	 *
>+	 * Drivers private callback to genl commands
>+	 */
>+	const struct driver_genl_ops *genl_ops;
>+
> #ifdef CONFIG_DRM_LEGACY
> 	/* Everything below here is for legacy driver, never use! */
> 	/* private: */
>diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>new file mode 100644
>index 000000000000..54527dae7847
>--- /dev/null
>+++ b/include/drm/drm_netlink.h
>@@ -0,0 +1,30 @@
>+/* SPDX-License-Identifier: MIT */
>+/*
>+ * Copyright © 2023 Intel Corporation
>+ */
>+
>+#ifndef __DRM_NETLINK_H__
>+#define __DRM_NETLINK_H__
>+
>+#include <linux/netdevice.h>
>+#include <net/genetlink.h>
>+#include <net/sock.h>
>+#include <uapi/drm/drm_netlink.h>
>+
>+struct drm_device;
>+
>+struct driver_genl_ops {
>+	int		       (*doit)(struct drm_device *dev,
>+				       struct sk_buff *skb,
>+				       struct genl_info *info);
>+};
>+
>+int drm_genl_register(struct drm_device *dev);
>+void drm_genl_exit(void);
>+int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
>+struct sk_buff *
>+drm_genl_alloc_msg(struct drm_device *dev,
>+		   struct genl_info *info,
>+		   size_t msg_size, void **usrhdr);
>+#endif
>+
>diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>new file mode 100644
>index 000000000000..aab42147a20e
>--- /dev/null
>+++ b/include/uapi/drm/drm_netlink.h
>@@ -0,0 +1,83 @@
>+/* SPDX-License-Identifier: MIT */
>+/*
>+ * Copyright 2023 Intel Corporation
>+ *
>+ * Permission is hereby granted, free of charge, to any person obtaining a
>+ * copy of this software and associated documentation files (the "Software"),
>+ * to deal in the Software without restriction, including without limitation
>+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>+ * and/or sell copies of the Software, and to permit persons to whom the
>+ * Software is furnished to do so, subject to the following conditions:
>+ *
>+ * The above copyright notice and this permission notice (including the next
>+ * paragraph) shall be included in all copies or substantial portions of the
>+ * Software.
>+ *
>+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>EXPRESS OR
>+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
>MERCHANTABILITY,
>+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO
>EVENT SHALL
>+ * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM,
>DAMAGES OR
>+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
>OTHERWISE,
>+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
>USE OR
>+ * OTHER DEALINGS IN THE SOFTWARE.
>+ */
>+
>+#ifndef _DRM_NETLINK_H_
>+#define _DRM_NETLINK_H_
>+
>+#define DRM_GENL_VERSION 1
>+
>+#if defined(__cplusplus)
>+extern "C" {
>+#endif
>+
>+/**
>+ * enum drm_genl_error_cmds - Supported error commands
>+ *
>+ */
>+enum drm_genl_error_cmds {
>+	DRM_CMD_UNSPEC,
>+	/** @DRM_RAS_CMD_QUERY: Command to list all errors names with
>config-id */
>+	DRM_RAS_CMD_QUERY,
>+	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a
>specific error */
>+	DRM_RAS_CMD_READ_ONE,
>+	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all
>errors */
>+	DRM_RAS_CMD_READ_ALL,
>+
>+	__DRM_CMD_MAX,
>+	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>+};
>+
>+/**
>+ * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
>+ *
>+ */
>+enum drm_error_attr {
>+	DRM_ATTR_UNSPEC,
>+	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>+	/**
>+	 * @DRM_RAS_ATTR_REQUEST: Should be used with
>DRM_RAS_CMD_QUERY,
>+	 * DRM_RAS_CMD_READ_ALL
>+	 */
>+	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
>+	/**
>+	 * @DRM_RAS_ATTR_QUERY_REPLY: First Nested attributed sent as a
>+	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL
>commands.
>+	 */
>+	DRM_RAS_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>+	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
>+	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>+	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
>+	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
>+	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
>+	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
>+
>+	__DRM_ATTR_MAX,
>+	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>+};
>+
>+#if defined(__cplusplus)
>+}
>+#endif
>+
>+#endif
>--
>2.25.1
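
For reference, the dispatch pattern used by drm_genl_list_errors() and
drm_genl_read_error() (a per-driver callback table indexed by command) can
be sketched in plain C as below. All sketch_* names are hypothetical, and
the explicit bounds check is an addition for the sketch; the quoted patch
indexes genl_ops directly and only checks for a missing doit callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Command numbering mirrors enum drm_genl_error_cmds. */
enum sketch_cmd {
	SKETCH_CMD_UNSPEC,
	SKETCH_CMD_QUERY,
	SKETCH_CMD_READ_ONE,
	SKETCH_CMD_READ_ALL,
	SKETCH_CMD_COUNT,
};

struct sketch_dev;

/* Hypothetical stand-in for struct driver_genl_ops. */
struct sketch_ops {
	int (*doit)(struct sketch_dev *dev, int arg);
};

struct sketch_dev {
	const struct sketch_ops *ops;	/* table indexed by command */
	int last;			/* records the last handled argument */
};

static int sketch_query(struct sketch_dev *dev, int arg)
{
	dev->last = arg;
	return 0;
}

/* The driver fills only the commands it supports. */
static const struct sketch_ops sketch_driver_ops[SKETCH_CMD_COUNT] = {
	[SKETCH_CMD_QUERY] = { .doit = sketch_query },
	[SKETCH_CMD_READ_ALL] = { .doit = sketch_query },
};

/* Route a command to the driver callback, rejecting unsupported ones. */
static int sketch_dispatch(struct sketch_dev *dev, unsigned int cmd, int arg)
{
	assert(dev != NULL && dev->ops != NULL);
	if (cmd >= SKETCH_CMD_COUNT || !dev->ops[cmd].doit)
		return -EOPNOTSUPP;
	return dev->ops[cmd].doit(dev, arg);
}
```

A command with no callback, or one outside the table, returns -EOPNOTSUPP
instead of calling through a NULL pointer.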



* RE: [RFC v2 2/5] drm/xe/RAS: Register netlink capability
  2023-10-20 15:58 ` [RFC v2 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
@ 2023-10-20 20:37   ` Ruhl, Michael J
  0 siblings, 0 replies; 31+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:37 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC v2 2/5] drm/xe/RAS: Register netlink capability
>
>Register netlink capability with the DRM and register the driver
>callbacks to DRM RAS netlink commands.
>
>v2:
>Move the netlink registration parts to DRM susbsytem (Tomer Tayar)
>
>Cc: Tomer Tayar <ttayar@habana.ai>
>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/xe/Makefile          |  1 +
> drivers/gpu/drm/xe/xe_device.c       |  4 ++++
> drivers/gpu/drm/xe/xe_device_types.h |  1 +
> drivers/gpu/drm/xe/xe_netlink.c      | 22 ++++++++++++++++++++++
> 4 files changed, 28 insertions(+)
> create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>
>diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>index ed772f440689..048f9a23e2f0 100644
>--- a/drivers/gpu/drm/xe/Makefile
>+++ b/drivers/gpu/drm/xe/Makefile
>@@ -87,6 +87,7 @@ xe-y += xe_bb.o \
> 	xe_mmio.o \
> 	xe_mocs.o \
> 	xe_module.o \
>+	xe_netlink.o \
> 	xe_pat.o \
> 	xe_pci.o \
> 	xe_pcode.o \
>diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>index 628cb46a2509..8c928719a537 100644
>--- a/drivers/gpu/drm/xe/xe_device.c
>+++ b/drivers/gpu/drm/xe/xe_device.c
>@@ -151,6 +151,8 @@ static void xe_driver_release(struct drm_device *dev)
> 	pci_set_drvdata(to_pci_dev(xe->drm.dev), NULL);
> }
>
>+extern const struct driver_genl_ops xe_genl_ops[];
>+
> static struct drm_driver driver = {
> 	/* Don't use MTRRs here; the Xserver or userspace app should
> 	 * deal with them for Intel hardware.
>@@ -159,6 +161,7 @@ static struct drm_driver driver = {
> 	    DRIVER_GEM |
> 	    DRIVER_RENDER | DRIVER_SYNCOBJ |
> 	    DRIVER_SYNCOBJ_TIMELINE | DRIVER_GEM_GPUVA,
>+

Gratuitous blank line?

With or without this cleaned up:

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M
> 	.open = xe_file_open,
> 	.postclose = xe_file_close,
>
>@@ -170,6 +173,7 @@ static struct drm_driver driver = {
> 	.show_fdinfo = xe_drm_client_fdinfo,
> #endif
> 	.release = &xe_driver_release,
>+	.genl_ops = xe_genl_ops,
>
> 	.ioctls = xe_ioctls,
> 	.num_ioctls = ARRAY_SIZE(xe_ioctls),
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index a1bacf820d37..8201f3644b86 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -10,6 +10,7 @@
>
> #include <drm/drm_device.h>
> #include <drm/drm_file.h>
>+#include <drm/drm_netlink.h>
> #include <drm/ttm/ttm_device.h>
>
> #include "xe_devcoredump_types.h"
>diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>new file mode 100644
>index 000000000000..81d785455632
>--- /dev/null
>+++ b/drivers/gpu/drm/xe/xe_netlink.c
>@@ -0,0 +1,22 @@
>+// SPDX-License-Identifier: MIT
>+/*
>+ * Copyright © 2023 Intel Corporation
>+ */
>+#include "xe_device.h"
>+
>+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>+{
>+	return 0;
>+}
>+
>+static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>+{
>+	return 0;
>+}
>+
>+/* driver callbacks to DRM netlink commands*/
>+const struct driver_genl_ops xe_genl_ops[] = {
>+	[DRM_RAS_CMD_QUERY] =		{ .doit = xe_genl_list_errors },
>+	[DRM_RAS_CMD_READ_ONE] =	{ .doit = xe_genl_read_error },
>+	[DRM_RAS_CMD_READ_ALL] =	{ .doit = xe_genl_list_errors, },
>+};
>--
>2.25.1
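
For readers following the thread, the registration pattern in this patch
(a table indexed by command ID whose entries carry a .doit callback) can
be sketched in plain userspace C. This is only an illustration under
stated assumptions: the enum values, the ras_ops struct, the simplified
callback signature and ras_dispatch() are hypothetical stand-ins, not the
drm_netlink.h definitions, which are not shown in this message.

```c
#include <assert.h>

/* Hypothetical stand-ins for the DRM RAS command IDs used in the patch. */
enum ras_cmd {
	RAS_CMD_QUERY = 1,
	RAS_CMD_READ_ONE,
	RAS_CMD_READ_ALL,
	RAS_CMD_MAX,
};

/* Simplified callback type; the real .doit takes drm_device, sk_buff and genl_info. */
typedef int (*ras_doit_fn)(int arg);

struct ras_ops {
	ras_doit_fn doit;
};

static int list_errors(int arg) { return arg + 100; }
static int read_error(int arg)  { return arg + 200; }

/* One slot per command; slots that are not named stay zero-initialised (NULL). */
static const struct ras_ops demo_ops[RAS_CMD_MAX] = {
	[RAS_CMD_QUERY]    = { .doit = list_errors },
	[RAS_CMD_READ_ONE] = { .doit = read_error },
	[RAS_CMD_READ_ALL] = { .doit = list_errors },
};

static_assert(sizeof(demo_ops) / sizeof(demo_ops[0]) == RAS_CMD_MAX,
	      "one slot per command");

/* Look up the callback for cmd; returns -1 for unknown or unregistered commands. */
static int ras_dispatch(unsigned int cmd, int arg)
{
	if (cmd >= RAS_CMD_MAX || !demo_ops[cmd].doit)
		return -1;
	return demo_ops[cmd].doit(arg);
}
```

As in the patch, QUERY and READ_ALL share one handler, so the handler has
to look at the command to decide whether to include counter values.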



* RE: [RFC v3 3/5] drm/xe/RAS: Expose the error counters
  2023-10-20 15:58 ` [RFC v3 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
@ 2023-10-20 20:39   ` Ruhl, Michael J
  2023-11-10 12:27   ` Tomer Tayar
  1 sibling, 0 replies; 31+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:39 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC v3 3/5] drm/xe/RAS: Expose the error counters
>
>We expose the various error counters supported by the hardware via the genl
>subsystem through the registered commands to userspace. The
>DRM_RAS_CMD_QUERY lists the error names with config id,
>DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
>id and the DRM_RAS_CMD_READ_ALL lists the counters for all errors along
>with their names and config ids.
>
>v2: Rebase
>
>v3:
>1. presently xe_list_errors fills blank data for IGFX, prevent it by
>having an early check of IS_DGFX (Michael J. Ruhl)
>2. update errors from all sources

Hi Aravind,

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M

>Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/xe/xe_netlink.c | 499 +++++++++++++++++++++++++++++++-
> include/uapi/drm/xe_drm.h       |  81 ++++++
> 2 files changed, 578 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>index 81d785455632..3e4cdb5e4920 100644
>--- a/drivers/gpu/drm/xe/xe_netlink.c
>+++ b/drivers/gpu/drm/xe/xe_netlink.c
>@@ -2,16 +2,511 @@
> /*
>  * Copyright © 2023 Intel Corporation
>  */
>+#include <drm/xe_drm.h>
>+
> #include "xe_device.h"
>
>-static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>+#define MAX_ERROR_NAME	100
>+
>+static const char * const xe_hw_error_events[] = {
>+		[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
>+		[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
>+		[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
>+		[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
>+		[XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
>+		[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
>+		[XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
>+		[XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
>+		[XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
>+		[XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
>+		[XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
>+		[XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
>+		[XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
>+		[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
>+		[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
>+		[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
>+		[XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
>+		[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
>+		[XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
>+		[XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
>+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
>+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
>+		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
>+		[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
>+		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
>+		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
>+		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
>+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
>+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
>+		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
>+		[XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
>+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
>+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
>+		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
>+		[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
>+		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
>+		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
>+		[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
>+		[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
>+		[XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
>+		[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
>+		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
>+		[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
>+		[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
>+		[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
>+		[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
>+		[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
>+		[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
>+		[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
>+		[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
>+		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
>+		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
>+		[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
>+		[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
>+		[XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
>+		[XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
>+		[XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
>+		[XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
>+		[XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
>+		[XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
>+		[XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
>+		[XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
>+};
>+
>+static const unsigned long xe_hw_error_map[] = {
>+	[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
>+	[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
>+	[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
>+	[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
>+	[XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
>+	[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
>+	[XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
>+	[XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
>+	[XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
>+	[XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
>+	[XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
>+	[XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
>+	[XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
>+	[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
>+	[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
>+	[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
>+	[XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
>+	[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
>+	[XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
>+	[XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
>+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
>+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
>+	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
>+	[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
>+	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
>+	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
>+	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
>+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
>+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
>+	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
>+	[XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
>+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
>+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
>+	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
>+	[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
>+	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
>+	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
>+	[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
>+	[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
>+	[XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
>+	[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
>+	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
>+	[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
>+	[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
>+	[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
>+	[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
>+	[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
>+	[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
>+	[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
>+	[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
>+	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
>+	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
>+	[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
>+	[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
>+	[XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
>+	[XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
>+	[XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
>+	[XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
>+	[XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
>+	[XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
>+	[XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
>+	[XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
>+};
>+
>+static unsigned int config_gt_id(const u64 config)
>+{
>+	return config >> __XE_PMU_GT_SHIFT;
>+}
>+
>+static u64 config_counter(const u64 config)
> {
>+	return config & ~(~0ULL << __XE_PMU_GT_SHIFT);
>+}
>+
>+static bool is_gt_error(const u64 config)
>+{
>+	unsigned int error;
>+
>+	error = config_counter(config);
>+	if (error <= XE_GENL_GT_ERROR_FATAL_FPU)
>+		return true;
>+
>+	return false;
>+}
>+
>+static bool is_gt_vector_error(const u64 config)
>+{
>+	unsigned int error;
>+
>+	error = config_counter(config);
>+	if (error >= XE_GENL_GT_ERROR_FATAL_TLB &&
>+	    error <= XE_GENL_GT_ERROR_FATAL_L3BANK)
>+		return true;
>+
>+	return false;
>+}
>+
>+static bool is_pvc_invalid_gt_errors(const u64 config)
>+{
>+	switch (config_counter(config)) {
>+	case XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
>+	case XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
>+	case XE_GENL_GT_ERROR_FATAL_ARR_BIST:
>+	case XE_GENL_GT_ERROR_FATAL_L3_DOUB:
>+	case XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
>+	case XE_GENL_GT_ERROR_FATAL_IDI_PAR:
>+	case XE_GENL_GT_ERROR_FATAL_SQIDI:
>+	case XE_GENL_GT_ERROR_FATAL_SAMPLER:
>+	case XE_GENL_GT_ERROR_FATAL_EU_IC:
>+		return true;
>+	default:
>+		return false;
>+	}
>+}
>+
>+static bool is_gsc_hw_error(const u64 config)
>+{
>+	if (config_counter(config) >= XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
>+	    config_counter(config) <= XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
>+		return true;
>+
>+	return false;
>+}
>+
>+static bool is_soc_error(const u64 config)
>+{
>+	if (config_counter(config) >= XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
>+	    config_counter(config) <= XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
>+		return true;
>+
>+	return false;
>+}
>+
>+static int
>+config_status(struct xe_device *xe, u64 config)
>+{
>+	unsigned int gt_id = config_gt_id(config);
>+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>+
>+	if (!IS_DGFX(xe))
>+		return -ENODEV;
>+
>+	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
>+		return -ENOENT;
>+
>+	/* GSC HW ERRORS are present on root tile of
>+	 * platform supporting MEMORY SPARING only
>+	 */
>+	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
>+		return -ENODEV;
>+
>+	/* GT vector errors are valid only on platforms supporting error vectors */
>+	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
>+		return -ENODEV;
>+
>+	/* Skip gt errors not supported on pvc */
>+	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
>+		return -ENODEV;
>+
>+	/* FATAL FPU error is valid on PVC only */
>+	if (config_counter(config) == XE_GENL_GT_ERROR_FATAL_FPU &&
>+	    !(xe->info.platform == XE_PVC))
>+		return -ENODEV;
>+
>+	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
>+		return -ENODEV;
>+
>+	return (config_counter(config) >=
>+			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
>+}
>+
>+static u64 get_counter_value(struct xe_device *xe, u64 config)
>+{
>+	const unsigned int gt_id = config_gt_id(config);
>+	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>+	unsigned int id = config_counter(config);
>+
>+	if (is_gt_error(config) || is_gt_vector_error(config))
>+		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
>+
>+	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
>+}
>+
>+int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
>+{
>+	struct nlattr *entry_attr;
>+	bool counter = false;
>+	struct xe_gt *gt;
>+	int i, j;
>+
>+	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
>+		     ARRAY_SIZE(xe_hw_error_map));
>+
>+	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
>+		counter = true;
>+
>+	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
>+	if (!entry_attr)
>+		return -EMSGSIZE;
>+
>+	for_each_gt(gt, xe, j) {
>+		char str[MAX_ERROR_NAME];
>+		u64 val;
>+
>+		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
>+			u64 config = XE_HW_ERROR(j, i);
>+
>+			if (config_status(xe, config))
>+				continue;
>+
>+			/* should this be cleared everytime */
>+			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
>+
>+			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
>+				goto err;
>+			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
>+				goto err;
>+			if (counter) {
>+				val = get_counter_value(xe, config);
>+				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD))
>+					goto err;
>+			}
>+		}
>+	}
>+
>+	nla_nest_end(new_msg, entry_attr);
>+
> 	return 0;
>+err:
>+	drm_dbg_driver(&xe->drm, "msg buff is small\n");
>+	nla_nest_cancel(new_msg, entry_attr);
>+	nlmsg_free(new_msg);
>+
>+	return -EMSGSIZE;
>+}
>+
>+static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>+{
>+	struct xe_device *xe = to_xe_device(drm);
>+	size_t msg_size = NLMSG_DEFAULT_SIZE;
>+	struct sk_buff *new_msg;
>+	int retries = 2;
>+	void *usrhdr;
>+	int ret = 0;
>+
>+	if (!IS_DGFX(xe))
>+		return -ENODEV;
>+
>+	do {
>+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>+		if (!new_msg)
>+			return -ENOMEM;
>+
>+		ret = fill_error_details(xe, info, new_msg);
>+		if (!ret)
>+			break;
>+
>+		msg_size += NLMSG_DEFAULT_SIZE;
>+	} while (retries--);
>+
>+	if (!ret)
>+		ret = drm_genl_reply(new_msg, info, usrhdr);
>+
>+	return ret;
> }
>
> static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> {
>-	return 0;
>+	struct xe_device *xe = to_xe_device(drm);
>+	size_t msg_size = NLMSG_DEFAULT_SIZE;
>+	struct sk_buff *new_msg;
>+	void *usrhdr;
>+	int ret = 0;
>+	int retries = 2;
>+	u64 config, val;
>+
>+	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
>+	ret = config_status(xe, config);
>+	if (ret)
>+		return ret;
>+	do {
>+		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>+		if (!new_msg)
>+			return -ENOMEM;
>+
>+		val = get_counter_value(xe, config);
>+		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
>+			msg_size += NLMSG_DEFAULT_SIZE;
>+			continue;
>+		}
>+
>+		break;
>+	} while (retries--);
>+
>+	ret = drm_genl_reply(new_msg, info, usrhdr);
>+
>+	return ret;
> }
>
> /* driver callbacks to DRM netlink commands*/
>diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>index 60cc6418d9a7..dbb3f1afba5f 100644
>--- a/include/uapi/drm/xe_drm.h
>+++ b/include/uapi/drm/xe_drm.h
>@@ -1087,6 +1087,87 @@ struct drm_xe_vm_madvise {
> #define XE_PMU_MEDIA_GROUP_BUSY(gt)		___XE_PMU_OTHER(gt, 3)
> #define XE_PMU_ANY_ENGINE_GROUP_BUSY(gt)	___XE_PMU_OTHER(gt, 4)
>
>+/**
>+ * DOC: XE GENL netlink event IDs
>+ * TODO: Add more details
>+ */
>+#define XE_HW_ERROR(gt, id) \
>+	((id) | ((__u64)(gt) << __XE_PMU_GT_SHIFT))
>+
>+#define XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_GUC		(1)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_SLM		(3)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
>+#define XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
>+#define XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
>+#define XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
>+#define XE_GENL_GT_ERROR_FATAL_GUC			(9)
>+#define XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
>+#define XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
>+#define XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
>+#define XE_GENL_GT_ERROR_FATAL_SLM			(13)
>+#define XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
>+#define XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
>+#define XE_GENL_GT_ERROR_FATAL_FPU			(16)
>+#define XE_GENL_GT_ERROR_FATAL_TLB			(17)
>+#define XE_GENL_GT_ERROR_FATAL_L3_FABRIC		(18)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
>+#define XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
>+#define XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
>+#define XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
>+#define XE_GENL_SGUNIT_ERROR_CORRECTABLE		(23)
>+#define XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
>+#define XE_GENL_SGUNIT_ERROR_FATAL			(25)
>+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
>+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
>+#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
>+#define XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
>+#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
>+#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
>+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
>+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
>+#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
>+#define XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
>+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD			(36)
>+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP			(37)
>+#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ			(38)
>+#define XE_GENL_SOC_ERROR_FATAL_ANR_MDFI		(39)
>+#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2T		(40)
>+#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2C		(41)
>+#define XE_GENL_SOC_ERROR_FATAL_PCIE_AER		(42)
>+#define XE_GENL_SOC_ERROR_FATAL_PCIE_ERR		(43)
>+#define XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
>+#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
>+
>+#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
>+		(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
>+#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
>+		(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
>+
>+/* 109 is the last ID used by SOC errors */
>+#define XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
>+#define XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
>+#define XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
>+#define XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
>+#define XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
>+#define XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
>+#define XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
>+#define XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
>+#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
>+#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
>+#define XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
>+#define XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY	(121)
>+#define XE_GENL_SGGI_ERROR_NONFATAL			(122)
>+#define XE_GENL_SGLI_ERROR_NONFATAL			(123)
>+#define XE_GENL_SGCI_ERROR_NONFATAL			(124)
>+#define XE_GENL_MERT_ERROR_NONFATAL			(125)
>+#define XE_GENL_SGGI_ERROR_FATAL			(126)
>+#define XE_GENL_SGLI_ERROR_FATAL			(127)
>+#define XE_GENL_SGCI_ERROR_FATAL			(128)
>+#define XE_GENL_MERT_ERROR_FATAL			(129)
>+
> #if defined(__cplusplus)
> }
> #endif
>--
>2.25.1
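
For anyone checking the ID layout in the header above, here is a minimal userspace sketch (not part of the patch) that redoes the HBM arithmetic and the XE_HW_ERROR() packing. The shift value is an assumption: the patch only names __XE_PMU_GT_SHIFT, and 60 is inferred from the gt1 config-ids (0x1000000000000001) shown in the drm_ras output in the cover letter. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for __XE_PMU_GT_SHIFT, inferred from the tool output. */
#define SKETCH_GT_SHIFT 60

/* XE_GENL_SOC_ERROR_FATAL_SERR_SRCS from the patch */
#define SKETCH_SERR_SRCS 45

/* Same arithmetic as XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n) */
static inline uint64_t nonfatal_hbm(unsigned int ss, unsigned int n)
{
	return SKETCH_SERR_SRCS + 0x1 + ss * 0x10 + n;
}

/* Same arithmetic as XE_GENL_SOC_ERROR_FATAL_HBM(ss, n) */
static inline uint64_t fatal_hbm(unsigned int ss, unsigned int n)
{
	return nonfatal_hbm(1, 15) + 0x1 + ss * 0x10 + n;
}

/* Same packing as XE_HW_ERROR(gt, id): GT index in the top bits */
static inline uint64_t hw_error(unsigned int gt, uint64_t id)
{
	assert(gt < 16); /* only 4 bits remain above bit 60 */
	return id | ((uint64_t)gt << SKETCH_GT_SHIFT);
}

/* Split a config-id back into its GT index and per-GT error ID */
static inline unsigned int error_gt(uint64_t config)
{
	return (unsigned int)(config >> SKETCH_GT_SHIFT);
}

static inline uint64_t error_id(uint64_t config)
{
	return config & ((1ULL << SKETCH_GT_SHIFT) - 1);
}
```

With ss in {0, 1} and n in 0..15 (the ranges implied by the NONFATAL_HBM(1, 15) anchor), the non-fatal HBM IDs cover 46..77 and the fatal ones 78..109, which agrees with the "109 is the last ID used by SOC errors" comment.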



* RE: [RFC 4/5] drm/netlink: Define multicast groups
  2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
@ 2023-10-20 20:39   ` Ruhl, Michael J
  0 siblings, 0 replies; 31+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:39 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC 4/5] drm/netlink: Define multicast groups
>
>The netlink subsystem supports event notifications to userspace. We define
>two multicast groups, one for correctable and one for uncorrectable errors,
>to which userspace can subscribe and be notified when any of those errors
>happen. The group names are local to the driver's genl netlink family.

Hi Aravind,

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M

>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/drm_netlink.c  | 7 +++++++
> include/drm/drm_netlink.h      | 5 +++++
> include/uapi/drm/drm_netlink.h | 4 ++++
> 3 files changed, 16 insertions(+)
>
>diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>index 8add249c1da3..425a7355a573 100644
>--- a/drivers/gpu/drm/drm_netlink.c
>+++ b/drivers/gpu/drm/drm_netlink.c
>@@ -12,6 +12,11 @@
>
> DEFINE_XARRAY(drm_dev_xarray);
>
>+static const struct genl_multicast_group drm_event_mcgrps[] = {
>+	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
>+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
>+};
>+
> /**
>  * drm_genl_reply - response to a request
>  * @msg: socket buffer
>@@ -133,6 +138,8 @@ static void drm_genl_family_init(struct drm_device *dev)
> 	dev->drm_genl_family.ops = drm_genl_ops;
> 	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
> 	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>+	dev->drm_genl_family.mcgrps = drm_event_mcgrps;
>+	dev->drm_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
> 	dev->drm_genl_family.module = dev->dev->driver->owner;
> }
>
>diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>index 54527dae7847..758239643c17 100644
>--- a/include/drm/drm_netlink.h
>+++ b/include/drm/drm_netlink.h
>@@ -13,6 +13,11 @@
>
> struct drm_device;
>
>+enum mcgrps_events {
>+	DRM_GENL_MCAST_CORR_ERR,
>+	DRM_GENL_MCAST_UNCORR_ERR,
>+};
>+
> struct driver_genl_ops {
> 	int		       (*doit)(struct drm_device *dev,
> 				       struct sk_buff *skb,
>diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>index aab42147a20e..c7a0ce5b4624 100644
>--- a/include/uapi/drm/drm_netlink.h
>+++ b/include/uapi/drm/drm_netlink.h
>@@ -26,6 +26,8 @@
> #define _DRM_NETLINK_H_
>
> #define DRM_GENL_VERSION 1
>+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
>+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
>
> #if defined(__cplusplus)
> extern "C" {
>@@ -43,6 +45,8 @@ enum drm_genl_error_cmds {
> 	DRM_RAS_CMD_READ_ONE,
> 	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
> 	DRM_RAS_CMD_READ_ALL,
>+	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */
>+	DRM_RAS_CMD_ERROR_EVENT,
>
> 	__DRM_CMD_MAX,
> 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>--
>2.25.1
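
Since the group names are local to each device's family, a listener first has to resolve that family through the generic netlink controller. Below is a minimal sketch, assuming the family is named after the primary node (e.g. "card0", as in patch 1/5), that only builds the CTRL_CMD_GETFAMILY request. The function name is hypothetical; sending the request, parsing CTRL_ATTR_MCAST_GROUPS from the reply, and joining a group with setsockopt(NETLINK_ADD_MEMBERSHIP) are left out.

```c
#include <assert.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/*
 * Fill @buf with a CTRL_CMD_GETFAMILY request for @family.
 * @buf must be suitably aligned for struct nlmsghdr.
 * Returns the total message length, or 0 if the name or buffer
 * does not fit.
 */
static size_t build_getfamily_req(void *buf, size_t buflen,
				  const char *family, __u32 seq)
{
	size_t name_len = strlen(family) + 1;	/* include the NUL */
	size_t total = NLMSG_LENGTH(GENL_HDRLEN) +
		       NLA_ALIGN(NLA_HDRLEN + name_len);
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gh;
	struct nlattr *attr;

	if (name_len > GENL_NAMSIZ || buflen < total)
		return 0;

	memset(buf, 0, total);

	/* netlink header: addressed to the genl controller family */
	nlh->nlmsg_len = (__u32)total;
	nlh->nlmsg_type = GENL_ID_CTRL;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	/* generic netlink header */
	gh = NLMSG_DATA(nlh);
	gh->cmd = CTRL_CMD_GETFAMILY;
	gh->version = 1;

	/* single attribute: the family name to look up */
	attr = (struct nlattr *)((char *)gh + GENL_HDRLEN);
	attr->nla_type = CTRL_ATTR_FAMILY_NAME;
	attr->nla_len = (__u16)(NLA_HDRLEN + name_len);
	memcpy((char *)attr + NLA_HDRLEN, family, name_len);

	return total;
}
```

The reply to this request carries the family ID plus the list of multicast groups, from which the IDs for "drm_corr_err" and "drm_uncorr_err" can be picked by name.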



* RE: [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-10-20 15:58 ` [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
@ 2023-10-20 20:40   ` Ruhl, Michael J
  2023-11-10 12:27   ` Tomer Tayar
  1 sibling, 0 replies; 31+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:40 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an
>error
>
>Whenever a correctable or an uncorrectable error happens an event is sent
>to the corresponding listeners of these groups.
>
>v2: Rebase

Hi Aravind,

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M

>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/xe/xe_hw_error.c | 33 ++++++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)
>
>diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>index bab6d4cf0b69..b0befb5e01cb 100644
>--- a/drivers/gpu/drm/xe/xe_hw_error.c
>+++ b/drivers/gpu/drm/xe/xe_hw_error.c
>@@ -786,6 +786,37 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> 				(HARDWARE_ERROR_MAX << 1) + 1);
> }
>
>+static void
>+generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
>+{
>+	struct sk_buff *msg;
>+	void *hdr;
>+
>+	if (!xe->drm.drm_genl_family.module)
>+		return;
>+
>+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
>+	if (!msg) {
>+		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
>+		return;
>+	}
>+
>+	hdr = genlmsg_put(msg, 0, 0, &xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
>+	if (!hdr) {
>+		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
>+		nlmsg_free(msg);
>+		return;
>+	}
>+
>+	genlmsg_end(msg, hdr);
>+
>+	genlmsg_multicast(&xe->drm.drm_genl_family, msg, 0,
>+			  hw_err ?
>+			  DRM_GENL_MCAST_UNCORR_ERR
>+			  : DRM_GENL_MCAST_CORR_ERR,
>+			  GFP_ATOMIC);
>+}
>+
> static void
> xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
> {
>@@ -849,6 +880,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
> 	}
>
> 	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err), errsrc);
>+
>+	generate_netlink_event(tile_to_xe(tile), hw_err);
> unlock:
> 	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
> }
>--
>2.25.1
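
The `hw_err ? UNCORR : CORR` selection in generate_netlink_event() relies on the correctable severity being the zero value of enum hardware_error. A small sketch that spells the mapping out; the enum layout here is an assumption (the enum itself is not shown in this thread), and the helper names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed layout of the xe driver's severity enum; not shown in this thread. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Group indices as defined in include/drm/drm_netlink.h by patch 4/5 */
enum mcgrps_events {
	DRM_GENL_MCAST_CORR_ERR,
	DRM_GENL_MCAST_UNCORR_ERR,
};

/* Explicit version of the ternary used when sending the multicast event */
static enum mcgrps_events error_to_mcgrp(enum hardware_error hw_err)
{
	switch (hw_err) {
	case HARDWARE_ERROR_CORRECTABLE:
		return DRM_GENL_MCAST_CORR_ERR;
	case HARDWARE_ERROR_NONFATAL:
	case HARDWARE_ERROR_FATAL:
		return DRM_GENL_MCAST_UNCORR_ERR;
	}
	assert(!"unknown hardware_error value");
	return DRM_GENL_MCAST_UNCORR_ERR;
}

/* Group names as defined in include/uapi/drm/drm_netlink.h by patch 4/5 */
static const char *mcgrp_name(enum mcgrps_events grp)
{
	return grp == DRM_GENL_MCAST_CORR_ERR ? "drm_corr_err"
					      : "drm_uncorr_err";
}
```

Under that assumed layout, both non-fatal and fatal errors land in the "drm_uncorr_err" group, so a listener that cares about the difference has to read the counters after being notified.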



* Re: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-10-20 20:36   ` Ruhl, Michael J
@ 2023-10-21  1:10     ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-21  1:10 UTC (permalink / raw)
  To: Ruhl, Michael J, intel-xe, dri-devel, alexander.deucher, airlied,
	daniel, joonas.lahtinen, ogabbay, Tayar, Tomer (Habana),
	Hawking.Zhang, Harish.Kasiviswanathan, Felix.Kuehling,
	Luben.Tuikov


On 21/10/23 02:06, Ruhl, Michael J wrote:
>> -----Original Message-----
>> From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> Sent: Friday, October 20, 2023 11:59 AM
>> To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>> alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>> joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
>> <ttayar@habana.ai>; Hawking.Zhang@amd.com;
>> Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>> Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>> Subject: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
>>
>> Define the netlink registration interface, commands, and attributes that
>> can be used in common across drm drivers. This patch intends to use
>> the generic netlink family to expose various stats of the device. At present
>> it defines some commands that shall be used to expose RAS error counters.
>>
>> v2:
>> define common interfaces to genl netlink subsystem that all drm drivers
>> can leverage.(Tomer Tayar)
>>
>> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem (Daniel Vetter)
>>
>> v4:(Michael J. Ruhl)
>> 1. rename drm_genl_send to drm_genl_reply
>> 2. catch error from xa_store and handle appropriately
> Hi Aravind,
>
> This looks reasonable to me.
>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

Hi Mike,

Thanks a lot for your reviews and r-b.

Regards,
Aravind.
>
> M
>
>> Cc: Tomer Tayar <ttayar@habana.ai>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Michael J. Ruhl <michael.j.ruhl@intel.com>
>>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>> ---
>> drivers/gpu/drm/Makefile       |   1 +
>> drivers/gpu/drm/drm_drv.c      |   7 ++
>> drivers/gpu/drm/drm_netlink.c  | 188 +++++++++++++++++++++++++++++++++
>> include/drm/drm_device.h       |   8 ++
>> include/drm/drm_drv.h          |   7 ++
>> include/drm/drm_netlink.h      |  30 ++++++
>> include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
>> 7 files changed, 324 insertions(+)
>> create mode 100644 drivers/gpu/drm/drm_netlink.c
>> create mode 100644 include/drm/drm_netlink.h
>> create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index ee64c51274ad..60864369adaa 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -35,6 +35,7 @@ drm-y := \
>> 	drm_mode_object.o \
>> 	drm_modes.o \
>> 	drm_modeset_lock.o \
>> +	drm_netlink.o \
>> 	drm_plane.o \
>> 	drm_prime.o \
>> 	drm_print.o \
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 535f16e7882e..31f55c1c7524 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>> 	if (ret)
>> 		goto err_minors;
>>
>> +	if (driver->genl_ops) {
>> +		ret = drm_genl_register(dev);
>> +		if (ret)
>> +			goto err_minors;
>> +	}
>> +
>> 	ret = create_compat_control_link(dev);
>> 	if (ret)
>> 		goto err_minors;
>> @@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
>> {
>> 	drm_privacy_screen_lookup_exit();
>> 	accel_core_exit();
>> +	drm_genl_exit();
>> 	unregister_chrdev(DRM_MAJOR, "drm");
>> 	debugfs_remove(drm_debugfs_root);
>> 	drm_sysfs_destroy();
>> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>> new file mode 100644
>> index 000000000000..8add249c1da3
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_netlink.c
>> @@ -0,0 +1,188 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_netlink.h>
>> +#include <drm/drm_print.h>
>> +
>> +DEFINE_XARRAY(drm_dev_xarray);
>> +
>> +/**
>> + * drm_genl_reply - response to a request
>> + * @msg: socket buffer
>> + * @info: receiver information
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
>> +{
>> +	int ret;
>> +
>> +	genlmsg_end(msg, usrhdr);
>> +
>> +	ret = genlmsg_reply(msg, info);
>> +	if (ret)
>> +		nlmsg_free(msg);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(drm_genl_reply);
>> +
>> +/**
>> + * drm_genl_alloc_msg - allocate genl message buffer
>> + * @dev: drm_device for which the message is being allocated
>> + * @info: receiver information
>> + * @msg_size: payload size of the message to allocate
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * pointer to new allocated buffer on success, NULL on failure
>> + */
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr)
>> +{
>> +	struct sk_buff *new_msg;
>> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
>> +	if (!new_msg)
>> +		return new_msg;
>> +
>> +	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
>> +	if (!*usrhdr) {
>> +		nlmsg_free(new_msg);
>> +		new_msg = NULL;
>> +	}
>> +
>> +	return new_msg;
>> +}
>> +EXPORT_SYMBOL(drm_genl_alloc_msg);
>> +
>> +static struct drm_device *genl_to_dev(struct genl_info *info)
>> +{
>> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
>> +}
>> +
>> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +/* attribute policies */
>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
>> +};
>> +
>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
>> +};
>> +
>> +/* drm genl operations definition */
>> +const struct genl_ops drm_genl_ops[] = {
>> +	{
>> +		.cmd = DRM_RAS_CMD_QUERY,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ONE,
>> +		.doit = drm_genl_read_error,
>> +		.policy = drm_attr_policy_read_one,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ALL,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +};
>> +
>> +static void drm_genl_family_init(struct drm_device *dev)
>> +{
>> +	/* Use drm primary node name eg: card0 to name the genl family */
>> +	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);
>> +	dev->drm_genl_family.version = DRM_GENL_VERSION;
>> +	dev->drm_genl_family.parallel_ops = true;
>> +	dev->drm_genl_family.ops = drm_genl_ops;
>> +	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
>> +	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>> +	dev->drm_genl_family.module = dev->dev->driver->owner;
>> +}
>> +
>> +static void drm_genl_deregister(struct drm_device *dev,  void *arg)
>> +{
>> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
>> +
>> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
>> +
>> +	genl_unregister_family(&dev->drm_genl_family);
>> +}
>> +
>> +/**
>> + * drm_genl_register - Register genl family
>> + * @dev: drm_device for which genl family needs to be registered
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_register(struct drm_device *dev)
>> +{
>> +	int ret;
>> +
>> +	drm_genl_family_init(dev);
>> +
>> +	ret = genl_register_family(&dev->drm_genl_family);
>> +	if (ret < 0) {
>> +		drm_warn(dev, "genl family registration failed\n");
>> +		return ret;
>> +	}
>> +
>> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
>> +
>> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
>> +	if (ret)
>> +		goto genl_unregister;
>> +
>> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
>> +
>> +	return ret;
>> +
>> +genl_unregister:
>> +	genl_unregister_family(&dev->drm_genl_family);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * drm_genl_exit: destroy drm_dev_xarray
>> + */
>> +void drm_genl_exit(void)
>> +{
>> +	xa_destroy(&drm_dev_xarray);
>> +}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index c490977ee250..d3ae91b7714d 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -8,6 +8,7 @@
>>
>> #include <drm/drm_legacy.h>
>> #include <drm/drm_mode_config.h>
>> +#include <drm/drm_netlink.h>
>>
>> struct drm_driver;
>> struct drm_minor;
>> @@ -318,6 +319,13 @@ struct drm_device {
>> 	 */
>> 	struct dentry *debugfs_root;
>>
>> +	/**
>> +	 * @drm_genl_family:
>> +	 *
>> +	 * Generic netlink family registration structure.
>> +	 */
>> +	struct genl_family drm_genl_family;
>> +
>> 	/* Everything below here is for legacy driver, never use! */
>> 	/* private: */
>> #if IS_ENABLED(CONFIG_DRM_LEGACY)
>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>> index e2640dc64e08..ebdb7850d235 100644
>> --- a/include/drm/drm_drv.h
>> +++ b/include/drm/drm_drv.h
>> @@ -434,6 +434,13 @@ struct drm_driver {
>> 	 */
>> 	const struct file_operations *fops;
>>
>> +	/**
>> +	 * @genl_ops:
>> +	 *
>> +	 * Drivers private callback to genl commands
>> +	 */
>> +	const struct driver_genl_ops *genl_ops;
>> +
>> #ifdef CONFIG_DRM_LEGACY
>> 	/* Everything below here is for legacy driver, never use! */
>> 	/* private: */
>> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..54527dae7847
>> --- /dev/null
>> +++ b/include/drm/drm_netlink.h
>> @@ -0,0 +1,30 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_NETLINK_H__
>> +#define __DRM_NETLINK_H__
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/genetlink.h>
>> +#include <net/sock.h>
>> +#include <uapi/drm/drm_netlink.h>
>> +
>> +struct drm_device;
>> +
>> +struct driver_genl_ops {
>> +	int		       (*doit)(struct drm_device *dev,
>> +				       struct sk_buff *skb,
>> +				       struct genl_info *info);
>> +};
>> +
>> +int drm_genl_register(struct drm_device *dev);
>> +void drm_genl_exit(void);
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr);
>> +#endif
>> +
>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..aab42147a20e
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_netlink.h
>> @@ -0,0 +1,83 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright 2023 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
>> + */
>> +
>> +#ifndef _DRM_NETLINK_H_
>> +#define _DRM_NETLINK_H_
>> +
>> +#define DRM_GENL_VERSION 1
>> +
>> +#if defined(__cplusplus)
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * enum drm_genl_error_cmds - Supported error commands
>> + *
>> + */
>> +enum drm_genl_error_cmds {
>> +	DRM_CMD_UNSPEC,
>> +	/** @DRM_RAS_CMD_QUERY: Command to list all error names with config-id */
>> +	DRM_RAS_CMD_QUERY,
>> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
>> +	DRM_RAS_CMD_READ_ONE,
>> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
>> +	DRM_RAS_CMD_READ_ALL,
>> +
>> +	__DRM_CMD_MAX,
>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>> +};
>> +
>> +/**
>> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
>> + *
>> + */
>> +enum drm_error_attr {
>> +	DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>> +	/**
>> +	 * @DRM_RAS_ATTR_REQUEST: Should be used with DRM_RAS_CMD_QUERY,
>> +	 * DRM_RAS_CMD_READ_ALL
>> +	 */
>> +	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
>> +	/**
>> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
>> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
>> +	 */
>> +	DRM_RAS_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
>> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
>> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
>> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
>> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
>> +
>> +	__DRM_ATTR_MAX,
>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>> +};
>> +
>> +#if defined(__cplusplus)
>> +}
>> +#endif
>> +
>> +#endif
>> --
>> 2.25.1
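
To make the READ_ONE path above concrete from the userspace side, here is a minimal sketch that builds the request message by hand. It assumes the per-device family ID has already been resolved (genl_to_dev() looks the device up by nlmsg_type, i.e. the family ID). The enum values are copied from the uapi header in this patch; the function name is hypothetical, and sending the message and parsing the reply are left out.

```c
#include <assert.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/* Values as they fall out of the uapi enums in this patch */
#define DRM_GENL_VERSION 1
enum { DRM_RAS_CMD_QUERY = 1, DRM_RAS_CMD_READ_ONE, DRM_RAS_CMD_READ_ALL };
enum {
	DRM_RAS_ATTR_REQUEST = 1,
	DRM_RAS_ATTR_QUERY_REPLY,
	DRM_RAS_ATTR_ERROR_NAME,
	DRM_RAS_ATTR_ERROR_ID,
	DRM_RAS_ATTR_ERROR_VALUE,
};

/*
 * Fill @buf with a DRM_RAS_CMD_READ_ONE request for @error_id.
 * @buf must be suitably aligned for struct nlmsghdr.
 * Returns the total message length, or 0 if @buf is too small.
 */
static size_t build_read_one_req(void *buf, size_t buflen,
				 __u16 family_id, __u64 error_id)
{
	const size_t total = NLMSG_LENGTH(GENL_HDRLEN) +
			     NLA_ALIGN(NLA_HDRLEN + sizeof(error_id));
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *gh;
	struct nlattr *attr;

	if (buflen < total)
		return 0;

	memset(buf, 0, total);

	/* nlmsg_type carries the family ID, which selects the drm device */
	nlh->nlmsg_len = (__u32)total;
	nlh->nlmsg_type = family_id;
	nlh->nlmsg_flags = NLM_F_REQUEST;

	gh = NLMSG_DATA(nlh);
	gh->cmd = DRM_RAS_CMD_READ_ONE;
	gh->version = DRM_GENL_VERSION;

	/* DRM_RAS_ATTR_ERROR_ID is NLA_U64 per drm_attr_policy_read_one */
	attr = (struct nlattr *)((char *)gh + GENL_HDRLEN);
	attr->nla_type = DRM_RAS_ATTR_ERROR_ID;
	attr->nla_len = (__u16)(NLA_HDRLEN + sizeof(error_id));
	memcpy((char *)attr + NLA_HDRLEN, &error_id, sizeof(error_id));

	return total;
}
```

A request without DRM_RAS_ATTR_ERROR_ID would be rejected with -EINVAL by the GENL_REQ_ATTR_CHECK() in drm_genl_read_error().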


* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (4 preceding siblings ...)
  2023-10-20 15:58 ` [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
@ 2023-10-23 15:29 ` Alex Deucher
  2023-10-24  8:59   ` Zhang, Hawking
  2023-10-26 10:04   ` Lazar, Lijo
  2023-11-10 12:23 ` Tomer Tayar
  6 siblings, 2 replies; 31+ messages in thread
From: Alex Deucher @ 2023-10-23 15:29 UTC (permalink / raw)
  To: Aravind Iddamsetty, Lazar, Lijo
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang

On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
<aravind.iddamsetty@linux.intel.com> wrote:
>
> Our hardware supports RAS (Reliability, Availability, Serviceability) by
> reporting errors to the host, which the KMD processes and exposes as a
> set of error counters that observability tools can use to take
> corrective actions or repairs. Traditionally these were exposed
> via PMU (for relative counters) and a sysfs interface (for absolute
> values) in our internal branch. But, because that approach needs
> two interfaces and offers neither event-based reporting nor
> configurability, the community suggested trying netlink as a
> drm-subsystem-wide UAPI for RAS and telemetry, as discussed in [1].
>
> This [1] is the inspiration for this series. It uses the generic
> netlink (genl) family subsystem and exposes a set of commands that can
> be used by every drm driver; the framework provides a means to have
> custom commands too. Each drm driver instance (in this example, an xe
> driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe
> and be notified when any error occurs, then read the error counter,
> which avoids continuous polling of the counters. This can also be
> extended to threshold-based notification.

@Hawking Zhang, @Lazar, Lijo

Can you take a look at this series and API and see if it would align
with our RAS requirements going forward?

Alex


>
> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> This series is on top of https://patchwork.freedesktop.org/series/125373/.
>
> v4:
> 1. Rebase
> 2. rename drm_genl_send to drm_genl_reply
> 3. catch error from xa_store and handle appropriately
> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
> having an early check of IS_DGFX (Michael J. Ruhl)
>
> v3:
> 1. Rebase on latest RAS series for XE
> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem
>
> v2: define common interfaces to genl netlink subsystem that all drm drivers
> can leverage.
>
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject
> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> https://patchwork.freedesktop.org/series/118437/#rev2
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
>
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>
> wait on an error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
>
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
> Cc: Tomer Tayar <ttayar@habana.ai>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>
>
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register netlink capability
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: Define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
>
>  drivers/gpu/drm/Makefile             |   1 +
>  drivers/gpu/drm/drm_drv.c            |   7 +
>  drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>  drivers/gpu/drm/xe/Makefile          |   1 +
>  drivers/gpu/drm/xe/xe_device.c       |   4 +
>  drivers/gpu/drm/xe/xe_device_types.h |   1 +
>  drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>  drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>  include/drm/drm_device.h             |   8 +
>  include/drm/drm_drv.h                |   7 +
>  include/drm/drm_netlink.h            |  35 ++
>  include/uapi/drm/drm_netlink.h       |  87 +++++
>  include/uapi/drm/xe_drm.h            |  81 +++++
>  13 files changed, 977 insertions(+)
>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 include/drm/drm_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
> --
> 2.25.1
>
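
In the listings above, each gt1 config-id differs from its gt0 counterpart only in the top hex digit (for example 0x0000000000000025 and 0x1000000000000025). A minimal sketch of how a userspace tool could split such an id, assuming the GT index sits in the top 4 bits; that layout is inferred from the listing alone, and the macro and helper names are hypothetical rather than taken from the series:

```c
#include <stdint.h>

/*
 * Assumption: the GT index occupies bits 63:60 of the config-id and the
 * remaining bits hold the per-GT error id. This is inferred from the
 * drm_ras output only and is not confirmed by the posted UAPI headers.
 */
#define CFG_GT_SHIFT	60
#define CFG_ERR_MASK	((UINT64_C(1) << CFG_GT_SHIFT) - 1)

/* Hypothetical helper: extract the GT index from a config-id. */
static inline unsigned int cfg_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> CFG_GT_SHIFT);
}

/* Hypothetical helper: extract the per-GT error id from a config-id. */
static inline uint64_t cfg_error(uint64_t config_id)
{
	return config_id & CFG_ERR_MASK;
}
```

Under that assumption, error-gt1-soc-fatal-hbm-ss0-0 (0x1000000000000025) splits into GT 1 and error id 0x25, the same error id that error-gt0-soc-fatal-hbm-ss0-0 has on GT 0.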

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-23 15:29 ` [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Alex Deucher
@ 2023-10-24  8:59   ` Zhang, Hawking
  2023-10-26  9:27     ` Aravind Iddamsetty
  2023-10-26 10:04   ` Lazar, Lijo
  1 sibling, 1 reply; 31+ messages in thread
From: Zhang, Hawking @ 2023-10-24  8:59 UTC (permalink / raw)
  To: Alex Deucher, Aravind Iddamsetty, Lazar, Lijo
  Cc: ogabbay, Kasiviswanathan, Harish, dri-devel, michael.j.ruhl,
	Tuikov, Luben, ttayar, Deucher, Alexander, Kuehling, Felix,
	intel-xe

[AMD Official Use Only - General]

Hi Aravind,

Is it allowed to register multiple genl families per drm_device? Also, is it allowed to customize error type and even error counter (status)?

An SoC might integrate different types of controllers that report errors of different types. The controllers are also capable of converting an error, or changing its severity, in some circumstances. Mixing severity and error type in a single array may not be the best practice. For example, error-gt0-soc-fatal-hbm-ss0-0 might be converted to a non-fatal or deferred error, so the driver doesn't need to respond immediately.
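
To make the concern concrete, here is a rough sketch of counters indexed by error source and severity separately, instead of one flat array whose names bake in the severity. Every name here is hypothetical and nothing is taken from the posted series; it only illustrates the idea of keeping the two dimensions apart:

```c
#include <stdint.h>

/* Hypothetical severity classes, kept independent of the error source. */
enum ras_severity {
	RAS_SEV_CORRECTABLE,
	RAS_SEV_NONFATAL,
	RAS_SEV_DEFERRED,
	RAS_SEV_FATAL,
	RAS_SEV_COUNT,
};

/* Hypothetical error sources; a real driver would have many more. */
enum ras_source {
	RAS_SRC_GUC,
	RAS_SRC_HBM,
	RAS_SRC_GSC,
	RAS_SRC_COUNT,
};

/* One counter per (source, severity) pair. */
struct ras_counters {
	uint64_t count[RAS_SRC_COUNT][RAS_SEV_COUNT];
};

/* Record an error with whatever severity the controller finally reported. */
static inline void ras_record(struct ras_counters *c, enum ras_source src,
			      enum ras_severity sev)
{
	c->count[src][sev]++;
}

/* Read back a single counter. */
static inline uint64_t ras_read(const struct ras_counters *c,
				enum ras_source src, enum ras_severity sev)
{
	return c->count[src][sev];
}
```

With this layout an HBM error that the controller downgrades to deferred is counted under (RAS_SRC_HBM, RAS_SEV_DEFERRED), and no counter name has to claim it is fatal.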

Regards,
Hawking

-----Original Message-----
From: Alex Deucher <alexdeucher@gmail.com>
Sent: Monday, October 23, 2023 23:29
To: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; Lazar, Lijo <Lijo.Lazar@amd.com>
Cc: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; airlied@gmail.com; daniel@ffwll.ch; joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; ttayar@habana.ai; Zhang, Hawking <Hawking.Zhang@amd.com>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; michael.j.ruhl@intel.com
Subject: Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com> wrote:
>
> Our hardware supports RAS (Reliability, Availability, Serviceability)
> by reporting errors to the host, which the KMD processes and
> exposes as a set of error counters that observability tools can use
> to take corrective actions or repairs. Traditionally these were
> exposed via PMU (for relative counters) and a sysfs interface (for
> absolute values) in our internal branch. But because this approach
> needs two interfaces and supports neither event-based reporting
> nor configurability, the community suggested trying netlink as a
> drm-subsystem-wide UAPI for RAS and telemetry, as discussed in [1].
>
> This [1] is the inspiration for this series. It uses the generic
> netlink (genl) family subsystem and exposes a set of commands that can
> be used by every drm driver; the framework also provides a means to
> have custom commands. Each drm driver instance (in this example, the xe
> driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe
> and be notified when any error occurs, then read the error
> counter. This avoids continuous polling of the error counters. This
> can also be extended to threshold-based notification.

@Hawking Zhang, @Lazar, Lijo

Can you take a look at this series and API and see if it would align with our RAS requirements going forward?

Alex


>
> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of
> https://patchwork.freedesktop.org/series/125373/,
>
> v4:
> 1. Rebase
> 2. rename drm_genl_send to drm_genl_reply
> 3. catch error from xa_store and handle appropriately
> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>    having an early check of IS_DGFX (Michael J. Ruhl)
>
> v3:
> 1. Rebase on latest RAS series for XE
> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem
>
> v2: define common interfaces to genl netlink subsystem that all drm
> drivers can leverage.
>
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject "[RFC
> i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> https://patchwork.freedesktop.org/series/118437/#rev2
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
>
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>
> wait on an error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
>
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
> Cc: Tomer Tayar <ttayar@habana.ai>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>
>
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register netlink capability
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: Define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
>
>  drivers/gpu/drm/Makefile             |   1 +
>  drivers/gpu/drm/drm_drv.c            |   7 +
>  drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>  drivers/gpu/drm/xe/Makefile          |   1 +
>  drivers/gpu/drm/xe/xe_device.c       |   4 +
>  drivers/gpu/drm/xe/xe_device_types.h |   1 +
>  drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>  drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>  include/drm/drm_device.h             |   8 +
>  include/drm/drm_drv.h                |   7 +
>  include/drm/drm_netlink.h            |  35 ++
>  include/uapi/drm/drm_netlink.h       |  87 +++++
>  include/uapi/drm/xe_drm.h            |  81 +++++
>  13 files changed, 977 insertions(+)
>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 include/drm/drm_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
> --
> 2.25.1
>
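
A note on the config-id column in the listings quoted above: the gt0 counters all start with 0x0 and the gt1 counters with 0x1 in the top hex digit, so the GT index appears to sit in bits 60 and up, with the per-GT error id in the low bits. The sketch below decodes the values under that assumption. The shift of 60 and the helper names are inferred from the listing for illustration only; they are not taken from the uAPI headers in this series.

```python
# Assumption inferred from the listing: GT index lives in bits 60..63,
# the error id in the remaining low bits. Not taken from the uAPI header.
GT_SHIFT = 60
ERROR_MASK = (1 << GT_SHIFT) - 1


def encode_config_id(gt, error_id):
    """Build a config-id from a GT index and a per-GT error id."""
    return (gt << GT_SHIFT) | (error_id & ERROR_MASK)


def decode_config_id(config_id):
    """Split a config-id into (gt, error_id)."""
    return config_id >> GT_SHIFT, config_id & ERROR_MASK


def parse_listing_line(line):
    """Parse one 'name config-id [counter]' row of the drm_ras output."""
    fields = line.split()
    name = fields[0]
    config_id = int(fields[1], 16)
    # LIST_ERRORS rows have no counter column, READ_ALL rows do.
    counter = int(fields[2]) if len(fields) > 2 else None
    return name, config_id, counter


if __name__ == "__main__":
    row = "error-gt1-fatal-guc    0x1000000000000009      0"
    name, config_id, counter = parse_listing_line(row)
    gt, error_id = decode_config_id(config_id)
    print(name, "gt", gt, "error id", hex(error_id), "counter", counter)
```

If the assumption holds, the same error id can be compared across GTs by masking off the top bits, which is handy when aggregating counters per error type.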

* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-24  8:59   ` Zhang, Hawking
@ 2023-10-26  9:27     ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-26  9:27 UTC (permalink / raw)
  To: Zhang, Hawking, Alex Deucher, Lazar, Lijo
  Cc: ogabbay, Kasiviswanathan, Harish, dri-devel, michael.j.ruhl,
	Tuikov, Luben, ttayar, Deucher, Alexander, Kuehling, Felix,
	intel-xe


On 24/10/23 14:29, Zhang, Hawking wrote:

Hi Hawking,

Thank you for your comment.
> [AMD Official Use Only - General]
>
> Hi Aravind,
>
> Is it allowed to register multiple genl families per drm_device? Also, is it allowed to customize error type and even error counter (status)?

In the present series only one genl family is registered per device. The genl framework shouldn't impose any restriction on registering multiple families as long as the family names are unique, but what would be the purpose of that?

For the second part of the question: IIUC, an error can have different severities; for example, hbm-ss0-0 can be fatal or non-fatal. In that case we could have one entry per severity,
as this series already does for error types that can have different severities. So for hbm-ss0-0 it would enumerate both the error-gt0-soc-fatal-hbm-ss0-0
and error-gt0-soc-nonfatal-hbm-ss0-0 counters, as our HW reports both of these kinds.
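
To make the one-entry-per-severity idea concrete from the userspace side, here is a rough sketch that splits counter names of the form shown in the listings into GT, hardware block and severity, and then groups them so each block shows which severities it enumerates. The helper names are made up for illustration, and the parsing relies only on the naming pattern visible in this thread (exactly one of correctable, nonfatal or fatal somewhere after the gtN token); it is not part of the series or of the drm_ras tool.

```python
from collections import defaultdict

# Severity tokens as they appear in the counter names in this thread.
SEVERITIES = ("correctable", "nonfatal", "fatal")


def split_counter_name(name):
    """Split e.g. 'error-gt0-soc-fatal-hbm-ss0-0' into (gt, block, severity)."""
    tokens = name.split("-")
    if len(tokens) < 4 or tokens[0] != "error" or not tokens[1].startswith("gt"):
        raise ValueError(f"unexpected counter name: {name!r}")
    gt = int(tokens[1][2:])
    rest = tokens[2:]
    found = [t for t in rest if t in SEVERITIES]
    if len(found) != 1:
        raise ValueError(f"expected exactly one severity token in {name!r}")
    severity = found[0]
    # Everything except the severity token identifies the hardware block.
    block = "-".join(t for t in rest if t != severity)
    return gt, block, severity


def group_by_block(names):
    """Map (gt, block) to the set of severities enumerated for it."""
    groups = defaultdict(set)
    for name in names:
        gt, block, severity = split_counter_name(name)
        groups[(gt, block)].add(severity)
    return dict(groups)


if __name__ == "__main__":
    names = [
        "error-gt0-soc-fatal-hbm-ss0-0",
        "error-gt0-soc-nonfatal-hbm-ss0-0",
        "error-gt0-sgunit-correctable",
    ]
    for key, severities in sorted(group_by_block(names).items()):
        print(key, sorted(severities))
```

A tool that wants severity and error type kept apart could consume the flat list this way without any change on the kernel side.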

Also, to highlight: error management is left to the driver. drm_netlink doesn't handle any of it; it just reports whatever the driver exposes.

Please let me know if I didn't get your question right.

Thanks,
Aravind.

>
> An SoC might integrate different types of controllers that report errors of different types. Also, the controllers are capable of converting an error, or changing its severity, in some circumstances. Mixing severity and error type in a single array may not be the best practice. For example, error-gt0-soc-fatal-hbm-ss0-0 might be converted to a non-fatal or deferred error, so the driver doesn't need to respond immediately.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: Monday, October 23, 2023 23:29
> To: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; Lazar, Lijo <Lijo.Lazar@amd.com>
> Cc: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; airlied@gmail.com; daniel@ffwll.ch; joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; ttayar@habana.ai; Zhang, Hawking <Hawking.Zhang@amd.com>; Kasiviswanathan, Harish <Harish.Kasiviswanathan@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; michael.j.ruhl@intel.com
> Subject: Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
>
> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com> wrote:
>> Our hardware supports RAS (Reliability, Availability, Serviceability)
>> by reporting errors to the host, which the KMD processes and
>> exposes as a set of error counters that observability tools can use
>> to take corrective actions or repairs. Traditionally these were
>> exposed via PMU (for relative counters) and a sysfs interface (for
>> absolute values) in our internal branch. But, due to the limitations
>> of this approach (it needs two interfaces and offers neither event
>> based reporting nor configurability), an alternative approach using
>> netlink was suggested by the community as a drm subsystem wide UAPI
>> for RAS and telemetry, as discussed in [1].
>>
>> This [1] is the inspiration for this series. It uses the generic
>> netlink (genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver; the framework provides a means to have
>> custom commands too. Each drm driver instance (in this example, the xe
>> driver instance) registers a family and operations with the genl
>> subsystem, through which it enumerates and reports the error counters.
>> Event based notification is also supported: userspace can subscribe
>> and be notified when any error occurs, then read the error counter.
>> This avoids continuous polling on the error counter. This can also
>> be extended to threshold based notification.
> @Hawking Zhang, @Lazar, Lijo
>
> Can you take a look at this series and API and see if it would align with our RAS requirements going forward?
>
> Alex
>
>
>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of
>> https://patchwork.freedesktop.org/series/125373/,
>>
>> v4:
>> 1. Rebase
>> 2. rename drm_genl_send to drm_genl_reply
>> 3. catch error from xa_store and handle appropriately
>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>    having an early check of IS_DGFX (Michael J. Ruhl)
>>
>> v3:
>> 1. Rebase on latest RAS series for XE
>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem
>>
>> v2: define common interfaces to genl netlink subsystem that all drm
>> drivers can leverage.
>>
>> Below is an example tool drm_ras which demonstrates the use of the
>> supported commands. The tool will be sent to ML with the subject "[RFC
>> i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>> https://patchwork.freedesktop.org/series/118437/#rev2
>>
>> read single error counter:
>>
>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>> counter value 0
>>
>> read all error counters:
>>
>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>> name                                                    config-id               counter
>>
>> error-gt0-correctable-guc                               0x0000000000000001      0
>> error-gt0-correctable-slm                               0x0000000000000003      0
>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>> error-gt0-fatal-guc                                     0x0000000000000009      0
>> error-gt0-fatal-slm                                     0x000000000000000d      0
>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>> error-gt0-correctable-subslice                          0x0000000000000013      0
>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>> error-gt0-fatal-subslice                                0x0000000000000015      0
>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>> error-gt1-correctable-guc                               0x1000000000000001      0
>> error-gt1-correctable-slm                               0x1000000000000003      0
>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>> error-gt1-fatal-guc                                     0x1000000000000009      0
>> error-gt1-fatal-slm                                     0x100000000000000d      0
>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>> error-gt1-correctable-subslice                          0x1000000000000013      0
>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>> error-gt1-fatal-subslice                                0x1000000000000015      0
>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>
>> wait on a error event:
>>
>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>> waiting for error event
>> error event received
>> counter value 0
>>
>> list all errors:
>>
>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>> name                                                    config-id
>>
>> error-gt0-correctable-guc                               0x0000000000000001
>> error-gt0-correctable-slm                               0x0000000000000003
>> error-gt0-correctable-eu-ic                             0x0000000000000004
>> error-gt0-correctable-eu-grf                            0x0000000000000005
>> error-gt0-fatal-guc                                     0x0000000000000009
>> error-gt0-fatal-slm                                     0x000000000000000d
>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>> error-gt0-fatal-fpu                                     0x0000000000000010
>> error-gt0-fatal-tlb                                     0x0000000000000011
>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>> error-gt0-correctable-subslice                          0x0000000000000013
>> error-gt0-correctable-l3bank                            0x0000000000000014
>> error-gt0-fatal-subslice                                0x0000000000000015
>> error-gt0-fatal-l3bank                                  0x0000000000000016
>> error-gt0-sgunit-correctable                            0x0000000000000017
>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>> error-gt0-sgunit-fatal                                  0x0000000000000019
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>> error-gt0-soc-fatal-punit                               0x000000000000001d
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>> error-gt1-correctable-guc                               0x1000000000000001
>> error-gt1-correctable-slm                               0x1000000000000003
>> error-gt1-correctable-eu-ic                             0x1000000000000004
>> error-gt1-correctable-eu-grf                            0x1000000000000005
>> error-gt1-fatal-guc                                     0x1000000000000009
>> error-gt1-fatal-slm                                     0x100000000000000d
>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>> error-gt1-fatal-fpu                                     0x1000000000000010
>> error-gt1-fatal-tlb                                     0x1000000000000011
>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>> error-gt1-correctable-subslice                          0x1000000000000013
>> error-gt1-correctable-l3bank                            0x1000000000000014
>> error-gt1-fatal-subslice                                0x1000000000000015
>> error-gt1-fatal-l3bank                                  0x1000000000000016
>> error-gt1-sgunit-correctable                            0x1000000000000017
>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>> error-gt1-sgunit-fatal                                  0x1000000000000019
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>> error-gt1-soc-fatal-punit                               0x100000000000001d
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Oded Gabbay <ogabbay@kernel.org>
>> Cc: Tomer Tayar <ttayar@habana.ai>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>
>>
>> Aravind Iddamsetty (5):
>>   drm/netlink: Add netlink infrastructure
>>   drm/xe/RAS: Register netlink capability
>>   drm/xe/RAS: Expose the error counters
>>   drm/netlink: Define multicast groups
>>   drm/xe/RAS: send multicast event on occurrence of an error
>>
>>  drivers/gpu/drm/Makefile             |   1 +
>>  drivers/gpu/drm/drm_drv.c            |   7 +
>>  drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>  drivers/gpu/drm/xe/Makefile          |   1 +
>>  drivers/gpu/drm/xe/xe_device.c       |   4 +
>>  drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>  drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>  drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>  include/drm/drm_device.h             |   8 +
>>  include/drm/drm_drv.h                |   7 +
>>  include/drm/drm_netlink.h            |  35 ++
>>  include/uapi/drm/drm_netlink.h       |  87 +++++
>>  include/uapi/drm/xe_drm.h            |  81 +++++
>>  13 files changed, 977 insertions(+)
>>  create mode 100644 drivers/gpu/drm/drm_netlink.c
>>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>  create mode 100644 include/drm/drm_netlink.h
>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> --
>> 2.25.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-23 15:29 ` [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Alex Deucher
  2023-10-24  8:59   ` Zhang, Hawking
@ 2023-10-26 10:04   ` Lazar, Lijo
  2023-10-30  6:19     ` Aravind Iddamsetty
  1 sibling, 1 reply; 31+ messages in thread
From: Lazar, Lijo @ 2023-10-26 10:04 UTC (permalink / raw)
  To: Alex Deucher, Aravind Iddamsetty
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang



On 10/23/2023 8:59 PM, Alex Deucher wrote:
> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
> <aravind.iddamsetty@linux.intel.com> wrote:
>>
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> reporting the errors to the host, which the KMD processes and exposes a
>> set of error counters which can be used by observability tools to take
>> corrective actions or repairs. Traditionally these were being exposed
>> via PMU (for relative counters) and sysfs interface (for absolute
>> value) in our internal branch. But, due to the limitations in this
>> approach to use two interfaces and also not able to have an event based
>> reporting or configurability, an alternative approach to try netlink
>> was suggested by community for drm subsystem wide UAPI for RAS and
>> telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported, to which userspace can subscribe and
>> be notified when any error occurs and read the error counter; this avoids
>> continuous polling on the error counter. This can also be extended to
>> threshold based notification.

The set of commands seems very limited. In AMD SoCs, the IP blocks, the
instances of IP blocks, and the block types that support RAS change
across generations.

This series has a single command to query the supported counters. Within
that, it seems to assign a unique ID to every combination of error type
and IP block type, and then another for each instance. I am not sure how
well this approach serves an end user. The IDs won't necessarily stay
the same across generations, and users will generally be interested
in specific IP blocks.
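
To make the ID scheme concrete, here is a minimal sketch of how the
config-ids in the READ_ALL listing quoted below appear to be composed.
It assumes, purely from that listing, that the top four bits carry the
GT index and the remaining bits the error id. The macro and function
names are hypothetical; the real layout is whatever the series' uAPI
header defines.

```c
#include <stdint.h>

/* Assumption inferred from the quoted listing: bits 63:60 = GT index. */
#define SKETCH_GT_SHIFT 60
#define SKETCH_ERR_MASK ((UINT64_C(1) << SKETCH_GT_SHIFT) - 1)

/* Extract the GT index from a config-id. */
static inline unsigned int sketch_config_gt(uint64_t config)
{
	return (unsigned int)(config >> SKETCH_GT_SHIFT);
}

/* Extract the per-GT error id from a config-id. */
static inline uint64_t sketch_config_error(uint64_t config)
{
	return config & SKETCH_ERR_MASK;
}

/* Compose a config-id from a GT index and an error id. */
static inline uint64_t sketch_make_config(unsigned int gt, uint64_t error)
{
	return ((uint64_t)gt << SKETCH_GT_SHIFT) | (error & SKETCH_ERR_MASK);
}
```

Under that assumption, error-gt1-soc-fatal-hbm-ss0-0
(0x1000000000000025) splits into GT 1 and error id 0x25. Nothing in the
id itself says "HBM", which is the usability concern here.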

For example, to get HBM errors, it looks like the current patch series
offers two options. One is READ_ALL, which dumps the whole set of
errors. The other is for users to figure out the IDs of the HBM stack
instance errors (the stack capacity can change depending on the SoC, and
multiple configurations can exist within a single family) and then
issue multiple READ_ONE calls. Neither looks good.
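
A rough sketch of what the second option pushes onto userspace: walk the
LIST_ERRORS output, pick out the HBM entries by matching on the name
string, and collect their ids for later READ_ONE calls. The struct and
function names are hypothetical and the netlink transport is left out;
only the name/config-id pairs come from the quoted listing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the LIST_ERRORS output (hypothetical in-memory form). */
struct sketch_error_entry {
	const char *name;
	uint64_t config_id;
};

/*
 * Store the config-ids of all entries whose name contains 'needle'
 * into 'out', up to 'max' ids. Returns the number of ids stored.
 */
static size_t sketch_select_ids(const struct sketch_error_entry *list,
				size_t n, const char *needle,
				uint64_t *out, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n && found < max; i++) {
		if (strstr(list[i].name, needle))
			out[found++] = list[i].config_id;
	}
	return found;
}
```

Calling this with the needle "-hbm-" yields one id per HBM channel, and
each id then needs its own READ_ONE request. The string match also ties
the tool to the naming of one particular driver and generation.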

It would be better if the command argument format were well defined, so
that counters can be queried by IP block type, instance, and supported
error type (CE/UE/fatal/parity/deferred, etc.).
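
As one possible shape for such an argument format, the sketch below
describes a query by block type, instance, and error severity, each
with a wildcard value. All enum, struct, and function names are
hypothetical and are not part of the series; the block and severity
lists are deliberately incomplete.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block types; 0 acts as a wildcard. */
enum sketch_ras_block {
	SKETCH_BLOCK_ANY = 0,
	SKETCH_BLOCK_GUC,
	SKETCH_BLOCK_HBM,
};

/* Hypothetical error severities; 0 acts as a wildcard. */
enum sketch_ras_severity {
	SKETCH_SEV_ANY = 0,
	SKETCH_SEV_CORRECTABLE,
	SKETCH_SEV_NONFATAL,
	SKETCH_SEV_FATAL,
};

#define SKETCH_INSTANCE_ANY UINT32_MAX

struct sketch_ras_counter {
	enum sketch_ras_block block;
	uint32_t instance;
	enum sketch_ras_severity severity;
	uint64_t value;
};

struct sketch_ras_query {
	enum sketch_ras_block block;
	uint32_t instance;
	enum sketch_ras_severity severity;
};

/* True if counter 'c' satisfies every non-wildcard field of 'q'. */
static bool sketch_ras_match(const struct sketch_ras_counter *c,
			     const struct sketch_ras_query *q)
{
	if (q->block != SKETCH_BLOCK_ANY && q->block != c->block)
		return false;
	if (q->instance != SKETCH_INSTANCE_ANY && q->instance != c->instance)
		return false;
	if (q->severity != SKETCH_SEV_ANY && q->severity != c->severity)
		return false;
	return true;
}

/* Sum the values of all counters matching the query. */
static uint64_t sketch_ras_sum(const struct sketch_ras_counter *c, size_t n,
			       const struct sketch_ras_query *q)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		if (sketch_ras_match(&c[i], q))
			sum += c[i].value;
	}
	return sum;
}
```

With a format along these lines, "all fatal HBM errors" becomes a single
query (block HBM, any instance, severity fatal) that does not depend on
how many HBM stacks a given SoC has.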

Thanks,
Lijo

> 
> @Hawking Zhang, @Lazar, Lijo
> 
> Can you take a look at this series and API and see if it would align
> with our RAS requirements going forward?
> 
> Alex
> 
> 
>>
>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>
>> v4:
>> 1. Rebase
>> 2. rename drm_genl_send to drm_genl_reply
>> 3. catch error from xa_store and handle appropriately
>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>> having an early check of IS_DGFX (Michael J. Ruhl)
>>
>> v3:
>> 1. Rebase on latest RAS series for XE
>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem
>>
>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>> can leverage.
>>
>> Below is an example tool drm_ras which demonstrates the use of the
>> supported commands. The tool will be sent to ML with the subject
>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>> https://patchwork.freedesktop.org/series/118437/#rev2
>>
>> read single error counter:
>>
>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>> counter value 0
>>
>> read all error counters:
>>
>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>> name                                                    config-id               counter
>>
>> error-gt0-correctable-guc                               0x0000000000000001      0
>> error-gt0-correctable-slm                               0x0000000000000003      0
>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>> error-gt0-fatal-guc                                     0x0000000000000009      0
>> error-gt0-fatal-slm                                     0x000000000000000d      0
>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>> error-gt0-correctable-subslice                          0x0000000000000013      0
>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>> error-gt0-fatal-subslice                                0x0000000000000015      0
>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>> error-gt1-correctable-guc                               0x1000000000000001      0
>> error-gt1-correctable-slm                               0x1000000000000003      0
>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>> error-gt1-fatal-guc                                     0x1000000000000009      0
>> error-gt1-fatal-slm                                     0x100000000000000d      0
>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>> error-gt1-correctable-subslice                          0x1000000000000013      0
>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>> error-gt1-fatal-subslice                                0x1000000000000015      0
>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>
>> wait on a error event:
>>
>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>> waiting for error event
>> error event received
>> counter value 0
>>
>> list all errors:
>>
>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>> name                                                    config-id
>>
>> error-gt0-correctable-guc                               0x0000000000000001
>> error-gt0-correctable-slm                               0x0000000000000003
>> error-gt0-correctable-eu-ic                             0x0000000000000004
>> error-gt0-correctable-eu-grf                            0x0000000000000005
>> error-gt0-fatal-guc                                     0x0000000000000009
>> error-gt0-fatal-slm                                     0x000000000000000d
>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>> error-gt0-fatal-fpu                                     0x0000000000000010
>> error-gt0-fatal-tlb                                     0x0000000000000011
>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>> error-gt0-correctable-subslice                          0x0000000000000013
>> error-gt0-correctable-l3bank                            0x0000000000000014
>> error-gt0-fatal-subslice                                0x0000000000000015
>> error-gt0-fatal-l3bank                                  0x0000000000000016
>> error-gt0-sgunit-correctable                            0x0000000000000017
>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>> error-gt0-sgunit-fatal                                  0x0000000000000019
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>> error-gt0-soc-fatal-punit                               0x000000000000001d
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>> error-gt1-correctable-guc                               0x1000000000000001
>> error-gt1-correctable-slm                               0x1000000000000003
>> error-gt1-correctable-eu-ic                             0x1000000000000004
>> error-gt1-correctable-eu-grf                            0x1000000000000005
>> error-gt1-fatal-guc                                     0x1000000000000009
>> error-gt1-fatal-slm                                     0x100000000000000d
>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>> error-gt1-fatal-fpu                                     0x1000000000000010
>> error-gt1-fatal-tlb                                     0x1000000000000011
>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>> error-gt1-correctable-subslice                          0x1000000000000013
>> error-gt1-correctable-l3bank                            0x1000000000000014
>> error-gt1-fatal-subslice                                0x1000000000000015
>> error-gt1-fatal-l3bank                                  0x1000000000000016
>> error-gt1-sgunit-correctable                            0x1000000000000017
>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>> error-gt1-sgunit-fatal                                  0x1000000000000019
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>> error-gt1-soc-fatal-punit                               0x100000000000001d
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Oded Gabbay <ogabbay@kernel.org>
>> Cc: Tomer Tayar <ttayar@habana.ai>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>
>>
>> Aravind Iddamsetty (5):
>>    drm/netlink: Add netlink infrastructure
>>    drm/xe/RAS: Register netlink capability
>>    drm/xe/RAS: Expose the error counters
>>    drm/netlink: Define multicast groups
>>    drm/xe/RAS: send multicast event on occurrence of an error
>>
>>   drivers/gpu/drm/Makefile             |   1 +
>>   drivers/gpu/drm/drm_drv.c            |   7 +
>>   drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>   drivers/gpu/drm/xe/Makefile          |   1 +
>>   drivers/gpu/drm/xe/xe_device.c       |   4 +
>>   drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>   drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>   drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>   include/drm/drm_device.h             |   8 +
>>   include/drm/drm_drv.h                |   7 +
>>   include/drm/drm_netlink.h            |  35 ++
>>   include/uapi/drm/drm_netlink.h       |  87 +++++
>>   include/uapi/drm/xe_drm.h            |  81 +++++
>>   13 files changed, 977 insertions(+)
>>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>   create mode 100644 include/drm/drm_netlink.h
>>   create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> --
>> 2.25.1
>>


* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-26 10:04   ` Lazar, Lijo
@ 2023-10-30  6:19     ` Aravind Iddamsetty
  2023-10-30 15:11       ` Lazar, Lijo
  0 siblings, 1 reply; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-10-30  6:19 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang


On 26/10/23 15:34, Lazar, Lijo wrote:

Hi Lijo,

Thank you for your comments.

>
>
> On 10/23/2023 8:59 PM, Alex Deucher wrote:
>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
>> <aravind.iddamsetty@linux.intel.com> wrote:
>>>
>>> Our hardware supports RAS (Reliability, Availability, Serviceability) by
>>> reporting the errors to the host, which the KMD processes and exposes a
>>> set of error counters which can be used by observability tools to take
>>> corrective actions or repairs. Traditionally these were being exposed
>>> via PMU (for relative counters) and sysfs interface (for absolute
>>> value) in our internal branch. But, due to the limitations in this
>>> approach to use two interfaces and also not able to have an event based
>>> reporting or configurability, an alternative approach to try netlink
>>> was suggested by community for drm subsystem wide UAPI for RAS and
>>> telemetry as discussed in [1].
>>>
>>> This [1] is the inspiration to this series. It uses the generic
>>> netlink (genl) family subsystem and exposes a set of commands that can
>>> be used by every drm driver, the framework provides a means to have
>>> custom commands too. Each drm driver instance in this example xe driver
>>> instance registers a family and operations to the genl subsystem through
>>> which it enumerates and reports the error counters. An event based
>>> notification is also supported to which userspace can subscribe and
>>> be notified when any error occurs and read the error counter; this avoids
>>> continuous polling on the error counter. This can also be extended to
>>> threshold based notification.
>
> The commands used seem very limited. In AMD SOCs, IP blocks, instances of IP blocks, block types which support RAS will change across generations.
>
> This series has a single command to query the counters supported. Within that it seems to assign unique ids for every combination of error type, IP block type and then another for each instance. Not sure how good this kind of approach is for an end user. The Ids won't necessarily stay the same across multiple generations. Users will generally be interested in specific IP blocks.

Exactly, the IDs are UAPI and won't change once defined for a platform; any new SKU or platform will only add on top of the existing ones. Userspace can include the header and use the defines. The query is used to find out which errors exist on a platform, and userspace can then process the IDs of the IP blocks it is interested in. I believe that even if we list the errors block-wise, a query will still be needed, since without it userspace wouldn't know which blocks exist on a platform.
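
As a rough illustration of that userspace side, here is a minimal sketch in C of picking the IDs of one block out of the flat enumeration. The table entries are copied from the LIST_ERRORS output in the cover letter. Everything else is an assumption made for illustration and is not part of the series: the struct, the function names, and the substring match on the error name. The GT index helper assumes the GT is carried in the top four bits of the config-id, which is only inferred from the gt0/gt1 values in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One entry of the flat enumeration (hypothetical userspace-side layout). */
struct ras_error {
	const char *name;
	uint64_t id;
};

/* A few entries copied from the LIST_ERRORS output in the cover letter. */
static const struct ras_error sample_errors[] = {
	{ "error-gt0-correctable-guc",     0x0000000000000001ULL },
	{ "error-gt0-soc-fatal-hbm-ss0-0", 0x0000000000000025ULL },
	{ "error-gt0-soc-fatal-hbm-ss0-1", 0x0000000000000026ULL },
	{ "error-gt1-fatal-guc",           0x1000000000000009ULL },
	{ "error-gt1-soc-fatal-hbm-ss0-0", 0x1000000000000025ULL },
};

#define SAMPLE_COUNT (sizeof(sample_errors) / sizeof(sample_errors[0]))

/* Assumption: the GT index sits in bits 63:60, as the listing suggests. */
static unsigned int ras_error_gt(uint64_t id)
{
	return (unsigned int)(id >> 60);
}

/*
 * Copy the IDs of all errors whose name contains `block` into out[].
 * Returns the number of IDs stored, which is never more than `max`.
 */
static size_t ras_select_block(const struct ras_error *errs, size_t n,
			       const char *block, uint64_t *out, size_t max)
{
	size_t found = 0;

	assert(errs && block && out);
	for (size_t i = 0; i < n && found < max; i++) {
		if (strstr(errs[i].name, block))
			out[found++] = errs[i].id;
	}
	return found;
}
```

Calling ras_select_block(sample_errors, SAMPLE_COUNT, "-hbm-", ids, 8) on this sample table would collect the three HBM IDs across both GTs, and userspace could then read only those.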

>
> For ex: to get HBM errors, it looks like the current patch series supports READALL which dumps the whole set of errors. Or, users have to figure out the ids of HBM stack instance (whose capacity can change depending on the SOC and within a single family multiple configurations can exist) errors and do multiple READ_ONE calls. Both don't look good.
>
> It would be better if the command argument format can be well defined so that it can be queried based on IP block type, instance, and error types supported (CE/UE/fatal/parity/deferred etc.).

To mitigate the multiple-read limitation, we can introduce a new GENL command such as READ_MULTI, which accepts a list of error IDs from userspace and returns all the requested error counters in a single response. Also, listing individual errors helps if userspace wants to read a particular error at regular intervals. The intention is also to keep the KMD logic simple; userspace can build the required model on top of the flat enumeration.
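
To make the READ_MULTI idea more concrete, below is a minimal sketch in plain C of the lookup it implies: a list of IDs goes in, and the matching (ID, counter) pairs come back in one reply. This is only an illustration under assumptions. The struct layout, the function name ras_read_multi and the choice to skip unknown IDs are all hypothetical, and the sketch deliberately leaves out the netlink message encoding, which the series does not define for such a command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (error ID, counter) pair; hypothetical layout for illustration only. */
struct ras_counter {
	uint64_t id;
	uint64_t value;
};

/*
 * Hypothetical READ_MULTI lookup: find every requested ID in a flat counter
 * table and append the matches to reply[] in request order. Unknown IDs are
 * skipped. Returns the number of entries written, never more than reply_max.
 */
static size_t ras_read_multi(const struct ras_counter *table, size_t table_len,
			     const uint64_t *ids, size_t n_ids,
			     struct ras_counter *reply, size_t reply_max)
{
	size_t written = 0;

	assert(table && ids && reply);
	for (size_t i = 0; i < n_ids && written < reply_max; i++) {
		for (size_t j = 0; j < table_len; j++) {
			if (table[j].id == ids[i]) {
				reply[written++] = table[j];
				break;
			}
		}
	}
	return written;
}
```

Whether unknown IDs should be skipped, as here, or should fail the whole request is a design choice that would need to be settled as part of the UAPI.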

Please let me know if this sounds reasonable to you.

Thanks,
Aravind.
>
> Thanks,
> Lijo
>
>>
>> @Hawking Zhang, @Lazar, Lijo
>>
>> Can you take a look at this series and API and see if it would align
>> with our RAS requirements going forward?
>>
>> Alex
>>
>>
>>>
>>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>>
>>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>>
>>> v4:
>>> 1. Rebase
>>> 2. rename drm_genl_send to drm_genl_reply
>>> 3. catch error from xa_store and handle appropriately
>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>
>>> v3:
>>> 1. Rebase on latest RAS series for XE
>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>> register to netlink subsystem
>>>
>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>> can leverage.
>>>
>>> Below is an example tool drm_ras which demonstrates the use of the
>>> supported commands. The tool will be sent to ML with the subject
>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>> https://patchwork.freedesktop.org/series/118437/#rev2
>>>
>>> read single error counter:
>>>
>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>> counter value 0
>>>
>>> read all error counters:
>>>
>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>> name                                                    config-id               counter
>>>
>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>
>>> wait on an error event:
>>>
>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>> waiting for error event
>>> error event received
>>> counter value 0
>>>
>>> list all errors:
>>>
>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>> name                                                    config-id
>>>
>>> error-gt0-correctable-guc                               0x0000000000000001
>>> error-gt0-correctable-slm                               0x0000000000000003
>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>> error-gt0-fatal-guc                                     0x0000000000000009
>>> error-gt0-fatal-slm                                     0x000000000000000d
>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>> error-gt0-correctable-subslice                          0x0000000000000013
>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>> error-gt0-fatal-subslice                                0x0000000000000015
>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>> error-gt1-correctable-guc                               0x1000000000000001
>>> error-gt1-correctable-slm                               0x1000000000000003
>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>> error-gt1-fatal-guc                                     0x1000000000000009
>>> error-gt1-fatal-slm                                     0x100000000000000d
>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>> error-gt1-correctable-subslice                          0x1000000000000013
>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>> error-gt1-fatal-subslice                                0x1000000000000015
>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>> Cc: Tomer Tayar <ttayar@habana.ai>
>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>>
>>>
>>> Aravind Iddamsetty (5):
>>>    drm/netlink: Add netlink infrastructure
>>>    drm/xe/RAS: Register netlink capability
>>>    drm/xe/RAS: Expose the error counters
>>>    drm/netlink: Define multicast groups
>>>    drm/xe/RAS: send multicast event on occurrence of an error
>>>
>>>   drivers/gpu/drm/Makefile             |   1 +
>>>   drivers/gpu/drm/drm_drv.c            |   7 +
>>>   drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>>   drivers/gpu/drm/xe/Makefile          |   1 +
>>>   drivers/gpu/drm/xe/xe_device.c       |   4 +
>>>   drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>   drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>>   drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>>   include/drm/drm_device.h             |   8 +
>>>   include/drm/drm_drv.h                |   7 +
>>>   include/drm/drm_netlink.h            |  35 ++
>>>   include/uapi/drm/drm_netlink.h       |  87 +++++
>>>   include/uapi/drm/xe_drm.h            |  81 +++++
>>>   13 files changed, 977 insertions(+)
>>>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>   create mode 100644 include/drm/drm_netlink.h
>>>   create mode 100644 include/uapi/drm/drm_netlink.h
>>>
>>> -- 
>>> 2.25.1
>>>


* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-30  6:19     ` Aravind Iddamsetty
@ 2023-10-30 15:11       ` Lazar, Lijo
  2023-11-01  8:06         ` Aravind Iddamsetty
  0 siblings, 1 reply; 31+ messages in thread
From: Lazar, Lijo @ 2023-10-30 15:11 UTC (permalink / raw)
  To: Aravind Iddamsetty, Alex Deucher
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang



On 10/30/2023 11:49 AM, Aravind Iddamsetty wrote:
> 
> On 26/10/23 15:34, Lazar, Lijo wrote:
> 
> Hi Lijo,
> 
> Thank you for your comments.
> 
>>
>>
>> On 10/23/2023 8:59 PM, Alex Deucher wrote:
>>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
>>> <aravind.iddamsetty@linux.intel.com> wrote:
>>>>
>>>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>>>> reporting the errors to the host, which the KMD processes and exposes a
>>>> set of error counters which can be used by observability tools to take
>>>> corrective actions or repairs. Traditionally these were being exposed
>>>> via PMU (for relative counters) and sysfs interface (for absolute
>>>> value) in our internal branch. But, due to the limitations in this
>>>> approach to use two interfaces and also not able to have an event based
>>>> reporting or configurability, an alternative approach to try netlink
>>>> was suggested by community for drm subsystem wide UAPI for RAS and
>>>> telemetry as discussed in [1].
>>>>
>>>> This [1] is the inspiration to this series. It uses the generic
>>>> netlink(genl) family subsystem and exposes a set of commands that can
>>>> be used by every drm driver, the framework provides a means to have
>>>> custom commands too. Each drm driver instance in this example xe driver
>>>> instance registers a family and operations to the genl subsystem through
>>>> which it enumerates and reports the error counters. An event based
>>>> notification is also supported, to which userspace can subscribe and
>>>> be notified when any error occurs and read the error counter; this avoids
>>>> continuous polling on the error counter. This can also be extended to
>>>> threshold based notification.
>>
>> The set of commands seems very limited. In AMD SOCs, the IP blocks, the instances of IP blocks, and the block types that support RAS change across generations.
>>
>> This series has a single command to query the supported counters. Within that, it seems to assign a unique ID to every combination of error type and IP block type, and then another for each instance. I am not sure how good this approach is for an end user. The IDs won't necessarily stay the same across multiple generations, and users will generally be interested in specific IP blocks.
> 
> Exactly, the IDs are UAPI and won't change once defined for a platform; any new SKU or platform will add on top of the existing ones. Userspace can include the header and use the defines. The query is used to learn which errors exist on a platform, and userspace can process the IDs of the IP blocks of interest. I believe that even if we list block-wise, a query will still be needed, without which userspace wouldn't know which blocks exist on a platform.
> 

What I meant is: assigning an ID to every combination of IP block,
instance number and error type is not maintainable across different SOCs.

Instead, can we have something like:
	Query -> returns the IP block IDs, the number of instances, and the
		 error types supported by each IP block.
	Read Error -> IP block ID | instance number or instance ALL |
		      error type ID or error type ALL.
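
To make the shape of that suggestion concrete, below is a rough sketch in C
of how a "Read Error" selector with ALL wildcards could be matched against
a table of counters. It is an illustration only: every name in it
(ras_selector, ras_counter, RAS_INSTANCE_ALL, RAS_ERR_TYPE_ALL, the enum
values) is hypothetical and not part of the posted series, and summing the
matching counters is just one possible way to answer such a request.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard values; not defined by the posted series. */
#define RAS_INSTANCE_ALL UINT32_MAX
#define RAS_ERR_TYPE_ALL UINT32_MAX

/* Hypothetical IP block ids and error types, for illustration only. */
enum ras_ip_block { RAS_IP_GT, RAS_IP_HBM, RAS_IP_GSC };
enum ras_err_type { RAS_ERR_CORRECTABLE, RAS_ERR_NONFATAL, RAS_ERR_FATAL };

/* One counter, keyed by (IP block, instance, error type). */
struct ras_counter {
	uint32_t ip_block;
	uint32_t instance;
	uint32_t err_type;
	uint64_t value;
};

/* Selector carried by a hypothetical "Read Error" request. */
struct ras_selector {
	uint32_t ip_block;
	uint32_t instance;	/* instance number or RAS_INSTANCE_ALL */
	uint32_t err_type;	/* error type id or RAS_ERR_TYPE_ALL */
};

/* Return true if counter c is covered by selector s. */
static bool ras_match(const struct ras_selector *s,
		      const struct ras_counter *c)
{
	return s->ip_block == c->ip_block &&
	       (s->instance == RAS_INSTANCE_ALL || s->instance == c->instance) &&
	       (s->err_type == RAS_ERR_TYPE_ALL || s->err_type == c->err_type);
}

/* Sum the values of all counters in tbl[0..n) that match selector s. */
static uint64_t ras_read(const struct ras_selector *s,
			 const struct ras_counter *tbl, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		if (ras_match(s, &tbl[i]))
			sum += tbl[i].value;
	return sum;
}
```

With a selector of { RAS_IP_HBM, RAS_INSTANCE_ALL, RAS_ERR_TYPE_ALL }, a
caller would get the total over all HBM instances without knowing how many
stacks a given SOC has. A real interface might return per-instance values
instead of a sum.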

Thanks,
Lijo

>>
>> For example, to get HBM errors, the current patch series appears to offer READ_ALL, which dumps the whole set of errors. Otherwise, users have to figure out the IDs of the HBM stack instance errors (the stack capacity can change depending on the SOC, and multiple configurations can exist within a single family) and issue multiple READ_ONE calls. Neither looks good.
>>
>> It would be better if the command argument format can be well defined so that it can be queried based on IP block type, instance, and error types supported (CE/UE/fatal/parity/deferred etc.).
> 
> To mitigate the multiple-read limitation, we can introduce a new GENL command such as READ_MULTI, which accepts a list of error IDs from userspace and returns all the requested error counters in a single response. Also, listing individual errors helps if userspace wants to read a particular error at regular intervals. The intention is also to keep the KMD logic simple; userspace can build the required model on top of the flat enumeration.
> 
> Please let me know if this sounds reasonable to you.
> 
> Thanks,
> Aravind.
>>
>> Thanks,
>> Lijo
>>
>>>
>>> @Hawking Zhang, @Lazar, Lijo
>>>
>>> Can you take a look at this series and API and see if it would align
>>> with our RAS requirements going forward?
>>>
>>> Alex
>>>
>>>
>>>>
>>>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>>>
>>>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>>>
>>>> v4:
>>>> 1. Rebase
>>>> 2. rename drm_genl_send to drm_genl_reply
>>>> 3. catch error from xa_store and handle appropriately
>>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>>
>>>> v3:
>>>> 1. Rebase on latest RAS series for XE
>>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>>> register to netlink subsystem
>>>>
>>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>>> can leverage.
>>>>
>>>> Below is an example tool drm_ras which demonstrates the use of the
>>>> supported commands. The tool will be sent to ML with the subject
>>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>>> https://patchwork.freedesktop.org/series/118437/#rev2
>>>>
>>>> read single error counter:
>>>>
>>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>>> counter value 0
>>>>
>>>> read all error counters:
>>>>
>>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>>> name                                                    config-id               counter
>>>>
>>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>>
>>>> wait on an error event:
>>>>
>>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>>> waiting for error event
>>>> error event received
>>>> counter value 0
>>>>
>>>> list all errors:
>>>>
>>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>>> name                                                    config-id
>>>>
>>>> error-gt0-correctable-guc                               0x0000000000000001
>>>> error-gt0-correctable-slm                               0x0000000000000003
>>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>>> error-gt0-fatal-guc                                     0x0000000000000009
>>>> error-gt0-fatal-slm                                     0x000000000000000d
>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>>> error-gt0-correctable-subslice                          0x0000000000000013
>>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>>> error-gt0-fatal-subslice                                0x0000000000000015
>>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>>> error-gt1-correctable-guc                               0x1000000000000001
>>>> error-gt1-correctable-slm                               0x1000000000000003
>>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>>> error-gt1-fatal-guc                                     0x1000000000000009
>>>> error-gt1-fatal-slm                                     0x100000000000000d
>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>>> error-gt1-correctable-subslice                          0x1000000000000013
>>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>>> error-gt1-fatal-subslice                                0x1000000000000015
>>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>>> Cc: Tomer Tayar <ttayar@habana.ai>
>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>>>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>>>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>>>
>>>>
>>>> Aravind Iddamsetty (5):
>>>>     drm/netlink: Add netlink infrastructure
>>>>     drm/xe/RAS: Register netlink capability
>>>>     drm/xe/RAS: Expose the error counters
>>>>     drm/netlink: Define multicast groups
>>>>     drm/xe/RAS: send multicast event on occurrence of an error
>>>>
>>>>    drivers/gpu/drm/Makefile             |   1 +
>>>>    drivers/gpu/drm/drm_drv.c            |   7 +
>>>>    drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>>>    drivers/gpu/drm/xe/Makefile          |   1 +
>>>>    drivers/gpu/drm/xe/xe_device.c       |   4 +
>>>>    drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>>    drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>>>    drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>>>    include/drm/drm_device.h             |   8 +
>>>>    include/drm/drm_drv.h                |   7 +
>>>>    include/drm/drm_netlink.h            |  35 ++
>>>>    include/uapi/drm/drm_netlink.h       |  87 +++++
>>>>    include/uapi/drm/xe_drm.h            |  81 +++++
>>>>    13 files changed, 977 insertions(+)
>>>>    create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>>    create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>>    create mode 100644 include/drm/drm_netlink.h
>>>>    create mode 100644 include/uapi/drm/drm_netlink.h
>>>>
>>>> -- 
>>>> 2.25.1
>>>>

* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-30 15:11       ` Lazar, Lijo
@ 2023-11-01  8:06         ` Aravind Iddamsetty
  2023-11-07  5:30           ` Lazar, Lijo
  0 siblings, 1 reply; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-01  8:06 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang


On 30/10/23 20:41, Lazar, Lijo wrote:
>
>
> On 10/30/2023 11:49 AM, Aravind Iddamsetty wrote:
>>
>> On 26/10/23 15:34, Lazar, Lijo wrote:
>>
>> Hi Lijo,
>>
>> Thank you for your comments.
>>
>>>
>>>
>>> On 10/23/2023 8:59 PM, Alex Deucher wrote:
>>>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
>>>> <aravind.iddamsetty@linux.intel.com> wrote:
>>>>>
>>>>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>>>>> reporting the errors to the host, which the KMD processes and exposes a
>>>>> set of error counters which can be used by observability tools to take
>>>>> corrective actions or repairs. Traditionally there were being exposed
>>>>> via PMU (for relative counters) and sysfs interface (for absolute
>>>>> value) in our internal branch. But, due to the limitations in this
>>>>> approach to use two interfaces and also not able to have an event based
>>>>> reporting or configurability, an alternative approach to try netlink
>>>>> was suggested by community for drm subsystem wide UAPI for RAS and
>>>>> telemetry as discussed in [1].
>>>>>
>>>>> This [1] is the inspiration to this series. It uses the generic
>>>>> netlink(genl) family subsystem and exposes a set of commands that can
>>>>> be used by every drm driver, the framework provides a means to have
>>>>> custom commands too. Each drm driver instance in this example xe driver
>>>>> instance registers a family and operations to the genl subsystem through
>>>>> which it enumerates and reports the error counters. An event based
>>>>> notification is also supported to which userpace can subscribe to and
>>>>> be notified when any error occurs and read the error counter this avoids
>>>>> continuous polling on error counter. This can also be extended to
>>>>> threshold based notification.
>>>
>>> The commands used seems very limited. In AMD SOCs, IP blocks, instances of IP blocks, block types which support RAS will change across generations.
>>>
>>> This series has a single command to query the counters supported. Within that it seems to assign unique ids for every combination of error type, IP block type and then another for each instance. Not sure how good this kind of approach is for an end user. The Ids won't necessarily the stay the same across multiple generations. Users will generally be interested in specific IP blocks.
>>
>> Exactly the IDs are UAPI and won't change once defined for a platform and any new SKU or platform will add on top of existing ones. Userspace can include the header and use the defines. The query is used to know what all errors exists on a platform and userspace can process the IDs of IP block of interest. I believe even if we list block wise a query will be needed without which userspace wouldn't know which blocks exist on a platform.
>>
>
> What I meant is - assigning an id for every combination of IP block/ instance number/error type is not maintainable across different SOCs.
>
> Instead, can we have  something like -
>     Query -> returns IP block ids, number of instances, error types supported by each IP block.
>     Read Error -> IP block id | Instance number /Instance ALL | Error type id/Error type ALL.

Hi Lijo,

Could you please elaborate on the maintainability issue you foresee? I do have a concern about the suggested model, though.

It might work well for tools driven by user input, but I don't think it suits the case where we want to read a particular counter periodically.

The idea of giving each error its own ID is taken from the PMU subsystem, where every event has an ID and the events form a flat list. That way no multiple queries are needed, and the counters can be read individually or grouped together;
the grouping can be achieved via the READ_MULTI command I proposed earlier.
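
To illustrate how userspace could build a block-wise view on top of the flat list, here is a rough sketch. It is not taken from the drm_ras tool. The sample entries are copied from the LIST_ERRORS output quoted in this thread; the helper names, and the assumption that the GT index sits in the top 4 bits of the config-id, are mine and are only inferred from the gt1 IDs starting with 0x1. The resulting ID list is what a READ_MULTI style request might carry, if such a command is added.

```python
# (name, config-id) pairs copied from the LIST_ERRORS output in this thread.
SAMPLE = [
    ("error-gt0-correctable-guc", 0x0000000000000001),
    ("error-gt0-fatal-l3bank", 0x0000000000000016),
    ("error-gt0-soc-fatal-hbm-ss0-0", 0x0000000000000025),
    ("error-gt0-soc-fatal-hbm-ss0-1", 0x0000000000000026),
    ("error-gt1-correctable-guc", 0x1000000000000001),
    ("error-gt1-soc-fatal-hbm-ss0-0", 0x1000000000000025),
]

# Assumption: the GT index occupies the top 4 bits of the 64-bit ID.
# This is inferred from the listing only, not from the UAPI header.
GT_SHIFT = 60


def split_config_id(config_id):
    """Split a flat config-id into (gt index, per-GT error id)."""
    return config_id >> GT_SHIFT, config_id & ((1 << GT_SHIFT) - 1)


def ids_matching(errors, gt, keyword):
    """Collect the config-ids on one GT whose name contains keyword.

    The returned list is the kind of ID set a batched read could take,
    so one enumeration is enough to serve any block of interest.
    """
    return [
        cid
        for name, cid in errors
        if split_config_id(cid)[0] == gt and keyword in name
    ]


if __name__ == "__main__":
    for cid in ids_matching(SAMPLE, 0, "hbm"):
        print(hex(cid))
```

A monitoring tool would run the enumeration once, keep the filtered ID list, and then read only those IDs at each polling interval.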

Thanks,
Aravind.
>
> Thanks,
> Lijo
>
>>>
>>> For ex: to get HBM errors, it looks like the current patch series supports READALL which dumps the whole set of errors. Or, users have to figure out the ids of HBM stack instance (whose capacity can change depending on the SOC and within a single family multiple configurations can exist) errors and do multiple READ_ONE calls. Both don't look good.
>>>
>>> It would be better if the command argument format can be well defined so that it can be queried based on IP block type, instance, and error types supported (CE/UE/fatal/parity/deferred etc.).
>>
>> so to mitigate multiple read limitation, we can introduce a new GENL command like READ_MULTI which accepts a list of errors ids which userspace can pass and get all interested error counter as response at once. Also, listing individual errors helps if userspace wants to read a particular error at regular intervals. The intention is also to keep KMD logic simple, userspace can build required model on top of flat enumeration.
>>
>> Please let me know if this sounds reasonable to you.
>>
>> Thanks,
>> Aravind.
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>
>>>> @Hawking Zhang, @Lazar, Lijo
>>>>
>>>> Can you take a look at this series and API and see if it would align
>>>> with our RAS requirements going forward?
>>>>
>>>> Alex
>>>>
>>>>
>>>>>
>>>>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>>>>
>>>>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>>>>
>>>>> v4:
>>>>> 1. Rebase
>>>>> 2. rename drm_genl_send to drm_genl_reply
>>>>> 3. catch error from xa_store and handle appropriately
>>>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>>>
>>>>> v3:
>>>>> 1. Rebase on latest RAS series for XE
>>>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>>>> register to netlink subsystem
>>>>>
>>>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>>>> can leverage.
>>>>>
>>>>> Below is an example tool drm_ras which demonstrates the use of the
>>>>> supported commands. The tool will be sent to ML with the subject
>>>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>>>> https://patchwork.freedesktop.org/series/118437/#rev2
>>>>>
>>>>> read single error counter:
>>>>>
>>>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>>>> counter value 0
>>>>>
>>>>> read all error counters:
>>>>>
>>>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>>>> name                                                    config-id               counter
>>>>>
>>>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>>>
>>>>> wait on a error event:
>>>>>
>>>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>>>> waiting for error event
>>>>> error event received
>>>>> counter value 0
>>>>>
>>>>> list all errors:
>>>>>
>>>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>>>> name                                                    config-id
>>>>>
>>>>> error-gt0-correctable-guc                               0x0000000000000001
>>>>> error-gt0-correctable-slm                               0x0000000000000003
>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>>>> error-gt0-fatal-guc                                     0x0000000000000009
>>>>> error-gt0-fatal-slm                                     0x000000000000000d
>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>>>> error-gt0-correctable-subslice                          0x0000000000000013
>>>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>>>> error-gt0-fatal-subslice                                0x0000000000000015
>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>>>> error-gt1-correctable-guc                               0x1000000000000001
>>>>> error-gt1-correctable-slm                               0x1000000000000003
>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>>>> error-gt1-fatal-guc                                     0x1000000000000009
>>>>> error-gt1-fatal-slm                                     0x100000000000000d
>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>>>> error-gt1-correctable-subslice                          0x1000000000000013
>>>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>>>> error-gt1-fatal-subslice                                0x1000000000000015
>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>>>
>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>>>> Cc: Tomer Tayar <ttayar@habana.ai>
>>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>>>>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>>>>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>>>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>>>>
>>>>>
>>>>> Aravind Iddamsetty (5):
>>>>>     drm/netlink: Add netlink infrastructure
>>>>>     drm/xe/RAS: Register netlink capability
>>>>>     drm/xe/RAS: Expose the error counters
>>>>>     drm/netlink: Define multicast groups
>>>>>     drm/xe/RAS: send multicast event on occurrence of an error
>>>>>
>>>>>    drivers/gpu/drm/Makefile             |   1 +
>>>>>    drivers/gpu/drm/drm_drv.c            |   7 +
>>>>>    drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>>>>    drivers/gpu/drm/xe/Makefile          |   1 +
>>>>>    drivers/gpu/drm/xe/xe_device.c       |   4 +
>>>>>    drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>>>    drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>>>>    drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>>>>    include/drm/drm_device.h             |   8 +
>>>>>    include/drm/drm_drv.h                |   7 +
>>>>>    include/drm/drm_netlink.h            |  35 ++
>>>>>    include/uapi/drm/drm_netlink.h       |  87 +++++
>>>>>    include/uapi/drm/xe_drm.h            |  81 +++++
>>>>>    13 files changed, 977 insertions(+)
>>>>>    create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>>>    create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>>>    create mode 100644 include/drm/drm_netlink.h
>>>>>    create mode 100644 include/uapi/drm/drm_netlink.h
>>>>>
>>>>> -- 
>>>>> 2.25.1
>>>>>


* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-11-01  8:06         ` Aravind Iddamsetty
@ 2023-11-07  5:30           ` Lazar, Lijo
  2023-11-08  9:24             ` Aravind Iddamsetty
  0 siblings, 1 reply; 31+ messages in thread
From: Lazar, Lijo @ 2023-11-07  5:30 UTC (permalink / raw)
  To: Aravind Iddamsetty, Alex Deucher
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang



On 11/1/2023 1:36 PM, Aravind Iddamsetty wrote:
> 
> On 30/10/23 20:41, Lazar, Lijo wrote:
>>
>>
>> On 10/30/2023 11:49 AM, Aravind Iddamsetty wrote:
>>>
>>> On 26/10/23 15:34, Lazar, Lijo wrote:
>>>
>>> Hi Lijo,
>>>
>>> Thank you for your comments.
>>>
>>>>
>>>>
>>>> On 10/23/2023 8:59 PM, Alex Deucher wrote:
>>>>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
>>>>> <aravind.iddamsetty@linux.intel.com> wrote:
>>>>>>
>>>>>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>>>>>> reporting the errors to the host, which the KMD processes and exposes a
>>>>>> set of error counters which can be used by observability tools to take
>>>>>> corrective actions or repairs. Traditionally there were being exposed
>>>>>> via PMU (for relative counters) and sysfs interface (for absolute
>>>>>> value) in our internal branch. But, due to the limitations in this
>>>>>> approach to use two interfaces and also not able to have an event based
>>>>>> reporting or configurability, an alternative approach to try netlink
>>>>>> was suggested by community for drm subsystem wide UAPI for RAS and
>>>>>> telemetry as discussed in [1].
>>>>>>
>>>>>> This [1] is the inspiration to this series. It uses the generic
>>>>>> netlink(genl) family subsystem and exposes a set of commands that can
>>>>>> be used by every drm driver, the framework provides a means to have
>>>>>> custom commands too. Each drm driver instance in this example xe driver
>>>>>> instance registers a family and operations to the genl subsystem through
>>>>>> which it enumerates and reports the error counters. An event based
>>>>>> notification is also supported to which userpace can subscribe to and
>>>>>> be notified when any error occurs and read the error counter this avoids
>>>>>> continuous polling on error counter. This can also be extended to
>>>>>> threshold based notification.
>>>>
>>>> The commands used seems very limited. In AMD SOCs, IP blocks, instances of IP blocks, block types which support RAS will change across generations.
>>>>
>>>> This series has a single command to query the counters supported. Within that it seems to assign unique ids for every combination of error type, IP block type and then another for each instance. Not sure how good this kind of approach is for an end user. The Ids won't necessarily the stay the same across multiple generations. Users will generally be interested in specific IP blocks.
>>>
>>> Exactly the IDs are UAPI and won't change once defined for a platform and any new SKU or platform will add on top of existing ones. Userspace can include the header and use the defines. The query is used to know what all errors exists on a platform and userspace can process the IDs of IP block of interest. I believe even if we list block wise a query will be needed without which userspace wouldn't know which blocks exist on a platform.
>>>
>>
>> What I meant is - assigning an id for every combination of IP block/ instance number/error type is not maintainable across different SOCs.
>>
>> Instead, can we have  something like -
>>      Query -> returns IP block ids, number of instances, error types supported by each IP block.
>>      Read Error -> IP block id | Instance number /Instance ALL | Error type id/Error type ALL.
> 
> Hi Lijo,
> 
> Would you please elaborate more on what is the issue you fore see with the maintainability. But I have a query on the model suggested
> 
> This might work well with user input based tools, but don't think it suits if we want to periodically read a particular counter.
> 
> The inspiration to have ID for each is taken from PMU subsystem where every event has an ID and a flat list so no multiple queries and we can read them individually or group together
> which can be achieved via READ_MULTI command I proposed earlier.
> 

The problem is mainly with maintaining a static list that includes every
ip_id | instance | err_type combination. Instead, the preference is for the
client to query the capabilities (instances and error types supported) and
then use that info later to fetch error info.

A capability query could return something like the IP block, the total
number of instances available and the error types supported. This doesn't
require maintaining an ID list for each combination.

The number of instances per SOC can vary. For example, not all SKUs of
your SOC type are required to have ss0-ss3 HBMs. For the same SOC type or
for a new SOC type, there could be more or fewer.
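
As a minimal sketch of what one record in such a capability reply could
look like: struct ras_block_cap, the RAS_ERR_* bits and
ras_block_counters() are made-up names for illustration only, not part of
this series or of any existing UAPI. Userspace can derive how many
counters a block exposes from the record itself, so the instance count is
free to differ per SKU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error-type bits, one per supported error type. */
enum ras_err_bit {
	RAS_ERR_CORRECTABLE   = 1u << 0,
	RAS_ERR_UNCORRECTABLE = 1u << 1,
	RAS_ERR_DEFERRED      = 1u << 2,
	RAS_ERR_FATAL         = 1u << 3,
};

/* One capability record per IP block, as a query reply might carry it. */
struct ras_block_cap {
	uint8_t block_id;	/* IP block identifier */
	uint8_t num_instances;	/* instances present on this SKU */
	uint8_t err_type_mask;	/* OR of enum ras_err_bit values */
};

/* Number of (instance, error type) counters that one block exposes. */
static unsigned int ras_block_counters(const struct ras_block_cap *cap)
{
	unsigned int types = 0;
	uint8_t mask;

	assert(cap != NULL);

	/* Count the set bits: each iteration clears the lowest one. */
	for (mask = cap->err_type_mask; mask; mask &= mask - 1)
		types++;

	return types * cap->num_instances;
}
```

With a record describing 4 instances and two error types, the helper
reports 8 counters. A SKU with fewer instances only changes the record,
not any ID list.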

Roughly something like ..

enum ip_block_id {
	block1,
	block2,
	block3,
	/* ... */
	block_all
};

/* if required */
enum ip_sub_block_id {
	sub_block1,
	sub_block2,
	/* ... */
	sub_block_all
};

#define INSTANCE_ALL  (-1)

enum ras_error_type {
	correctable,
	uncorrectable,
	deferred,
	fatal,
	/* ... */
	err_all
};

Then define something like the below when querying error details.

	<31:24> - block id
	<23:16> - sub-block id
	<15:8>  - instance of interest
	<7:0>   - error type

The instance number could be INSTANCE_ALL or a specific IP instance.
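
To make the 32-bit layout above concrete, here is a rough sketch of
packing and unpacking such a query word. struct ras_query,
ras_query_pack(), ras_query_unpack() and RAS_INSTANCE_ALL are
hypothetical names used only for illustration. It also assumes that "all
instances" is encoded as 0xff, i.e. -1 truncated to the 8-bit instance
field.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: -1 truncated to the 8-bit instance field. */
#define RAS_INSTANCE_ALL 0xffu

/* Unpacked form of the query word; field names are illustrative. */
struct ras_query {
	uint8_t block;		/* <31:24> block id */
	uint8_t sub_block;	/* <23:16> sub-block id */
	uint8_t instance;	/* <15:8>  instance, or RAS_INSTANCE_ALL */
	uint8_t err_type;	/* <7:0>   error type */
};

/* Build the 32-bit query word from its four 8-bit fields. */
static uint32_t ras_query_pack(struct ras_query q)
{
	return ((uint32_t)q.block << 24) |
	       ((uint32_t)q.sub_block << 16) |
	       ((uint32_t)q.instance << 8) |
	       (uint32_t)q.err_type;
}

/* Split a 32-bit query word back into its fields. */
static struct ras_query ras_query_unpack(uint32_t word)
{
	struct ras_query q = {
		.block     = (uint8_t)(word >> 24),
		.sub_block = (uint8_t)(word >> 16),
		.instance  = (uint8_t)(word >> 8),
		.err_type  = (uint8_t)word,
	};

	return q;
}
```

For example, block 2, sub-block 0, all instances, error type 3 packs to
0x0200ff03, and a single request word is enough to ask for every
instance of a block.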

Thanks,
Lijo

> Thanks,
> Aravind.
>>
>> Thanks,
>> Lijo
>>
>>>>
>>>> For ex: to get HBM errors, it looks like the current patch series supports READALL which dumps the whole set of errors. Or, users have to figure out the ids of HBM stack instance (whose capacity can change depending on the SOC and within a single family multiple configurations can exist) errors and do multiple READ_ONE calls. Both don't look good.
>>>>
>>>> It would be better if the command argument format can be well defined so that it can be queried based on IP block type, instance, and error types supported (CE/UE/fatal/parity/deferred etc.).
>>>
>>> so to mitigate multiple read limitation, we can introduce a new GENL command like READ_MULTI which accepts a list of errors ids which userspace can pass and get all interested error counter as response at once. Also, listing individual errors helps if userspace wants to read a particular error at regular intervals. The intention is also to keep KMD logic simple, userspace can build required model on top of flat enumeration.
>>>
>>> Please let me know if this sounds reasonable to you.
>>>
>>> Thanks,
>>> Aravind.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>
>>>>> @Hawking Zhang, @Lazar, Lijo
>>>>>
>>>>> Can you take a look at this series and API and see if it would align
>>>>> with our RAS requirements going forward?
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>>
>>>>>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>>>>>
>>>>>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>>>>>
>>>>>> v4:
>>>>>> 1. Rebase
>>>>>> 2. rename drm_genl_send to drm_genl_reply
>>>>>> 3. catch error from xa_store and handle appropriately
>>>>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>>>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>>>>
>>>>>> v3:
>>>>>> 1. Rebase on latest RAS series for XE
>>>>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>>>>> register to netlink subsystem
>>>>>>
>>>>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>>>>> can leverage.
>>>>>>
>>>>>> Below is an example tool drm_ras which demonstrates the use of the
>>>>>> supported commands. The tool will be sent to ML with the subject
>>>>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>>>>> https://patchwork.freedesktop.org/series/118437/#rev2
>>>>>>
>>>>>> read single error counter:
>>>>>>
>>>>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>>>>> counter value 0
>>>>>>
>>>>>> read all error counters:
>>>>>>
>>>>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>>>>> name                                                    config-id               counter
>>>>>>
>>>>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>>>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>>>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>>>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>>>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>>>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>>>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>>>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>>>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>>>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>>>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>>>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>>>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>>>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>>>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>>>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>>>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>>>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>>>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>>>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>>>>
>>>>>> wait on a error event:
>>>>>>
>>>>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>>>>> waiting for error event
>>>>>> error event received
>>>>>> counter value 0
>>>>>>
>>>>>> list all errors:
>>>>>>
>>>>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>>>>> name                                                    config-id
>>>>>>
>>>>>> error-gt0-correctable-guc                               0x0000000000000001
>>>>>> error-gt0-correctable-slm                               0x0000000000000003
>>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>>>>> error-gt0-fatal-guc                                     0x0000000000000009
>>>>>> error-gt0-fatal-slm                                     0x000000000000000d
>>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>>>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>>>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>>>>> error-gt0-correctable-subslice                          0x0000000000000013
>>>>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>>>>> error-gt0-fatal-subslice                                0x0000000000000015
>>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>>>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>>>>> error-gt1-correctable-guc                               0x1000000000000001
>>>>>> error-gt1-correctable-slm                               0x1000000000000003
>>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>>>>> error-gt1-fatal-guc                                     0x1000000000000009
>>>>>> error-gt1-fatal-slm                                     0x100000000000000d
>>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>>>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>>>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>>>>> error-gt1-correctable-subslice                          0x1000000000000013
>>>>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>>>>> error-gt1-fatal-subslice                                0x1000000000000015
>>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>>>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>>>>
>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>>>>> Cc: Tomer Tayar <ttayar@habana.ai>
>>>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>>>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>>>>>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>>>>>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>>>>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>>>>>
>>>>>>
>>>>>> Aravind Iddamsetty (5):
>>>>>>      drm/netlink: Add netlink infrastructure
>>>>>>      drm/xe/RAS: Register netlink capability
>>>>>>      drm/xe/RAS: Expose the error counters
>>>>>>      drm/netlink: Define multicast groups
>>>>>>      drm/xe/RAS: send multicast event on occurrence of an error
>>>>>>
>>>>>>     drivers/gpu/drm/Makefile             |   1 +
>>>>>>     drivers/gpu/drm/drm_drv.c            |   7 +
>>>>>>     drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>>>>>     drivers/gpu/drm/xe/Makefile          |   1 +
>>>>>>     drivers/gpu/drm/xe/xe_device.c       |   4 +
>>>>>>     drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>>>>     drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>>>>>     drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>>>>>     include/drm/drm_device.h             |   8 +
>>>>>>     include/drm/drm_drv.h                |   7 +
>>>>>>     include/drm/drm_netlink.h            |  35 ++
>>>>>>     include/uapi/drm/drm_netlink.h       |  87 +++++
>>>>>>     include/uapi/drm/xe_drm.h            |  81 +++++
>>>>>>     13 files changed, 977 insertions(+)
>>>>>>     create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>>>>     create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>>>>     create mode 100644 include/drm/drm_netlink.h
>>>>>>     create mode 100644 include/uapi/drm/drm_netlink.h
>>>>>>
>>>>>> -- 
>>>>>> 2.25.1
>>>>>>

* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-11-07  5:30           ` Lazar, Lijo
@ 2023-11-08  9:24             ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-08  9:24 UTC (permalink / raw)
  To: Lazar, Lijo, Alex Deucher
  Cc: ogabbay, Harish.Kasiviswanathan, dri-devel, michael.j.ruhl,
	Luben.Tuikov, ttayar, alexander.deucher, Felix.Kuehling,
	intel-xe, Hawking.Zhang


On 07/11/23 11:00, Lazar, Lijo wrote:
>
>
> On 11/1/2023 1:36 PM, Aravind Iddamsetty wrote:
>>
>> On 30/10/23 20:41, Lazar, Lijo wrote:
>>>
>>>
>>> On 10/30/2023 11:49 AM, Aravind Iddamsetty wrote:
>>>>
>>>> On 26/10/23 15:34, Lazar, Lijo wrote:
>>>>
>>>> Hi Lijo,
>>>>
>>>> Thank you for your comments.
>>>>
>>>>>
>>>>>
>>>>> On 10/23/2023 8:59 PM, Alex Deucher wrote:
>>>>>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty
>>>>>> <aravind.iddamsetty@linux.intel.com> wrote:
>>>>>>>
>>>>>>> Our hardware supports RAS (Reliability, Availability, Serviceability) by
>>>>>>> reporting errors to the host, which the KMD processes and exposes as a
>>>>>>> set of error counters that observability tools can use to take
>>>>>>> corrective actions or repairs. Traditionally these were exposed
>>>>>>> via PMU (for relative counters) and a sysfs interface (for absolute
>>>>>>> values) in our internal branch. But, because this approach needs two
>>>>>>> interfaces and supports neither event-based reporting nor
>>>>>>> configurability, the community suggested trying netlink as a
>>>>>>> drm-subsystem-wide UAPI for RAS and telemetry, as discussed in [1].
>>>>>>>
>>>>>>> [1] is the inspiration for this series. It uses the generic
>>>>>>> netlink (genl) family subsystem and exposes a set of commands that can
>>>>>>> be used by every drm driver; the framework provides a means to have
>>>>>>> custom commands too. Each drm driver instance (in this example, the xe
>>>>>>> driver instance) registers a family and operations with the genl
>>>>>>> subsystem, through which it enumerates and reports the error counters.
>>>>>>> Event-based notification is also supported: userspace can subscribe
>>>>>>> and be notified when any error occurs, then read the error counter,
>>>>>>> which avoids continuous polling of the counters. This can also be
>>>>>>> extended to threshold-based notification.
>>>>>
>>>>> The commands used seem very limited. In AMD SOCs, the IP blocks, the instances of IP blocks, and the block types that support RAS change across generations.
>>>>>
>>>>> This series has a single command to query the counters supported. Within that, it seems to assign a unique ID to every combination of error type and IP block type, and then another for each instance. I am not sure how good this kind of approach is for an end user. The IDs won't necessarily stay the same across multiple generations. Users will generally be interested in specific IP blocks.
>>>>
>>>> Exactly, the IDs are UAPI and won't change once defined for a platform; any new SKU or platform will add on top of the existing ones. Userspace can include the header and use the defines. The query is used to learn which errors exist on a platform, and userspace can then process the IDs of the IP blocks of interest. I believe that even if we list block-wise, a query will be needed, without which userspace wouldn't know which blocks exist on a platform.
>>>>
>>>
>>> What I meant is that assigning an ID for every combination of IP block / instance number / error type is not maintainable across different SOCs.
>>>
>>> Instead, can we have something like:
>>>      Query -> returns IP block IDs, number of instances, and error types supported by each IP block.
>>>      Read Error -> IP block ID | instance number / instance ALL | error type ID / error type ALL.
>>
>> Hi Lijo,
>>
>> Would you please elaborate on the maintainability issue you foresee? I also have a comment on the suggested model:
>>
>> It might work well with user-input-based tools, but I don't think it suits the case where we want to periodically read a particular counter.
>>
>> The inspiration for having an ID for each error is taken from the PMU subsystem, where every event has an ID in a flat list, so no multiple queries are needed and we can read events individually or grouped together,
>> which can be achieved via the READ_MULTI command I proposed earlier.
>>
>
> The problem is mainly with maintaining a static list that includes all ip_id | instance | err_type combinations. Instead, the preference is for the client to query the capabilities (the instances and error types supported) and then use that info later to fetch error info.
>
> A capability query could return something like the IP block, the total instances available, and the error types supported. This doesn't require maintaining an ID list for each combination.
>
> The instances per SOC could be variable. For example, it's not required that all SKUs of your SOC type have ss0-ss3 HBMs. For the same SOC type or for a new SOC type, there could be more or fewer.
>
> Roughly something like ..
>
> enum ip_block_id
> {
>     block1,
>     block2,
>     block3,
>     /* ... */
>     block_all
> };
>
> /* if required */
> enum ip_sub_block_id
> {
>     sub_block1,
>     sub_block2,
>     /* ... */
>     sub_block_all
> };
>
> #define INSTANCE_ALL  (-1)
>
> enum ras_error_type
> {
>     correctable,
>     uncorrectable,
>     deferred,
>     fatal,
>     /* ... */
>     err_all
> };
>
> Then define something like the layout below for querying error details:
>
>     <31:24> - block ID
>     <23:16> - sub-block ID
>     <15:8>  - instance of interest
>     <7:0>   - error type
>
> The instance number could be INSTANCE_ALL or a specific IP instance.
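
The packed query argument described above could be sketched roughly as follows. This is only an illustration of the bit layout from the mail, not code from the series: the helper names (ras_query_pack and friends), the RAS_* macros, and the choice of 0xff as the 8-bit wildcard (what -1 becomes when truncated to 8 bits) are all assumptions.

```c
#include <stdint.h>

/* Field positions follow the layout in the mail:
 * <31:24> block id, <23:16> sub-block id, <15:8> instance, <7:0> error type.
 * All macro and function names here are hypothetical. */
#define RAS_BLOCK_SHIFT      24
#define RAS_SUB_BLOCK_SHIFT  16
#define RAS_INSTANCE_SHIFT   8
#define RAS_ERR_TYPE_SHIFT   0
#define RAS_FIELD_MASK       0xffu

/* Assumed wildcard: INSTANCE_ALL (-1) truncated to an 8-bit field. */
#define RAS_FIELD_ALL        0xffu

/* Pack the four 8-bit fields into one 32-bit query argument. */
static inline uint32_t ras_query_pack(uint8_t block, uint8_t sub_block,
				      uint8_t instance, uint8_t err_type)
{
	return ((uint32_t)block << RAS_BLOCK_SHIFT) |
	       ((uint32_t)sub_block << RAS_SUB_BLOCK_SHIFT) |
	       ((uint32_t)instance << RAS_INSTANCE_SHIFT) |
	       ((uint32_t)err_type << RAS_ERR_TYPE_SHIFT);
}

/* Accessors that extract each field again on the receiving side. */
static inline uint8_t ras_query_block(uint32_t q)
{
	return (q >> RAS_BLOCK_SHIFT) & RAS_FIELD_MASK;
}

static inline uint8_t ras_query_sub_block(uint32_t q)
{
	return (q >> RAS_SUB_BLOCK_SHIFT) & RAS_FIELD_MASK;
}

static inline uint8_t ras_query_instance(uint32_t q)
{
	return (q >> RAS_INSTANCE_SHIFT) & RAS_FIELD_MASK;
}

static inline uint8_t ras_query_err_type(uint32_t q)
{
	return (q >> RAS_ERR_TYPE_SHIFT) & RAS_FIELD_MASK;
}
```

With such helpers, a client that wants one error type across every instance of a block would pass RAS_FIELD_ALL in the instance field, so a single request covers all instances without a per-instance ID list.
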
Hi Lijo,

Thanks for the explanation. I will rework this as suggested and repost a new series soon.

Thanks,
Aravind.
>
> Thanks,
> Lijo
>
>> Thanks,
>> Aravind.
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>>
>>>>> For example, to get HBM errors, it looks like the current patch series supports READ_ALL, which dumps the whole set of errors. Otherwise, users have to figure out the IDs of the HBM stack instance errors (the capacity can change depending on the SOC, and multiple configurations can exist within a single family) and make multiple READ_ONE calls. Neither looks good.
>>>>>
>>>>> It would be better if the command argument format were well defined, so that counters can be queried based on IP block type, instance, and the error types supported (CE/UE/fatal/parity/deferred etc.).
>>>>
>>>> To mitigate the multiple-read limitation, we can introduce a new GENL command such as READ_MULTI, which accepts a list of error IDs passed by userspace and returns all the counters of interest in a single response. Also, listing individual errors helps if userspace wants to read a particular error at regular intervals. The intention is also to keep the KMD logic simple; userspace can build the required model on top of the flat enumeration.
>>>>
>>>> Please let me know if this sounds reasonable to you.
>>>>
>>>> Thanks,
>>>> Aravind.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>
>>>>>> @Hawking Zhang, @Lazar, Lijo
>>>>>>
>>>>>> Can you take a look at this series and API and see if it would align
>>>>>> with our RAS requirements going forward?
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>>>>>>
>>>>>>> This series is on top of https://patchwork.freedesktop.org/series/125373/
>>>>>>>
>>>>>>> v4:
>>>>>>> 1. Rebase
>>>>>>> 2. rename drm_genl_send to drm_genl_reply
>>>>>>> 3. catch error from xa_store and handle appropriately
>>>>>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>>>>>>> having an early check of IS_DGFX (Michael J. Ruhl)
>>>>>>>
>>>>>>> v3:
>>>>>>> 1. Rebase on latest RAS series for XE
>>>>>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>>>>>> register to netlink subsystem
>>>>>>>
>>>>>>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>>>>>>> can leverage.
>>>>>>>
>>>>>>> Below is an example tool drm_ras which demonstrates the use of the
>>>>>>> supported commands. The tool will be sent to ML with the subject
>>>>>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>>>>>> https://patchwork.freedesktop.org/series/118437/#rev2
>>>>>>>
>>>>>>> read single error counter:
>>>>>>>
>>>>>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>>>>>>> counter value 0
>>>>>>>
>>>>>>> read all error counters:
>>>>>>>
>>>>>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>>>>>>> name                                                    config-id               counter
>>>>>>>
>>>>>>> error-gt0-correctable-guc                               0x0000000000000001      0
>>>>>>> error-gt0-correctable-slm                               0x0000000000000003      0
>>>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>>>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>>>>>>> error-gt0-fatal-guc                                     0x0000000000000009      0
>>>>>>> error-gt0-fatal-slm                                     0x000000000000000d      0
>>>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>>>>>>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>>>>>>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>>>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>>>>>>> error-gt0-correctable-subslice                          0x0000000000000013      0
>>>>>>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>>>>>>> error-gt0-fatal-subslice                                0x0000000000000015      0
>>>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>>>>>>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>>>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>>>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>>>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>>>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>>>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>>>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>>>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>>>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>>>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>>>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>>>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>>>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>>>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>>>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>>>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>>>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>>>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>>>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>>>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>>>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>>>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>>>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>>>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>>>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>>>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>>>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>>>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>>>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>>>>>>> error-gt1-correctable-guc                               0x1000000000000001      0
>>>>>>> error-gt1-correctable-slm                               0x1000000000000003      0
>>>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>>>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>>>>>>> error-gt1-fatal-guc                                     0x1000000000000009      0
>>>>>>> error-gt1-fatal-slm                                     0x100000000000000d      0
>>>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>>>>>>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>>>>>>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>>>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>>>>>>> error-gt1-correctable-subslice                          0x1000000000000013      0
>>>>>>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>>>>>>> error-gt1-fatal-subslice                                0x1000000000000015      0
>>>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>>>>>>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>>>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>>>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>>>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>>>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>>>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>>>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>>>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>>>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>>>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>>>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>>>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>>>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>>>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>>>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>>>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>>>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>>>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>>>>>>
>>>>>>> wait on an error event:
>>>>>>>
>>>>>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>>>>>>> waiting for error event
>>>>>>> error event received
>>>>>>> counter value 0
>>>>>>>
>>>>>>> list all errors:
>>>>>>>
>>>>>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>>>>>>> name                                                    config-id
>>>>>>>
>>>>>>> error-gt0-correctable-guc                               0x0000000000000001
>>>>>>> error-gt0-correctable-slm                               0x0000000000000003
>>>>>>> error-gt0-correctable-eu-ic                             0x0000000000000004
>>>>>>> error-gt0-correctable-eu-grf                            0x0000000000000005
>>>>>>> error-gt0-fatal-guc                                     0x0000000000000009
>>>>>>> error-gt0-fatal-slm                                     0x000000000000000d
>>>>>>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>>>>>>> error-gt0-fatal-fpu                                     0x0000000000000010
>>>>>>> error-gt0-fatal-tlb                                     0x0000000000000011
>>>>>>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>>>>>>> error-gt0-correctable-subslice                          0x0000000000000013
>>>>>>> error-gt0-correctable-l3bank                            0x0000000000000014
>>>>>>> error-gt0-fatal-subslice                                0x0000000000000015
>>>>>>> error-gt0-fatal-l3bank                                  0x0000000000000016
>>>>>>> error-gt0-sgunit-correctable                            0x0000000000000017
>>>>>>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>>>>>>> error-gt0-sgunit-fatal                                  0x0000000000000019
>>>>>>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>>>>>>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>>>>>>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>>>>>>> error-gt0-soc-fatal-punit                               0x000000000000001d
>>>>>>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>>>>>>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>>>>>>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>>>>>>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>>>>>>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>>>>>>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>>>>>>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>>>>>>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>>>>>>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>>>>>>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>>>>>>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>>>>>>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>>>>>>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>>>>>>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>>>>>>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>>>>>>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>>>>>>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>>>>>>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>>>>>>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>>>>>>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>>>>>>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>>>>>>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>>>>>>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>>>>>>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>>>>>>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>>>>>>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>>>>>>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>>>>>>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>>>>>>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>>>>>>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>>>>>>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>>>>>>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>>>>>>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>>>>>>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>>>>>>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>>>>>>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>>>>>>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>>>>>>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>>>>>>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>>>>>>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>>>>>>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>>>>>>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>>>>>>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>>>>>>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>>>>>>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>>>>>>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>>>>>>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>>>>>>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>>>>>>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>>>>>>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>>>>>>> error-gt1-correctable-guc                               0x1000000000000001
>>>>>>> error-gt1-correctable-slm                               0x1000000000000003
>>>>>>> error-gt1-correctable-eu-ic                             0x1000000000000004
>>>>>>> error-gt1-correctable-eu-grf                            0x1000000000000005
>>>>>>> error-gt1-fatal-guc                                     0x1000000000000009
>>>>>>> error-gt1-fatal-slm                                     0x100000000000000d
>>>>>>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>>>>>>> error-gt1-fatal-fpu                                     0x1000000000000010
>>>>>>> error-gt1-fatal-tlb                                     0x1000000000000011
>>>>>>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>>>>>>> error-gt1-correctable-subslice                          0x1000000000000013
>>>>>>> error-gt1-correctable-l3bank                            0x1000000000000014
>>>>>>> error-gt1-fatal-subslice                                0x1000000000000015
>>>>>>> error-gt1-fatal-l3bank                                  0x1000000000000016
>>>>>>> error-gt1-sgunit-correctable                            0x1000000000000017
>>>>>>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>>>>>>> error-gt1-sgunit-fatal                                  0x1000000000000019
>>>>>>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>>>>>>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>>>>>>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>>>>>>> error-gt1-soc-fatal-punit                               0x100000000000001d
>>>>>>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>>>>>>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>>>>>>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>>>>>>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>>>>>>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>>>>>>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>>>>>>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>>>>>>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>>>>>>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>>>>>>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>>>>>>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>>>>>>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>>>>>>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>>>>>>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>>>>>>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>>>>>>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>>>>>>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>>>>>>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>>>>>>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>>>>>>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>>>>>>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>>>>>>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>>>>>>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>>>>>>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>>>>>>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>>>>>>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>>>>>>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>>>>>>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>>>>>>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>>>>>>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>>>>>>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>>>>>>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>>>>>>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>>>>>>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>>>>>>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>>>>>>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>>>>>>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>>>>>>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>>>>>>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>>>>>>
>>>>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>> Cc: Daniel Vetter <daniel@ffwll.ch>
>>>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>>>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>>>>>> Cc: Tomer Tayar <ttayar@habana.ai>
>>>>>>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>>>>>>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>>>>>>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>>>>>>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>>>>>>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>>>>>>
>>>>>>>
>>>>>>> Aravind Iddamsetty (5):
>>>>>>>      drm/netlink: Add netlink infrastructure
>>>>>>>      drm/xe/RAS: Register netlink capability
>>>>>>>      drm/xe/RAS: Expose the error counters
>>>>>>>      drm/netlink: Define multicast groups
>>>>>>>      drm/xe/RAS: send multicast event on occurrence of an error
>>>>>>>
>>>>>>>     drivers/gpu/drm/Makefile             |   1 +
>>>>>>>     drivers/gpu/drm/drm_drv.c            |   7 +
>>>>>>>     drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>>>>>>     drivers/gpu/drm/xe/Makefile          |   1 +
>>>>>>>     drivers/gpu/drm/xe/xe_device.c       |   4 +
>>>>>>>     drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>>>>>>     drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>>>>>>     drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>>>>>>     include/drm/drm_device.h             |   8 +
>>>>>>>     include/drm/drm_drv.h                |   7 +
>>>>>>>     include/drm/drm_netlink.h            |  35 ++
>>>>>>>     include/uapi/drm/drm_netlink.h       |  87 +++++
>>>>>>>     include/uapi/drm/xe_drm.h            |  81 +++++
>>>>>>>     13 files changed, 977 insertions(+)
>>>>>>>     create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>>>>>     create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>>>>>>     create mode 100644 include/drm/drm_netlink.h
>>>>>>>     create mode 100644 include/uapi/drm/drm_netlink.h
>>>>>>>
>>>>>>> -- 
>>>>>>> 2.25.1
>>>>>>>

* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (5 preceding siblings ...)
  2023-10-23 15:29 ` [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Alex Deucher
@ 2023-11-10 12:23 ` Tomer Tayar
  2023-11-22 14:28   ` Aravind Iddamsetty
  6 siblings, 1 reply; 31+ messages in thread
From: Tomer Tayar @ 2023-11-10 12:23 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 20/10/2023 18:58, Aravind Iddamsetty wrote:
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> reporting the errors to the host, which the KMD processes and exposes a
> set of error counters which can be used by observability tools to take
> corrective actions or repairs. Traditionally there were being exposed
> via PMU (for relative counters) and sysfs interface (for absolute
> value) in our internal branch. But, due to the limitations in this
> approach to use two interfaces and also not able to have an event based
> reporting or configurability, an alternative approach to try netlink
> was suggested by community for drm subsystem wide UAPI for RAS and
> telemetry as discussed in [1].
>
> This [1] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too. Each drm driver instance in this example xe driver
> instance registers a family and operations to the genl subsystem through
> which it enumerates and reports the error counters. An event based
> notification is also supported to which userpace can subscribe to and
> be notified when any error occurs and read the error counter this avoids
> continuous polling on error counter. This can also be extended to
> threshold based notification.

Hi Aravind,

I can see that the "nomenclature" in the patch series is mainly about
errors.
When we refer to RAS, can't there be other non-error values that might be
relevant, e.g. statistics, status/state, etc.?

Thanks,
Tomer

> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>
> v4:
> 1. Rebase
> 2. rename drm_genl_send to drm_genl_reply
> 3. catch error from xa_store and handle appropriately
> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
> having an early check of IS_DGFX (Michael J. Ruhl)
>
> v3:
> 1. Rebase on latest RAS series for XE
> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem
>
> v2: define common interfaces to genl netlink subsystem that all drm drivers
> can leverage.
>
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject
> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> https://patchwork.freedesktop.org/series/118437/#rev2
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
>
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>
> wait on a error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
>
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
> Cc: Tomer Tayar <ttayar@habana.ai>
> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>
>
> Aravind Iddamsetty (5):
>    drm/netlink: Add netlink infrastructure
>    drm/xe/RAS: Register netlink capability
>    drm/xe/RAS: Expose the error counters
>    drm/netlink: Define multicast groups
>    drm/xe/RAS: send multicast event on occurrence of an error
>
>   drivers/gpu/drm/Makefile             |   1 +
>   drivers/gpu/drm/drm_drv.c            |   7 +
>   drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>   drivers/gpu/drm/xe/Makefile          |   1 +
>   drivers/gpu/drm/xe/xe_device.c       |   4 +
>   drivers/gpu/drm/xe/xe_device_types.h |   1 +
>   drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>   drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>   include/drm/drm_device.h             |   8 +
>   include/drm/drm_drv.h                |   7 +
>   include/drm/drm_netlink.h            |  35 ++
>   include/uapi/drm/drm_netlink.h       |  87 +++++
>   include/uapi/drm/xe_drm.h            |  81 +++++
>   13 files changed, 977 insertions(+)
>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>   create mode 100644 include/drm/drm_netlink.h
>   create mode 100644 include/uapi/drm/drm_netlink.h
>
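
For readers decoding the config ids in the listing above, here is a minimal
sketch. It assumes the GT index sits in the top four bits (63:60) and the
error id in the remaining bits. That layout is inferred only from the gt1
entries shown (for example 0x1000000000000025); the SKETCH_* names and both
helpers are hypothetical and are not the macros defined by the series.

```c
#include <stdint.h>

/* Assumption: GT index lives in bits 63:60, inferred from the gt1 listing. */
#define SKETCH_GT_SHIFT 60

/* Extract the (assumed) GT index from a config id. */
static inline unsigned int sketch_config_gt(uint64_t config)
{
	return (unsigned int)(config >> SKETCH_GT_SHIFT);
}

/* Extract the (assumed) per-GT error id from a config id. */
static inline uint64_t sketch_config_error(uint64_t config)
{
	return config & ((UINT64_C(1) << SKETCH_GT_SHIFT) - 1);
}
```

Under that assumption, 0x1000000000000025 splits into GT 1 and error id 0x25,
which matches the "error-gt1-soc-fatal-hbm-ss0-0" name in the listing.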



* Re: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
  2023-10-20 20:36   ` Ruhl, Michael J
@ 2023-11-10 12:24   ` Tomer Tayar
  2023-11-22 14:32     ` Aravind Iddamsetty
  1 sibling, 1 reply; 31+ messages in thread
From: Tomer Tayar @ 2023-11-10 12:24 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 20/10/2023 18:58, Aravind Iddamsetty wrote:
> Define the netlink registration interface and the commands and attributes
> that can be used in common across drm drivers. This patch intends to use
> the generic netlink family to expose various device stats. At present
> it defines some commands that shall be used to expose RAS error counters.
>
> v2:
> define common interfaces to genl netlink subsystem that all drm drivers
> can leverage.(Tomer Tayar)
>
> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
> register to netlink subsystem (Daniel Vetter)
>
> v4:(Michael J. Ruhl)
> 1. rename drm_genl_send to drm_genl_reply
> 2. catch error from xa_store and handle appropriately
>
> Cc: Tomer Tayar<ttayar@habana.ai>
> Cc: Daniel Vetter<daniel@ffwll.ch>
> Cc: Michael J. Ruhl<michael.j.ruhl@intel.com>
>
> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
> ---
>   drivers/gpu/drm/Makefile       |   1 +
>   drivers/gpu/drm/drm_drv.c      |   7 ++
>   drivers/gpu/drm/drm_netlink.c  | 188 +++++++++++++++++++++++++++++++++
>   include/drm/drm_device.h       |   8 ++
>   include/drm/drm_drv.h          |   7 ++
>   include/drm/drm_netlink.h      |  30 ++++++
>   include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
>   7 files changed, 324 insertions(+)
>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>   create mode 100644 include/drm/drm_netlink.h
>   create mode 100644 include/uapi/drm/drm_netlink.h
>
> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
> index ee64c51274ad..60864369adaa 100644
> --- a/drivers/gpu/drm/Makefile
> +++ b/drivers/gpu/drm/Makefile
> @@ -35,6 +35,7 @@ drm-y := \
>   	drm_mode_object.o \
>   	drm_modes.o \
>   	drm_modeset_lock.o \
> +	drm_netlink.o \
>   	drm_plane.o \
>   	drm_prime.o \
>   	drm_print.o \
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 535f16e7882e..31f55c1c7524 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>   	if (ret)
>   		goto err_minors;
>   
> +	if (driver->genl_ops) {
> +		ret = drm_genl_register(dev);
> +		if (ret)
> +			goto err_minors;
> +	}
> +
>   	ret = create_compat_control_link(dev);
>   	if (ret)
>   		goto err_minors;
> @@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
>   {
>   	drm_privacy_screen_lookup_exit();
>   	accel_core_exit();
> +	drm_genl_exit();
>   	unregister_chrdev(DRM_MAJOR, "drm");
>   	debugfs_remove(drm_debugfs_root);
>   	drm_sysfs_destroy();
> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
> new file mode 100644
> index 000000000000..8add249c1da3
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_netlink.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_managed.h>
> +#include <drm/drm_netlink.h>
> +#include <drm/drm_print.h>
> +
> +DEFINE_XARRAY(drm_dev_xarray);
> +
> +/**
> + * drm_genl_reply - response to a request
> + * @msg: socket buffer
> + * @info: receiver information
> + * @usrhdr: pointer to user specific header in the message buffer
> + *
> + * RETURNS:
> + * 0 on success and negative error code on failure
> + */
> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
> +{
> +	int ret;
> +
> +	genlmsg_end(msg, usrhdr);
> +
> +	ret = genlmsg_reply(msg, info);
> +	if (ret)
> +		nlmsg_free(msg);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(drm_genl_reply);
> +
> +/**
> + * drm_genl_alloc_msg - allocate genl message buffer
> + * @dev: drm_device for which the message is being allocated
> + * @info: receiver information

A kerneldoc description for @msg_size is missing.

> + * @usrhdr: pointer to user specific header in the message buffer
> + *
> + * RETURNS:
> + * pointer to new allocated buffer on success, NULL on failure
> + */
> +struct sk_buff *
> +drm_genl_alloc_msg(struct drm_device *dev,
> +		   struct genl_info *info,
> +		   size_t msg_size, void **usrhdr)
> +{
> +	struct sk_buff *new_msg;
> +
> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
> +	if (!new_msg)
> +		return new_msg;
> +
> +	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
> +	if (!*usrhdr) {
> +		nlmsg_free(new_msg);
> +		new_msg = NULL;
> +	}
> +
> +	return new_msg;
> +}
> +EXPORT_SYMBOL(drm_genl_alloc_msg);
> +
> +static struct drm_device *genl_to_dev(struct genl_info *info)
> +{
> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
> +}
> +
> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct drm_device *dev = genl_to_dev(info);
> +
> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
> +		return -EINVAL;
> +
> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
> +		return -EOPNOTSUPP;
> +
> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
> +}
> +
> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct drm_device *dev = genl_to_dev(info);
> +
> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
> +		return -EINVAL;
> +
> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
> +		return -EOPNOTSUPP;
> +
> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
> +}
> +
> +/* attribute policies */
> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
> +	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
> +};
> +
> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
> +};
> +
> +/* drm genl operations definition */
> +const struct genl_ops drm_genl_ops[] = {
> +	{
> +		.cmd = DRM_RAS_CMD_QUERY,
> +		.doit = drm_genl_list_errors,
> +		.policy = drm_attr_policy_query,
> +	},
> +	{
> +		.cmd = DRM_RAS_CMD_READ_ONE,
> +		.doit = drm_genl_read_error,
> +		.policy = drm_attr_policy_read_one,
> +	},
> +	{
> +		.cmd = DRM_RAS_CMD_READ_ALL,
> +		.doit = drm_genl_list_errors,
> +		.policy = drm_attr_policy_query,
> +	},
> +};
> +
> +static void drm_genl_family_init(struct drm_device *dev)
> +{
> +	/* Use drm primary node name eg: card0 to name the genl family */
> +	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);

dev_name() can be used here.
Also, what about accel devices? Maybe check dev->primary and use the
primary or accel minor accordingly?
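
A rough sketch of that selection, written as a self-contained userspace mock
rather than kernel code. The mock_* types and mock_genl_family_name() are
hypothetical stand-ins for drm_device, drm_minor and the naming step in
drm_genl_family_init(); the only kernel fact relied on is that genl family
names are limited to GENL_NAMSIZ (16) bytes.

```c
#include <stddef.h>
#include <stdio.h>

#define FAMILY_NAME_LEN 16 /* mirrors GENL_NAMSIZ */

/* Hypothetical stand-ins for struct drm_minor and struct drm_device. */
struct mock_minor {
	const char *name; /* e.g. "card0" or "accel0" */
};

struct mock_drm_device {
	struct mock_minor *primary; /* NULL for accel-only devices */
	struct mock_minor *accel;   /* NULL for regular GPU devices */
};

/*
 * Pick the primary minor when present, otherwise fall back to the accel
 * minor. Returns 0 on success, -1 if no minor exists or the name would
 * not fit in the family name buffer.
 */
static int mock_genl_family_name(const struct mock_drm_device *dev,
				 char *buf, size_t len)
{
	const struct mock_minor *minor = dev->primary ? dev->primary : dev->accel;
	int n;

	if (!minor || !minor->name)
		return -1;

	n = snprintf(buf, len, "%s", minor->name);
	if (n < 0 || (size_t)n >= len)
		return -1; /* truncated names could collide, so reject them */

	return 0;
}
```

Rejecting a truncated name instead of silently shortening it is a choice made
for this sketch; the patch itself just calls snprintf().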

> +	dev->drm_genl_family.version = DRM_GENL_VERSION;
> +	dev->drm_genl_family.parallel_ops = true;
> +	dev->drm_genl_family.ops = drm_genl_ops;
> +	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
> +	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
> +	dev->drm_genl_family.module = dev->dev->driver->owner;
> +}
> +
> +static void drm_genl_deregister(struct drm_device *dev,  void *arg)

There is a redundant space before "void *arg".

> +{
> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
> +
> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
> +
> +	genl_unregister_family(&dev->drm_genl_family);
> +}
> +
> +/**
> + * drm_genl_register - Register genl family
> + * @dev: drm_device for which genl family needs to be registered
> + *
> + * RETURNS:
> + * 0 on success and negative error code on failure
> + */
> +int drm_genl_register(struct drm_device *dev)
> +{
> +	int ret;
> +
> +	drm_genl_family_init(dev);
> +
> +	ret = genl_register_family(&dev->drm_genl_family);
> +	if (ret < 0) {
> +		drm_warn(dev, "genl family registration failed\n");
> +		return ret;
> +	}
> +
> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
> +
> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
> +	if (ret)
> +		goto genl_unregister;
> +
> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
> +
> +	return ret;
> +
> +genl_unregister:
> +	genl_unregister_family(&dev->drm_genl_family);
> +	return ret;
> +}
> +
> +/**
> + * drm_genl_exit: destroy drm_dev_xarray
> + */
> +void drm_genl_exit(void)
> +{
> +	xa_destroy(&drm_dev_xarray);
> +}
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c490977ee250..d3ae91b7714d 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -8,6 +8,7 @@
>   
>   #include <drm/drm_legacy.h>
>   #include <drm/drm_mode_config.h>
> +#include <drm/drm_netlink.h>
>   
>   struct drm_driver;
>   struct drm_minor;
> @@ -318,6 +319,13 @@ struct drm_device {
>   	 */
>   	struct dentry *debugfs_root;
>   
> +	/**
> +	 * @drm_genl_family:
> +	 *
> +	 * Generic netlink family registration structure.
> +	 */
> +	struct genl_family drm_genl_family;
> +
>   	/* Everything below here is for legacy driver, never use! */
>   	/* private: */
>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index e2640dc64e08..ebdb7850d235 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -434,6 +434,13 @@ struct drm_driver {
>   	 */
>   	const struct file_operations *fops;
>   
> +	/**
> +	 * @genl_ops:
> +	 *
> +	 * Drivers private callback to genl commands
> +	 */
> +	const struct driver_genl_ops *genl_ops;
> +
>   #ifdef CONFIG_DRM_LEGACY
>   	/* Everything below here is for legacy driver, never use! */
>   	/* private: */
> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..54527dae7847
> --- /dev/null
> +++ b/include/drm/drm_netlink.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#ifndef __DRM_NETLINK_H__
> +#define __DRM_NETLINK_H__
> +
> +#include <linux/netdevice.h>
> +#include <net/genetlink.h>
> +#include <net/sock.h>
> +#include <uapi/drm/drm_netlink.h>
> +
> +struct drm_device;
> +
> +struct driver_genl_ops {
> +	int		       (*doit)(struct drm_device *dev,
> +				       struct sk_buff *skb,

The skb parameter is currently not used (both xe_genl_list_errors() and
xe_genl_read_error() allocate a new skb).
Did you add it because it might be needed for future ops?

> +				       struct genl_info *info);
> +};
> +
> +int drm_genl_register(struct drm_device *dev);
> +void drm_genl_exit(void);
> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
> +struct sk_buff *
> +drm_genl_alloc_msg(struct drm_device *dev,
> +		   struct genl_info *info,
> +		   size_t msg_size, void **usrhdr);
> +#endif
> +
> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..aab42147a20e
> --- /dev/null
> +++ b/include/uapi/drm/drm_netlink.h
> @@ -0,0 +1,83 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright 2023 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + */
> +
> +#ifndef _DRM_NETLINK_H_
> +#define _DRM_NETLINK_H_
> +
> +#define DRM_GENL_VERSION 1
> +
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif
> +
> +/**
> + * enum drm_genl_error_cmds - Supported error commands
> + *
> + */
> +enum drm_genl_error_cmds {
> +	DRM_CMD_UNSPEC,
> +	/** @DRM_RAS_CMD_QUERY: Command to list all error names with config-id */
> +	DRM_RAS_CMD_QUERY,
> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
> +	DRM_RAS_CMD_READ_ONE,
> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
> +	DRM_RAS_CMD_READ_ALL,
> +
> +	__DRM_CMD_MAX,
> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
> +};
> +
> +/**
> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
> + *
> + */
> +enum drm_error_attr {
> +	DRM_ATTR_UNSPEC,
> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
> +	/**
> +	 * @DRM_RAS_ATTR_REQUEST: Should be used with DRM_RAS_CMD_QUERY,
> +	 * DRM_RAS_CMD_READ_ALL
> +	 */
> +	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
> +	/**
> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
> +	 */
> +	DRM_RAS_ATTR_QUERY_REPLY, /*NLA_NESTED*/

Maybe a space before and after NLA_NESTED?

Thanks,
Tomer

> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
> +
> +	__DRM_ATTR_MAX,
> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +#if defined(__cplusplus)
> +}
> +#endif
> +
> +#endif
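
As a side note on the enum layout quoted above: the "__X_MAX" sentinel with
"X_MAX = __X_MAX - 1" keeps X_MAX equal to the highest valid value, which is
why the policy tables in drm_netlink.c are sized DRM_ATTR_MAX + 1. A minimal
standalone sketch of the idiom follows; every SKETCH_* name and
sketch_attr_valid() are hypothetical and not part of the series.

```c
/* Hypothetical attribute enum using the same sentinel idiom as the patch. */
enum sketch_attr {
	SKETCH_ATTR_UNSPEC,
	SKETCH_ATTR_REQUEST,
	SKETCH_ATTR_ERROR_ID,

	__SKETCH_ATTR_MAX,                        /* one past the last attribute */
	SKETCH_ATTR_MAX = __SKETCH_ATTR_MAX - 1,  /* highest valid attribute */
};

/* A table indexed by attribute id needs MAX + 1 slots (index 0 is UNSPEC). */
static const unsigned char sketch_policy[SKETCH_ATTR_MAX + 1] = {
	[SKETCH_ATTR_REQUEST] = 1,  /* stands in for an NLA_U8 policy entry */
	[SKETCH_ATTR_ERROR_ID] = 8, /* stands in for an NLA_U64 policy entry */
};

/* Valid attributes are strictly above UNSPEC and at most SKETCH_ATTR_MAX. */
static int sketch_attr_valid(int attr)
{
	return attr > SKETCH_ATTR_UNSPEC && attr <= SKETCH_ATTR_MAX;
}
```

Appending a new attribute before the sentinel grows SKETCH_ATTR_MAX and the
table together, without renumbering existing values.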




* Re: [RFC v3 3/5] drm/xe/RAS: Expose the error counters
  2023-10-20 15:58 ` [RFC v3 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
  2023-10-20 20:39   ` Ruhl, Michael J
@ 2023-11-10 12:27   ` Tomer Tayar
  2023-11-22 14:33     ` Aravind Iddamsetty
  1 sibling, 1 reply; 31+ messages in thread
From: Tomer Tayar @ 2023-11-10 12:27 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 20/10/2023 18:58, Aravind Iddamsetty wrote:
> We expose the various error counters supported on a hardware via genl
> subsystem through the registered commands to userspace. The
> DRM_RAS_CMD_QUERY lists the error names with config id,
> DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
> id and the DRM_RAS_CMD_READ_ALL lists the counters for all errors along
> with their names and config ids.
>
> v2: Rebase
>
> v3:
> 1. presently xe_list_errors fills blank data for IGFX, prevent it by
> having an early check of IS_DGFX (Michael J. Ruhl)
> 2. update errors from all sources
>
> Cc: Ruhl, Michael J<michael.j.ruhl@intel.com>
> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
> ---
>   drivers/gpu/drm/xe/xe_netlink.c | 499 +++++++++++++++++++++++++++++++-
>   include/uapi/drm/xe_drm.h       |  81 ++++++
>   2 files changed, 578 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
> index 81d785455632..3e4cdb5e4920 100644
> --- a/drivers/gpu/drm/xe/xe_netlink.c
> +++ b/drivers/gpu/drm/xe/xe_netlink.c
> @@ -2,16 +2,511 @@
>   /*
>    * Copyright © 2023 Intel Corporation
>    */
> +#include <drm/xe_drm.h>
> +
>   #include "xe_device.h"
>   
> -static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +#define MAX_ERROR_NAME	100
> +
> +static const char * const xe_hw_error_events[] = {
> +		[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
> +		[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
> +		[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
> +		[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
> +		[XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
> +		[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
> +		[XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
> +		[XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
> +		[XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
> +		[XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
> +		[XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
> +		[XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
> +		[XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
> +		[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
> +		[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
> +		[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
> +		[XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
> +		[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
> +		[XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
> +		[XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
> +		[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
> +		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
> +		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
> +		[XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
> +		[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
> +		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
> +		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
> +		[XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
> +		[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
> +		[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
> +		[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
> +		[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
> +		[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
> +		[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
> +		[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
> +		[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
> +		[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
> +		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
> +		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
> +		[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
> +		[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
> +		[XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
> +		[XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
> +		[XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
> +		[XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
> +		[XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
> +		[XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
> +		[XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
> +		[XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
> +};
> +
> +static const unsigned long xe_hw_error_map[] = {
> +	[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
> +	[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
> +	[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
> +	[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
> +	[XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
> +	[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
> +	[XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
> +	[XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
> +	[XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
> +	[XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
> +	[XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
> +	[XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
> +	[XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
> +	[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
> +	[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
> +	[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
> +	[XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
> +	[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
> +	[XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
> +	[XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
> +	[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
> +	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
> +	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
> +	[XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
> +	[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
> +	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
> +	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
> +	[XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
> +	[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
> +	[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
> +	[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
> +	[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
> +	[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
> +	[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
> +	[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
> +	[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
> +	[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
> +	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
> +	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
> +	[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
> +	[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
> +	[XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
> +	[XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
> +	[XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
> +	[XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
> +	[XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
> +	[XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
> +	[XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
> +	[XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
> +};
> +
> +static unsigned int config_gt_id(const u64 config)
> +{
> +	return config >> __XE_PMU_GT_SHIFT;
> +}
> +
> +static u64 config_counter(const u64 config)
>   {
> +	return config & ~(~0ULL << __XE_PMU_GT_SHIFT);
> +}
> +
> +static bool is_gt_error(const u64 config)
> +{
> +	unsigned int error;
> +
> +	error = config_counter(config);
> +	if (error <= XE_GENL_GT_ERROR_FATAL_FPU)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_gt_vector_error(const u64 config)
> +{
> +	unsigned int error;
> +
> +	error = config_counter(config);
> +	if (error >= XE_GENL_GT_ERROR_FATAL_TLB &&
> +	    error <= XE_GENL_GT_ERROR_FATAL_L3BANK)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_pvc_invalid_gt_errors(const u64 config)
> +{
> +	switch (config_counter(config)) {
> +	case XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
> +	case XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
> +	case XE_GENL_GT_ERROR_FATAL_ARR_BIST:
> +	case XE_GENL_GT_ERROR_FATAL_L3_DOUB:
> +	case XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
> +	case XE_GENL_GT_ERROR_FATAL_IDI_PAR:
> +	case XE_GENL_GT_ERROR_FATAL_SQIDI:
> +	case XE_GENL_GT_ERROR_FATAL_SAMPLER:
> +	case XE_GENL_GT_ERROR_FATAL_EU_IC:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> +static bool is_gsc_hw_error(const u64 config)
> +{
> +	if (config_counter(config) >= XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
> +	    config_counter(config) <= XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
> +		return true;
> +
> +	return false;
> +}
> +
> +static bool is_soc_error(const u64 config)
> +{
> +	if (config_counter(config) >= XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
> +	    config_counter(config) <= XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
> +		return true;
> +
> +	return false;
> +}
> +
> +static int
> +config_status(struct xe_device *xe, u64 config)
> +{
> +	unsigned int gt_id = config_gt_id(config);
> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
> +
> +	if (!IS_DGFX(xe))
> +		return -ENODEV;
> +
> +	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
> +		return -ENOENT;
> +
> +	/* GSC HW ERRORS are present on root tile of
> +	 * platform supporting MEMORY SPARING only
> +	 */
> +	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
> +		return -ENODEV;
> +
> +	/* GT vectors error  are valid on Platforms supporting error vectors only */
> +	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
> +		return -ENODEV;
> +
> +	/* Skip gt errors not supported on pvc */
> +	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
> +		return  -ENODEV;
> +
> +	/* FATAL FPU error is valid on PVC only */
> +	if (config_counter(config) == XE_GENL_GT_ERROR_FATAL_FPU &&
> +	    !(xe->info.platform == XE_PVC))
> +		return -ENODEV;
> +
> +	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
> +		return -ENODEV;
> +
> +	return (config_counter(config) >=
> +			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
> +}
> +
> +static u64 get_counter_value(struct xe_device *xe, u64 config)
> +{
> +	const unsigned int gt_id = config_gt_id(config);
> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
> +	unsigned int id = config_counter(config);
> +
> +	if (is_gt_error(config) || is_gt_vector_error(config))
> +		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
> +
> +	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
> +}
> +
> +int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)

Should it be static?

> +{
> +	struct nlattr *entry_attr;
> +	bool counter = false;
> +	struct xe_gt *gt;
> +	int i, j;
> +
> +	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
> +		     ARRAY_SIZE(xe_hw_error_map));
> +
> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
> +		counter = true;
> +
> +	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
> +	if (!entry_attr)
> +		return -EMSGSIZE;
> +
> +	for_each_gt(gt, xe, j) {
> +		char str[MAX_ERROR_NAME];
> +		u64 val;
> +
> +		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
> +			u64 config = XE_HW_ERROR(j, i);
> +
> +			if (config_status(xe, config))
> +				continue;
> +
> +			/* should this be cleared everytime */
> +			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
> +
> +			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
> +				goto err;
> +			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
> +				goto err;
> +			if (counter) {
> +				val = get_counter_value(xe, config);
> +				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD))
> +					goto err;
> +			}
> +		}
> +	}
> +
> +	nla_nest_end(new_msg, entry_attr);
> +
>   	return 0;
> +err:
> +	drm_dbg_driver(&xe->drm, "msg buff is small\n");
> +	nla_nest_cancel(new_msg, entry_attr);
> +	nlmsg_free(new_msg);
> +
> +	return -EMSGSIZE;
> +}
> +
> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
> +{
> +	struct xe_device *xe = to_xe_device(drm);
> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
> +	struct sk_buff *new_msg;
> +	int retries = 2;
> +	void *usrhdr;
> +	int ret = 0;
> +
> +	if (!IS_DGFX(xe))
> +		return -ENODEV;
> +
> +	do {
> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
> +		if (!new_msg)
> +			return -ENOMEM;
> +
> +		ret = fill_error_details(xe, info, new_msg);
> +		if (!ret)
> +			break;
> +
> +		msg_size += NLMSG_DEFAULT_SIZE;
> +	} while (retries--);
> +
> +	if (!ret)
> +		ret = drm_genl_reply(new_msg, info, usrhdr);
> +
> +	return ret;
>   }
>   
>   static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>   {
> -	return 0;
> +	struct xe_device *xe = to_xe_device(drm);
> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
> +	struct sk_buff *new_msg;
> +	void *usrhdr;
> +	int ret = 0;
> +	int retries = 2;
> +	u64 config, val;
> +
> +	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
> +	ret = config_status(xe, config);
> +	if (ret)
> +		return ret;
> +	do {
> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
> +		if (!new_msg)
> +			return -ENOMEM;
> +
> +		val = get_counter_value(xe, config);
> +		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
> +			msg_size += NLMSG_DEFAULT_SIZE;
> +			continue;
> +		}

Here ERROR_ID is provided and ERROR_VALUE is returned, but maybe we can
also return ERROR_NAME for the "full picture"?
Or do you expect the regular flow to be listing all errors first,
grepping for the name of the required error, and then using its ID to
get the value, so that userspace already has the name?
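
To make the second option concrete, here is a minimal userspace sketch of
that "list first, then read by ID" flow. It assumes the listing has already
been parsed into name/ID pairs; the rows are copied from the sample drm_ras
output in the cover letter. The helper names and GT_SHIFT (60, inferred
from the gt1 IDs starting with 0x1 in that output) are assumptions for
illustration only and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed shift; inferred from the sample output, not taken from the patch. */
#define GT_SHIFT 60

struct ras_entry {
	const char *name;
	uint64_t id;
};

/* A few rows as a list query would return them (from the cover letter). */
static const struct ras_entry listing[] = {
	{ "error-gt0-correctable-guc", 0x0000000000000001ULL },
	{ "error-gt0-fatal-fpu",       0x0000000000000010ULL },
	{ "error-gt1-correctable-guc", 0x1000000000000001ULL },
	{ "error-gt1-fatal-tlb",       0x1000000000000011ULL },
};

/* Hypothetical helper: resolve a name from the listing to its error ID. */
static int find_error_id(const char *name, uint64_t *id)
{
	assert(name && id);

	for (size_t i = 0; i < sizeof(listing) / sizeof(listing[0]); i++) {
		if (!strcmp(listing[i].name, name)) {
			*id = listing[i].id;
			return 0;
		}
	}
	return -1;
}

/* Mirror of config_gt_id(): the GT index lives in the top bits. */
static unsigned int id_to_gt(uint64_t id)
{
	return (unsigned int)(id >> GT_SHIFT);
}

/* Mirror of config_counter(): the counter index lives in the low bits. */
static uint64_t id_to_counter(uint64_t id)
{
	return id & ~(~0ULL << GT_SHIFT);
}
```

With this flow, looking up "error-gt1-fatal-tlb" gives an ID that decodes to
GT 1 and counter 17, so the name is already known to userspace by the time
it issues the single-counter read.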

> +
> +		break;
> +	} while (retries--);

Is it really possible that NLMSG_DEFAULT_SIZE won't be enough for a
single counter read?
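
As a rough back-of-the-envelope check, the sketch below adds up the
netlink framing for a reply carrying one u64 attribute. The header sizes
follow the usual netlink layout (16-byte nlmsghdr, 4-byte genlmsghdr,
4-byte nlattr, 4-byte alignment); the function names are made up for this
example, and the size of any family-specific user header is left as a
parameter because I don't know what drm_genl_alloc_msg() reserves.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGN4(len)   (((len) + 3) & ~(size_t)3) /* netlink pads to 4 bytes */
#define NLMSG_HDR_LEN ((size_t)16)               /* struct nlmsghdr */
#define GENL_HDR_LEN  ((size_t)4)                /* struct genlmsghdr */
#define NLA_HDR_LEN   ((size_t)4)                /* struct nlattr */

/* Space taken by one attribute: header plus payload, padded to 4 bytes. */
static size_t attr_total_size(size_t payload)
{
	/* nla_len is a 16-bit field, so header + payload must fit in it. */
	assert(payload <= 0xffff - NLA_HDR_LEN);
	return ALIGN4(NLA_HDR_LEN + payload);
}

/*
 * A u64 attribute; some architectures need an extra zero-length pad
 * attribute in front of it to keep the payload 8-byte aligned.
 */
static size_t attr_total_size_u64(int needs_pad)
{
	return attr_total_size(8) + (needs_pad ? attr_total_size(0) : 0);
}

/* Whole reply: netlink header, genl header, user header, one u64 value. */
static size_t single_counter_reply_len(size_t user_hdr_len, int needs_pad)
{
	return NLMSG_HDR_LEN + GENL_HDR_LEN + ALIGN4(user_hdr_len) +
	       attr_total_size_u64(needs_pad);
}
```

Even with the pad attribute this comes to a few tens of bytes when there is
no user header, while NLMSG_DEFAULT_SIZE is on the order of a page, which is
why the retry loop looks unnecessary to me for this command.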

Thanks,
Tomer

> +
> +	ret = drm_genl_reply(new_msg, info, usrhdr);
> +
> +	return ret;
>   }
>   
>   /* driver callbacks to DRM netlink commands*/
> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> index 60cc6418d9a7..dbb3f1afba5f 100644
> --- a/include/uapi/drm/xe_drm.h
> +++ b/include/uapi/drm/xe_drm.h
> @@ -1087,6 +1087,87 @@ struct drm_xe_vm_madvise {
>   #define XE_PMU_MEDIA_GROUP_BUSY(gt)		___XE_PMU_OTHER(gt, 3)
>   #define XE_PMU_ANY_ENGINE_GROUP_BUSY(gt)	___XE_PMU_OTHER(gt, 4)
>   
> +/**
> + * DOC: XE GENL netlink event IDs
> + * TODO: Add more details
> + */
> +#define XE_HW_ERROR(gt, id) \
> +	((id) | ((__u64)(gt) << __XE_PMU_GT_SHIFT))
> +
> +#define XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_GUC		(1)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_SLM		(3)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
> +#define XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
> +#define XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
> +#define XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
> +#define XE_GENL_GT_ERROR_FATAL_GUC			(9)
> +#define XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
> +#define XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
> +#define XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
> +#define XE_GENL_GT_ERROR_FATAL_SLM			(13)
> +#define XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
> +#define XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
> +#define XE_GENL_GT_ERROR_FATAL_FPU			(16)
> +#define XE_GENL_GT_ERROR_FATAL_TLB			(17)
> +#define XE_GENL_GT_ERROR_FATAL_L3_FABRIC		(18)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
> +#define XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
> +#define XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
> +#define XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
> +#define XE_GENL_SGUNIT_ERROR_CORRECTABLE		(23)
> +#define XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
> +#define XE_GENL_SGUNIT_ERROR_FATAL			(25)
> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
> +#define XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
> +#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
> +#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
> +#define XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD			(36)
> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP			(37)
> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ			(38)
> +#define XE_GENL_SOC_ERROR_FATAL_ANR_MDFI		(39)
> +#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2T		(40)
> +#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2C		(41)
> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_AER		(42)
> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_ERR		(43)
> +#define XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
> +#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
> +
> +#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
> +		(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
> +#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
> +		(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
> +
> +/* 109 is the last ID used by SOC errors */
> +#define XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
> +#define XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
> +#define XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
> +#define XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
> +#define XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
> +#define XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
> +#define XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
> +#define XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
> +#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
> +#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
> +#define XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
> +#define XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY	(121)
> +#define XE_GENL_SGGI_ERROR_NONFATAL			(122)
> +#define XE_GENL_SGLI_ERROR_NONFATAL			(123)
> +#define XE_GENL_SGCI_ERROR_NONFATAL			(124)
> +#define XE_GENL_MERT_ERROR_NONFATAL			(125)
> +#define XE_GENL_SGGI_ERROR_FATAL			(126)
> +#define XE_GENL_SGLI_ERROR_FATAL			(127)
> +#define XE_GENL_SGCI_ERROR_FATAL			(128)
> +#define XE_GENL_MERT_ERROR_FATAL			(129)
> +
>   #if defined(__cplusplus)
>   }
>   #endif
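
For anyone checking the "109 is the last ID used by SOC errors" comment,
the HBM ID arithmetic can be worked through as below. The three macros and
the GSC base value are copied from the hunk above; the two wrapper
functions and their range checks (ss in 0..1, n in 0..15, matching the
entries used in xe_hw_error_map) are only illustrative.

```c
#include <assert.h>

/* Copied from the proposed uapi header. */
#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS	(45)
#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
		(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
		(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
#define XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC	(110)

/* Illustrative wrapper: non-fatal HBM IDs occupy 46..77. */
static unsigned int soc_hbm_nonfatal_id(unsigned int ss, unsigned int n)
{
	assert(ss < 2 && n < 16);
	return XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n);
}

/* Illustrative wrapper: fatal HBM IDs follow directly, occupying 78..109. */
static unsigned int soc_hbm_fatal_id(unsigned int ss, unsigned int n)
{
	assert(ss < 2 && n < 16);
	return XE_GENL_SOC_ERROR_FATAL_HBM(ss, n);
}
```

So NONFATAL_HBM(0, 0) is 45 + 1 = 46, NONFATAL_HBM(1, 15) is 46 + 16 + 15 = 77,
FATAL_HBM(0, 0) is 78, and FATAL_HBM(1, 15) is 78 + 16 + 15 = 109, which leaves
110 free for the first GSC ID as the comment says.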




* Re: [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-10-20 15:58 ` [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
  2023-10-20 20:40   ` Ruhl, Michael J
@ 2023-11-10 12:27   ` Tomer Tayar
  2023-11-12 15:28     ` Tomer Tayar
  1 sibling, 1 reply; 31+ messages in thread
From: Tomer Tayar @ 2023-11-10 12:27 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 20/10/2023 18:58, Aravind Iddamsetty wrote:
> Whenever a correctable or an uncorrectable error happens an event is sent
> to the corresponding listeners of these groups.
>
> v2: Rebase
>
> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
> ---
>   drivers/gpu/drm/xe/xe_hw_error.c | 33 ++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
> index bab6d4cf0b69..b0befb5e01cb 100644
> --- a/drivers/gpu/drm/xe/xe_hw_error.c
> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
> @@ -786,6 +786,37 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>   				(HARDWARE_ERROR_MAX << 1) + 1);
>   }
>   
> +static void
> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
> +{
> +	struct sk_buff *msg;
> +	void *hdr;
> +
> +	if (!xe->drm.drm_genl_family.module)
> +		return;
> +
> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
> +	if (!msg) {
> +		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
> +		return;
> +	}
> +
> +	hdr = genlmsg_put(msg, 0, 0, &xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
> +	if (!hdr) {
> +		drm_dbg_driver(&xe->drm, "mutlicast msg buffer is small\n");
> +		nlmsg_free(msg);
> +		return;
> +	}
> +
> +	genlmsg_end(msg, hdr);
> +
> +	genlmsg_multicast(&xe->drm.drm_genl_family, msg, 0,
> +			  hw_err ?
> +			  DRM_GENL_MCAST_UNCORR_ERR
> +			  : DRM_GENL_MCAST_CORR_ERR,
> +			  GFP_ATOMIC);

I agree that hiding/wrapping any netlink/genetlink API/macro with a DRM
helper would sometimes be redundant,
and that in some cases the specific DRM driver would have to "get its
hands dirty" and deal with netlink (e.g. fill_error_details() in patch #3).
However, maybe a DRM helper would be useful here, so we won't see
a copy of this sequence in other DRM drivers?

Thanks,
Tomer

> +}
> +
>   static void
>   xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>   {
> @@ -849,6 +880,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
>   	}
>   
>   	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err), errsrc);
> +
> +	generate_netlink_event(tile_to_xe(tile), hw_err);
>   unlock:
>   	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>   }




* Re: [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-11-10 12:27   ` Tomer Tayar
@ 2023-11-12 15:28     ` Tomer Tayar
  2023-11-22 14:34       ` Aravind Iddamsetty
  0 siblings, 1 reply; 31+ messages in thread
From: Tomer Tayar @ 2023-11-12 15:28 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 10/11/2023 14:27, Tomer Tayar wrote:
> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>> Whenever a correctable or an uncorrectable error happens an event is sent
>> to the corresponding listeners of these groups.
>>
>> v2: Rebase
>>
>> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
>> ---
>>    drivers/gpu/drm/xe/xe_hw_error.c | 33 ++++++++++++++++++++++++++++++++
>>    1 file changed, 33 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>> index bab6d4cf0b69..b0befb5e01cb 100644
>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>> @@ -786,6 +786,37 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>    				(HARDWARE_ERROR_MAX << 1) + 1);
>>    }
>>    
>> +static void
>> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
>> +{
>> +	struct sk_buff *msg;
>> +	void *hdr;
>> +
>> +	if (!xe->drm.drm_genl_family.module)
>> +		return;
>> +
>> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
>> +	if (!msg) {
>> +		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
>> +		return;
>> +	}
>> +
>> +	hdr = genlmsg_put(msg, 0, 0, &xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
>> +	if (!hdr) {
>> +		drm_dbg_driver(&xe->drm, "mutlicast msg buffer is small\n");
>> +		nlmsg_free(msg);
>> +		return;
>> +	}
>> +
>> +	genlmsg_end(msg, hdr);
>> +
>> +	genlmsg_multicast(&xe->drm.drm_genl_family, msg, 0,
>> +			  hw_err ?
>> +			  DRM_GENL_MCAST_UNCORR_ERR
>> +			  : DRM_GENL_MCAST_CORR_ERR,
>> +			  GFP_ATOMIC);
> I agree that hiding/wrapping any netlink/genetlink API/macro with a DRM
> helper would sometimes be redundant,
> and that in some cases the specific DRM driver would have to "get its
> hands dirty" and deal with netlink (e.g. fill_error_details() in patch #3).
> However, maybe a DRM helper would be useful here, so we won't see
> a copy of this sequence in other DRM drivers?
>
> Thanks,
> Tomer

After rethinking, it is possible that different DRM drivers will need
some flexibility when it comes to calling genlmsg_put(), as they might
want to do more than just this call in order to attach some data related
to the error indication.
In that case, adding a DRM function that wraps it may be redundant.
What do you think?

>> +}
>> +
>>    static void
>>    xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>    {
>> @@ -849,6 +880,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
>>    	}
>>    
>>    	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err), errsrc);
>> +
>> +	generate_netlink_event(tile_to_xe(tile), hw_err);
>>    unlock:
>>    	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>    }
>



* Re: [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-11-10 12:23 ` Tomer Tayar
@ 2023-11-22 14:28   ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-22 14:28 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel, alexander.deucher, airlied,
	daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J


On 11/10/23 17:53, Tomer Tayar wrote:
> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> reporting the errors to the host, which the KMD processes and exposes a
>> set of error counters which can be used by observability tools to take
>> corrective actions or repairs. Traditionally there were being exposed
>> via PMU (for relative counters) and sysfs interface (for absolute
>> value) in our internal branch. But, due to the limitations in this
>> approach to use two interfaces and also not able to have an event based
>> reporting or configurability, an alternative approach to try netlink
>> was suggested by community for drm subsystem wide UAPI for RAS and
>> telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported to which userpace can subscribe to and
>> be notified when any error occurs and read the error counter this avoids
>> continuous polling on error counter. This can also be extended to
>> threshold based notification.
> Hi Aravind,

Hi Tomer,

Sorry for the late response, I have been sick for a while.

>
> I can see that the "nomenclature" in the patch series is mainly around
> errors.
> When we refer to RAS, can't there be other non-error values which might
> be relevant, e.g. statistics, status/state, etc.?
Yes, RAS in general involves only error handling and the associated
counters, not any other stats or status.


Thanks,
Aravind.
>
> Thanks,
> Tomer
>
>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of https://patchwork.freedesktop.org/series/125373/,
>>
>> v4:
>> 1. Rebase
>> 2. rename drm_genl_send to drm_genl_reply
>> 3. catch error from xa_store and handle appropriately
>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by
>> having an early check of IS_DGFX (Michael J. Ruhl)
>>
>> v3:
>> 1. Rebase on latest RAS series for XE
>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem
>>
>> v2: define common interfaces to genl netlink subsystem that all drm drivers
>> can leverage.
>>
>> Below is an example tool drm_ras which demonstrates the use of the
>> supported commands. The tool will be sent to ML with the subject
>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>> https://patchwork.freedesktop.org/series/118437/#rev2
>>
>> read single error counter:
>>
>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>> counter value 0
>>
>> read all error counters:
>>
>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>> name                                                    config-id               counter
>>
>> error-gt0-correctable-guc                               0x0000000000000001      0
>> error-gt0-correctable-slm                               0x0000000000000003      0
>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>> error-gt0-fatal-guc                                     0x0000000000000009      0
>> error-gt0-fatal-slm                                     0x000000000000000d      0
>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>> error-gt0-correctable-subslice                          0x0000000000000013      0
>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>> error-gt0-fatal-subslice                                0x0000000000000015      0
>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>> error-gt1-correctable-guc                               0x1000000000000001      0
>> error-gt1-correctable-slm                               0x1000000000000003      0
>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>> error-gt1-fatal-guc                                     0x1000000000000009      0
>> error-gt1-fatal-slm                                     0x100000000000000d      0
>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>> error-gt1-correctable-subslice                          0x1000000000000013      0
>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>> error-gt1-fatal-subslice                                0x1000000000000015      0
>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>
>> wait on an error event:
>>
>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>> waiting for error event
>> error event received
>> counter value 0
>>
>> list all errors:
>>
>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>> name                                                    config-id
>>
>> error-gt0-correctable-guc                               0x0000000000000001
>> error-gt0-correctable-slm                               0x0000000000000003
>> error-gt0-correctable-eu-ic                             0x0000000000000004
>> error-gt0-correctable-eu-grf                            0x0000000000000005
>> error-gt0-fatal-guc                                     0x0000000000000009
>> error-gt0-fatal-slm                                     0x000000000000000d
>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>> error-gt0-fatal-fpu                                     0x0000000000000010
>> error-gt0-fatal-tlb                                     0x0000000000000011
>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>> error-gt0-correctable-subslice                          0x0000000000000013
>> error-gt0-correctable-l3bank                            0x0000000000000014
>> error-gt0-fatal-subslice                                0x0000000000000015
>> error-gt0-fatal-l3bank                                  0x0000000000000016
>> error-gt0-sgunit-correctable                            0x0000000000000017
>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>> error-gt0-sgunit-fatal                                  0x0000000000000019
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>> error-gt0-soc-fatal-punit                               0x000000000000001d
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>> error-gt1-correctable-guc                               0x1000000000000001
>> error-gt1-correctable-slm                               0x1000000000000003
>> error-gt1-correctable-eu-ic                             0x1000000000000004
>> error-gt1-correctable-eu-grf                            0x1000000000000005
>> error-gt1-fatal-guc                                     0x1000000000000009
>> error-gt1-fatal-slm                                     0x100000000000000d
>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>> error-gt1-fatal-fpu                                     0x1000000000000010
>> error-gt1-fatal-tlb                                     0x1000000000000011
>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>> error-gt1-correctable-subslice                          0x1000000000000013
>> error-gt1-correctable-l3bank                            0x1000000000000014
>> error-gt1-fatal-subslice                                0x1000000000000015
>> error-gt1-fatal-l3bank                                  0x1000000000000016
>> error-gt1-sgunit-correctable                            0x1000000000000017
>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>> error-gt1-sgunit-fatal                                  0x1000000000000019
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>> error-gt1-soc-fatal-punit                               0x100000000000001d
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Oded Gabbay <ogabbay@kernel.org>
>> Cc: Tomer Tayar <ttayar@habana.ai>
>> Cc: Hawking Zhang <Hawking.Zhang@amd.com>
>> Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
>> Cc: Kuehling Felix <Felix.Kuehling@amd.com>
>> Cc: Tuikov Luben <Luben.Tuikov@amd.com>
>> Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>
>>
>>
>> Aravind Iddamsetty (5):
>>    drm/netlink: Add netlink infrastructure
>>    drm/xe/RAS: Register netlink capability
>>    drm/xe/RAS: Expose the error counters
>>    drm/netlink: Define multicast groups
>>    drm/xe/RAS: send multicast event on occurrence of an error
>>
>>   drivers/gpu/drm/Makefile             |   1 +
>>   drivers/gpu/drm/drm_drv.c            |   7 +
>>   drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
>>   drivers/gpu/drm/xe/Makefile          |   1 +
>>   drivers/gpu/drm/xe/xe_device.c       |   4 +
>>   drivers/gpu/drm/xe/xe_device_types.h |   1 +
>>   drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
>>   drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
>>   include/drm/drm_device.h             |   8 +
>>   include/drm/drm_drv.h                |   7 +
>>   include/drm/drm_netlink.h            |  35 ++
>>   include/uapi/drm/drm_netlink.h       |  87 +++++
>>   include/uapi/drm/xe_drm.h            |  81 +++++
>>   13 files changed, 977 insertions(+)
>>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>   create mode 100644 include/drm/drm_netlink.h
>>   create mode 100644 include/uapi/drm/drm_netlink.h
>>
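
Going only by the config-id columns in the listings quoted above, the gt1 ids look like the gt0 ids with the top nibble set to 1 (0x0000000000000044 vs 0x1000000000000044). A minimal userspace sketch of splitting such an id is below. The 4-bit field at bit 60 is an assumption inferred from that output, not something the series states, and the EXAMPLE_*/example_* names are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Assumption: the GT index lives in bits 63:60 of the config-id and the
 * remaining bits carry the per-GT error id. This is inferred from the
 * sample drm_ras output only; the real layout is defined by the driver.
 */
#define EXAMPLE_GT_SHIFT	60
#define EXAMPLE_ERR_MASK	((UINT64_C(1) << EXAMPLE_GT_SHIFT) - 1)

/* Return the (assumed) GT index encoded in a config-id. */
static unsigned int example_config_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> EXAMPLE_GT_SHIFT);
}

/* Return the (assumed) per-GT error id encoded in a config-id. */
static uint64_t example_config_error(uint64_t config_id)
{
	return config_id & EXAMPLE_ERR_MASK;
}

/* Print a config-id split into its assumed fields. */
static void example_print_config(uint64_t config_id)
{
	printf("gt%u error 0x%llx\n", example_config_gt(config_id),
	       (unsigned long long)example_config_error(config_id));
}
```

Calling example_print_config(0x1000000000000044) would, under that assumption, report GT 1 and error id 0x44, which matches the error-gt1-soc-fatal-hbm-ss3-7 row.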

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-11-10 12:24   ` Tomer Tayar
@ 2023-11-22 14:32     ` Aravind Iddamsetty
  2023-11-23  7:26       ` Tomer Tayar
  0 siblings, 1 reply; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-22 14:32 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel, alexander.deucher, airlied,
	daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J


On 11/10/23 17:54, Tomer Tayar wrote:
> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>> Define the netlink registration interface, commands and attributes that
>> can be used in common across drm drivers. This patch intends to use
>> the generic netlink family to expose various stats of a device. At present
>> it defines some commands that shall be used to expose RAS error counters.
>>
>> v2:
>> define common interfaces to genl netlink subsystem that all drm drivers
>> can leverage.(Tomer Tayar)
>>
>> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>> register to netlink subsystem (Daniel Vetter)
>>
>> v4:(Michael J. Ruhl)
>> 1. rename drm_genl_send to drm_genl_reply
>> 2. catch error from xa_store and handle appropriately
>>
>> Cc: Tomer Tayar<ttayar@habana.ai>
>> Cc: Daniel Vetter<daniel@ffwll.ch>
>> Cc: Michael J. Ruhl<michael.j.ruhl@intel.com>
>>
>> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
>> ---
>>   drivers/gpu/drm/Makefile       |   1 +
>>   drivers/gpu/drm/drm_drv.c      |   7 ++
>>   drivers/gpu/drm/drm_netlink.c  | 188 +++++++++++++++++++++++++++++++++
>>   include/drm/drm_device.h       |   8 ++
>>   include/drm/drm_drv.h          |   7 ++
>>   include/drm/drm_netlink.h      |  30 ++++++
>>   include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
>>   7 files changed, 324 insertions(+)
>>   create mode 100644 drivers/gpu/drm/drm_netlink.c
>>   create mode 100644 include/drm/drm_netlink.h
>>   create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>> index ee64c51274ad..60864369adaa 100644
>> --- a/drivers/gpu/drm/Makefile
>> +++ b/drivers/gpu/drm/Makefile
>> @@ -35,6 +35,7 @@ drm-y := \
>>   	drm_mode_object.o \
>>   	drm_modes.o \
>>   	drm_modeset_lock.o \
>> +	drm_netlink.o \
>>   	drm_plane.o \
>>   	drm_prime.o \
>>   	drm_print.o \
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 535f16e7882e..31f55c1c7524 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>>   	if (ret)
>>   		goto err_minors;
>>   
>> +	if (driver->genl_ops) {
>> +		ret = drm_genl_register(dev);
>> +		if (ret)
>> +			goto err_minors;
>> +	}
>> +
>>   	ret = create_compat_control_link(dev);
>>   	if (ret)
>>   		goto err_minors;
>> @@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
>>   {
>>   	drm_privacy_screen_lookup_exit();
>>   	accel_core_exit();
>> +	drm_genl_exit();
>>   	unregister_chrdev(DRM_MAJOR, "drm");
>>   	debugfs_remove(drm_debugfs_root);
>>   	drm_sysfs_destroy();
>> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>> new file mode 100644
>> index 000000000000..8add249c1da3
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_netlink.c
>> @@ -0,0 +1,188 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_managed.h>
>> +#include <drm/drm_netlink.h>
>> +#include <drm/drm_print.h>
>> +
>> +DEFINE_XARRAY(drm_dev_xarray);
>> +
>> +/**
>> + * drm_genl_reply - response to a request
>> + * @msg: socket buffer
>> + * @info: receiver information
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
>> +{
>> +	int ret;
>> +
>> +	genlmsg_end(msg, usrhdr);
>> +
>> +	ret = genlmsg_reply(msg, info);
>> +	if (ret)
>> +		nlmsg_free(msg);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(drm_genl_reply);
>> +
>> +/**
>> + * drm_genl_alloc_msg - allocate genl message buffer
>> + * @dev: drm_device for which the message is being allocated
>> + * @info: receiver information
> a description for msg_size is missing
Thanks for catching that, I will add it.
>
>> + * @usrhdr: pointer to user specific header in the message buffer
>> + *
>> + * RETURNS:
>> + * pointer to new allocated buffer on success, NULL on failure
>> + */
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr)
>> +{
>> +	struct sk_buff *new_msg;
>> +
>> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
>> +	if (!new_msg)
>> +		return new_msg;
>> +
>> +	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
>> +	if (!*usrhdr) {
>> +		nlmsg_free(new_msg);
>> +		new_msg = NULL;
>> +	}
>> +
>> +	return new_msg;
>> +}
>> +EXPORT_SYMBOL(drm_genl_alloc_msg);
>> +
>> +static struct drm_device *genl_to_dev(struct genl_info *info)
>> +{
>> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
>> +}
>> +
>> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct drm_device *dev = genl_to_dev(info);
>> +
>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
>> +		return -EINVAL;
>> +
>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>> +		return -EOPNOTSUPP;
>> +
>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>> +}
>> +
>> +/* attribute policies */
>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
>> +};
>> +
>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
>> +};
>> +
>> +/* drm genl operations definition */
>> +const struct genl_ops drm_genl_ops[] = {
>> +	{
>> +		.cmd = DRM_RAS_CMD_QUERY,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ONE,
>> +		.doit = drm_genl_read_error,
>> +		.policy = drm_attr_policy_read_one,
>> +	},
>> +	{
>> +		.cmd = DRM_RAS_CMD_READ_ALL,
>> +		.doit = drm_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +};
>> +
>> +static void drm_genl_family_init(struct drm_device *dev)
>> +{
>> +	/* Use drm primary node name eg: card0 to name the genl family */
>> +	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);
> dev_name() can be used.
> Also, what about accel? Maybe check dev->primary and use primary/accel 
> accordingly?
The present series adds this feature for the primary device only, and I
don't know how it would be used for an accel device. So whoever starts
using this infra for accel devices should make that particular change. Or
do you think it should be added as part of this series itself?
>
>> +	dev->drm_genl_family.version = DRM_GENL_VERSION;
>> +	dev->drm_genl_family.parallel_ops = true;
>> +	dev->drm_genl_family.ops = drm_genl_ops;
>> +	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
>> +	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>> +	dev->drm_genl_family.module = dev->dev->driver->owner;
>> +}
>> +
>> +static void drm_genl_deregister(struct drm_device *dev,  void *arg)
> Redundant space before "void *arg"
Will clean it up.
>
>> +{
>> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
>> +
>> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
>> +
>> +	genl_unregister_family(&dev->drm_genl_family);
>> +}
>> +
>> +/**
>> + * drm_genl_register - Register genl family
>> + * @dev: drm_device for which genl family needs to be registered
>> + *
>> + * RETURNS:
>> + * 0 on success and negative error code on failure
>> + */
>> +int drm_genl_register(struct drm_device *dev)
>> +{
>> +	int ret;
>> +
>> +	drm_genl_family_init(dev);
>> +
>> +	ret = genl_register_family(&dev->drm_genl_family);
>> +	if (ret < 0) {
>> +		drm_warn(dev, "genl family registration failed\n");
>> +		return ret;
>> +	}
>> +
>> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
>> +
>> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
>> +	if (ret)
>> +		goto genl_unregister;
>> +
>> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
>> +
>> +	return ret;
>> +
>> +genl_unregister:
>> +	genl_unregister_family(&dev->drm_genl_family);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * drm_genl_exit: destroy drm_dev_xarray
>> + */
>> +void drm_genl_exit(void)
>> +{
>> +	xa_destroy(&drm_dev_xarray);
>> +}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index c490977ee250..d3ae91b7714d 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -8,6 +8,7 @@
>>   
>>   #include <drm/drm_legacy.h>
>>   #include <drm/drm_mode_config.h>
>> +#include <drm/drm_netlink.h>
>>   
>>   struct drm_driver;
>>   struct drm_minor;
>> @@ -318,6 +319,13 @@ struct drm_device {
>>   	 */
>>   	struct dentry *debugfs_root;
>>   
>> +	/**
>> +	 * @drm_genl_family:
>> +	 *
>> +	 * Generic netlink family registration structure.
>> +	 */
>> +	struct genl_family drm_genl_family;
>> +
>>   	/* Everything below here is for legacy driver, never use! */
>>   	/* private: */
>>   #if IS_ENABLED(CONFIG_DRM_LEGACY)
>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>> index e2640dc64e08..ebdb7850d235 100644
>> --- a/include/drm/drm_drv.h
>> +++ b/include/drm/drm_drv.h
>> @@ -434,6 +434,13 @@ struct drm_driver {
>>   	 */
>>   	const struct file_operations *fops;
>>   
>> +	/**
>> +	 * @genl_ops:
>> +	 *
>> +	 * Drivers private callback to genl commands
>> +	 */
>> +	const struct driver_genl_ops *genl_ops;
>> +
>>   #ifdef CONFIG_DRM_LEGACY
>>   	/* Everything below here is for legacy driver, never use! */
>>   	/* private: */
>> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..54527dae7847
>> --- /dev/null
>> +++ b/include/drm/drm_netlink.h
>> @@ -0,0 +1,30 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#ifndef __DRM_NETLINK_H__
>> +#define __DRM_NETLINK_H__
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/genetlink.h>
>> +#include <net/sock.h>
>> +#include <uapi/drm/drm_netlink.h>
>> +
>> +struct drm_device;
>> +
>> +struct driver_genl_ops {
>> +	int		       (*doit)(struct drm_device *dev,
>> +				       struct sk_buff *skb,
> The skb parameter is currently not used (both xe_genl_list_errors() and 
> xe_genl_read_error() allocate a new skb).
> Did you add because it might be needed for future ops?
Well, I wanted to pass on the details that the netlink subsystem sends and
leave it to the driver to decide whether it wants to use them.
>
>> +				       struct genl_info *info);
>> +};
>> +
>> +int drm_genl_register(struct drm_device *dev);
>> +void drm_genl_exit(void);
>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
>> +struct sk_buff *
>> +drm_genl_alloc_msg(struct drm_device *dev,
>> +		   struct genl_info *info,
>> +		   size_t msg_size, void **usrhdr);
>> +#endif
>> +
>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..aab42147a20e
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_netlink.h
>> @@ -0,0 +1,83 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright 2023 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
>> + */
>> +
>> +#ifndef _DRM_NETLINK_H_
>> +#define _DRM_NETLINK_H_
>> +
>> +#define DRM_GENL_VERSION 1
>> +
>> +#if defined(__cplusplus)
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * enum drm_genl_error_cmds - Supported error commands
>> + *
>> + */
>> +enum drm_genl_error_cmds {
>> +	DRM_CMD_UNSPEC,
>> +	/** @DRM_RAS_CMD_QUERY: Command to list all error names with config-id */
>> +	DRM_RAS_CMD_QUERY,
>> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
>> +	DRM_RAS_CMD_READ_ONE,
>> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
>> +	DRM_RAS_CMD_READ_ALL,
>> +
>> +	__DRM_CMD_MAX,
>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>> +};
>> +
>> +/**
>> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
>> + *
>> + */
>> +enum drm_error_attr {
>> +	DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>> +	/**
>> +	 * @DRM_RAS_ATTR_REQUEST: Should be used with DRM_RAS_CMD_QUERY,
>> +	 * DRM_RAS_CMD_READ_ALL
>> +	 */
>> +	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
>> +	/**
>> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
>> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
>> +	 */
>> +	DRM_RAS_ATTR_QUERY_REPLY, /*NLA_NESTED*/
> Maybe a space before and after NLA_NESTED?

Right, I missed that.

Thanks,
Aravind.
>
> Thanks,
> Tomer
>
>> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
>> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
>> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
>> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
>> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
>> +
>> +	__DRM_ATTR_MAX,
>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>> +};
>> +
>> +#if defined(__cplusplus)
>> +}
>> +#endif
>> +
>> +#endif
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC v3 3/5] drm/xe/RAS: Expose the error counters
  2023-11-10 12:27   ` Tomer Tayar
@ 2023-11-22 14:33     ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-22 14:33 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel, alexander.deucher, airlied,
	daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J


On 11/10/23 17:57, Tomer Tayar wrote:
> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>> We expose the various error counters supported on a hardware via genl
>> subsystem through the registered commands to userspace. The
>> DRM_RAS_CMD_QUERY lists the error names with config id,
>> DRM_RAS_CMD_READ_ONE returns the counter value for the requested config
>> id and the DRM_RAS_CMD_READ_ALL lists the counters for all errors along
>> with their names and config ids.
>>
>> v2: Rebase
>>
>> v3:
>> 1. presently xe_list_errors fills blank data for IGFX, prevent it by
>> having an early check of IS_DGFX (Michael J. Ruhl)
>> 2. update errors from all sources
>>
>> Cc: Ruhl, Michael J<michael.j.ruhl@intel.com>
>> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_netlink.c | 499 +++++++++++++++++++++++++++++++-
>>   include/uapi/drm/xe_drm.h       |  81 ++++++
>>   2 files changed, 578 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>> index 81d785455632..3e4cdb5e4920 100644
>> --- a/drivers/gpu/drm/xe/xe_netlink.c
>> +++ b/drivers/gpu/drm/xe/xe_netlink.c
>> @@ -2,16 +2,511 @@
>>   /*
>>    * Copyright © 2023 Intel Corporation
>>    */
>> +#include <drm/xe_drm.h>
>> +
>>   #include "xe_device.h"
>>   
>> -static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +#define MAX_ERROR_NAME	100
>> +
>> +static const char * const xe_hw_error_events[] = {
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
>> +		[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
>> +		[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
>> +		[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
>> +		[XE_GENL_GT_ERROR_FATAL_GUC] = "fatal-guc",
>> +		[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
>> +		[XE_GENL_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
>> +		[XE_GENL_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
>> +		[XE_GENL_GT_ERROR_FATAL_SLM] = "fatal-slm",
>> +		[XE_GENL_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
>> +		[XE_GENL_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
>> +		[XE_GENL_GT_ERROR_FATAL_FPU] = "fatal-fpu",
>> +		[XE_GENL_GT_ERROR_FATAL_TLB] = "fatal-tlb",
>> +		[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
>> +		[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
>> +		[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
>> +		[XE_GENL_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
>> +		[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
>> +		[XE_GENL_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
>> +		[XE_GENL_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = "soc-nonfatal-csc-psf-cmd-parity",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = "soc-nonfatal-csc-psf-unexpected-completion",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = "soc-nonfatal-csc-psf-unsupported-request",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = "soc-nonfatal-anr-mdfi",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = "soc-nonfatal-mdfi-t2t",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = "soc-nonfatal-mdfi-t2c",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = "soc-nonfatal-hbm-ss0-0",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = "soc-nonfatal-hbm-ss0-1",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = "soc-nonfatal-hbm-ss0-2",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = "soc-nonfatal-hbm-ss0-3",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = "soc-nonfatal-hbm-ss0-4",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = "soc-nonfatal-hbm-ss0-5",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = "soc-nonfatal-hbm-ss0-6",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = "soc-nonfatal-hbm-ss0-7",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = "soc-nonfatal-hbm-ss1-0",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = "soc-nonfatal-hbm-ss1-1",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = "soc-nonfatal-hbm-ss1-2",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = "soc-nonfatal-hbm-ss1-3",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = "soc-nonfatal-hbm-ss1-4",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = "soc-nonfatal-hbm-ss1-5",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = "soc-nonfatal-hbm-ss1-6",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = "soc-nonfatal-hbm-ss1-7",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = "soc-nonfatal-hbm-ss2-0",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = "soc-nonfatal-hbm-ss2-1",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = "soc-nonfatal-hbm-ss2-2",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = "soc-nonfatal-hbm-ss2-3",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = "soc-nonfatal-hbm-ss2-4",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = "soc-nonfatal-hbm-ss2-5",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = "soc-nonfatal-hbm-ss2-6",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = "soc-nonfatal-hbm-ss2-7",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = "soc-nonfatal-hbm-ss3-0",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = "soc-nonfatal-hbm-ss3-1",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = "soc-nonfatal-hbm-ss3-2",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = "soc-nonfatal-hbm-ss3-3",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = "soc-nonfatal-hbm-ss3-4",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = "soc-nonfatal-hbm-ss3-5",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = "soc-nonfatal-hbm-ss3-6",
>> +		[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = "soc-nonfatal-hbm-ss3-7",
>> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = "soc-fatal-csc-psf-cmd-parity",
>> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = "soc-fatal-csc-psf-unexpected-completion",
>> +		[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = "soc-fatal-csc-psf-unsupported-request",
>> +		[XE_GENL_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
>> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = "soc-fatal-pcie-psf-command-parity",
>> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = "soc-fatal-pcie-psf-unexpected-completion",
>> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = "soc-fatal-pcie-psf-unsupported-request",
>> +		[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = "soc-fatal-anr-mdfi",
>> +		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = "soc-fatal-mdfi-t2t",
>> +		[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = "soc-fatal-mdfi-t2c",
>> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = "soc-fatal-malformed-pcie-aer",
>> +		[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = "soc-fatal-malformed-pcie-err",
>> +		[XE_GENL_SOC_ERROR_FATAL_UR_COND] = "soc-fatal-ur-condition-ieh",
>> +		[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = "soc-fatal-from-serr-sources",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
>> +		[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
>> +		[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = "gsc-nonfatal-mia-internal",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = "gsc-nonfatal-vlt-glitch",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = "gsc-nonfatal-self-mbist",
>> +		[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = "gsc-nonfatal-aon-parity",
>> +		[XE_GENL_SGGI_ERROR_NONFATAL] = "sggi-nonfatal-data-parity",
>> +		[XE_GENL_SGLI_ERROR_NONFATAL] = "sgli-nonfatal-data-parity",
>> +		[XE_GENL_SGCI_ERROR_NONFATAL] = "sgci-nonfatal-data-parity",
>> +		[XE_GENL_MERT_ERROR_NONFATAL] = "mert-nonfatal-data-parity",
>> +		[XE_GENL_SGGI_ERROR_FATAL] = "sggi-fatal-data-parity",
>> +		[XE_GENL_SGLI_ERROR_FATAL] = "sgli-fatal-data-parity",
>> +		[XE_GENL_SGCI_ERROR_FATAL] = "sgci-fatal-data-parity",
>> +		[XE_GENL_MERT_ERROR_FATAL] = "mert-fatal-data-parity",
>> +};
>> +
>> +static const unsigned long xe_hw_error_map[] = {
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG] = XE_HW_ERR_GT_CORR_L3_SNG,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_GUC] = XE_HW_ERR_GT_CORR_GUC,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER] = XE_HW_ERR_GT_CORR_SAMPLER,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_SLM] = XE_HW_ERR_GT_CORR_SLM,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_EU_IC] = XE_HW_ERR_GT_CORR_EU_IC,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF] = XE_HW_ERR_GT_CORR_EU_GRF,
>> +	[XE_GENL_GT_ERROR_FATAL_ARR_BIST] = XE_HW_ERR_GT_FATAL_ARR_BIST,
>> +	[XE_GENL_GT_ERROR_FATAL_L3_DOUB] = XE_HW_ERR_GT_FATAL_L3_DOUB,
>> +	[XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK] = XE_HW_ERR_GT_FATAL_L3_ECC_CHK,
>> +	[XE_GENL_GT_ERROR_FATAL_GUC] = XE_HW_ERR_GT_FATAL_GUC,
>> +	[XE_GENL_GT_ERROR_FATAL_IDI_PAR] = XE_HW_ERR_GT_FATAL_IDI_PAR,
>> +	[XE_GENL_GT_ERROR_FATAL_SQIDI] = XE_HW_ERR_GT_FATAL_SQIDI,
>> +	[XE_GENL_GT_ERROR_FATAL_SAMPLER] = XE_HW_ERR_GT_FATAL_SAMPLER,
>> +	[XE_GENL_GT_ERROR_FATAL_SLM] = XE_HW_ERR_GT_FATAL_SLM,
>> +	[XE_GENL_GT_ERROR_FATAL_EU_IC] = XE_HW_ERR_GT_FATAL_EU_IC,
>> +	[XE_GENL_GT_ERROR_FATAL_EU_GRF] = XE_HW_ERR_GT_FATAL_EU_GRF,
>> +	[XE_GENL_GT_ERROR_FATAL_FPU] = XE_HW_ERR_GT_FATAL_FPU,
>> +	[XE_GENL_GT_ERROR_FATAL_TLB] = XE_HW_ERR_GT_FATAL_TLB,
>> +	[XE_GENL_GT_ERROR_FATAL_L3_FABRIC] = XE_HW_ERR_GT_FATAL_L3_FABRIC,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE] = XE_HW_ERR_GT_CORR_SUBSLICE,
>> +	[XE_GENL_GT_ERROR_CORRECTABLE_L3BANK] = XE_HW_ERR_GT_CORR_L3BANK,
>> +	[XE_GENL_GT_ERROR_FATAL_SUBSLICE] = XE_HW_ERR_GT_FATAL_SUBSLICE,
>> +	[XE_GENL_GT_ERROR_FATAL_L3BANK] = XE_HW_ERR_GT_FATAL_L3BANK,
>> +	[XE_GENL_SGUNIT_ERROR_CORRECTABLE] = XE_HW_ERR_TILE_CORR_SGUNIT,
>> +	[XE_GENL_SGUNIT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGUNIT,
>> +	[XE_GENL_SGUNIT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGUNIT,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMD,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_CMP,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_NONFATAL_CSC_PSF_REQ,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI] = XE_HW_ERR_SOC_NONFATAL_ANR_MDFI,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2T,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C] = XE_HW_ERR_SOC_NONFATAL_MDFI_T2C,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL0,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL1,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL2,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL3,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL4,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL5,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL6,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM0_CHNL7,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL0,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL1,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL2,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL3,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL4,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL5,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL6,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(0, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM1_CHNL7,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 0)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL0,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 1)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL1,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 2)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL2,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 3)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL3,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 4)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL4,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 5)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL5,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 6)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL6,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 7)] = XE_HW_ERR_SOC_NONFATAL_HBM2_CHNL7,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 8)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL0,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 9)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL1,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 10)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL2,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 11)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL3,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 12)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL4,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 13)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL5,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 14)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL6,
>> +	[XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15)] = XE_HW_ERR_SOC_NONFATAL_HBM3_CHNL7,
>> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMD,
>> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP] = XE_HW_ERR_SOC_FATAL_CSC_PSF_CMP,
>> +	[XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ] = XE_HW_ERR_SOC_FATAL_CSC_PSF_REQ,
>> +	[XE_GENL_SOC_ERROR_FATAL_PUNIT] = XE_HW_ERR_SOC_FATAL_PUNIT,
>> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMD,
>> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_CMP,
>> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ] = XE_HW_ERR_SOC_FATAL_PCIE_PSF_REQ,
>> +	[XE_GENL_SOC_ERROR_FATAL_ANR_MDFI] = XE_HW_ERR_SOC_FATAL_ANR_MDFI,
>> +	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2T] = XE_HW_ERR_SOC_FATAL_MDFI_T2T,
>> +	[XE_GENL_SOC_ERROR_FATAL_MDFI_T2C] = XE_HW_ERR_SOC_FATAL_MDFI_T2C,
>> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_AER] = XE_HW_ERR_SOC_FATAL_PCIE_AER,
>> +	[XE_GENL_SOC_ERROR_FATAL_PCIE_ERR] = XE_HW_ERR_SOC_FATAL_PCIE_ERR,
>> +	[XE_GENL_SOC_ERROR_FATAL_UR_COND] = XE_HW_ERR_SOC_FATAL_UR_COND,
>> +	[XE_GENL_SOC_ERROR_FATAL_SERR_SRCS] = XE_HW_ERR_SOC_FATAL_SERR_SRCS,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 0)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL0,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 1)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL1,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 2)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL2,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 3)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL3,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 4)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL4,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 5)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL5,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 6)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL6,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 7)] = XE_HW_ERR_SOC_FATAL_HBM0_CHNL7,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 8)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL0,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 9)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL1,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 10)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL2,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 11)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL3,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 12)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL4,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 13)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL5,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 14)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL6,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(0, 15)] = XE_HW_ERR_SOC_FATAL_HBM1_CHNL7,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 0)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL0,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 1)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL1,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 2)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL2,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 3)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL3,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 4)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL4,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 5)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL5,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 6)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL6,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 7)] = XE_HW_ERR_SOC_FATAL_HBM2_CHNL7,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 8)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL0,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 9)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL1,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 10)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL2,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 11)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL3,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 12)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL4,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 13)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL5,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 14)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL6,
>> +	[XE_GENL_SOC_ERROR_FATAL_HBM(1, 15)] = XE_HW_ERR_SOC_FATAL_HBM3_CHNL7,
>> +	[XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC] = XE_HW_ERR_GSC_CORR_SRAM,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = XE_HW_ERR_GSC_NONFATAL_MIA_SHUTDOWN,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL] = XE_HW_ERR_GSC_NONFATAL_MIA_INTERNAL,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC] = XE_HW_ERR_GSC_NONFATAL_SRAM,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = XE_HW_ERR_GSC_NONFATAL_WDG,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY] = XE_HW_ERR_GSC_NONFATAL_ROM_PARITY,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY] = XE_HW_ERR_GSC_NONFATAL_UCODE_PARITY,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH] = XE_HW_ERR_GSC_NONFATAL_VLT_GLITCH,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL] = XE_HW_ERR_GSC_NONFATAL_FUSE_PULL,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = XE_HW_ERR_GSC_NONFATAL_FUSE_CRC,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST] = XE_HW_ERR_GSC_NONFATAL_SELF_MBIST,
>> +	[XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY] = XE_HW_ERR_GSC_NONFATAL_AON_RF_PARITY,
>> +	[XE_GENL_SGGI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGGI,
>> +	[XE_GENL_SGLI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGLI,
>> +	[XE_GENL_SGCI_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_SGCI,
>> +	[XE_GENL_MERT_ERROR_NONFATAL] = XE_HW_ERR_TILE_NONFATAL_MERT,
>> +	[XE_GENL_SGGI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGGI,
>> +	[XE_GENL_SGLI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGLI,
>> +	[XE_GENL_SGCI_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_SGCI,
>> +	[XE_GENL_MERT_ERROR_FATAL] = XE_HW_ERR_TILE_FATAL_MERT,
>> +};
>> +
>> +static unsigned int config_gt_id(const u64 config)
>> +{
>> +	return config >> __XE_PMU_GT_SHIFT;
>> +}
>> +
>> +static u64 config_counter(const u64 config)
>>   {
>> +	return config & ~(~0ULL << __XE_PMU_GT_SHIFT);
>> +}
>> +
>> +static bool is_gt_error(const u64 config)
>> +{
>> +	unsigned int error;
>> +
>> +	error = config_counter(config);
>> +	if (error <= XE_GENL_GT_ERROR_FATAL_FPU)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_gt_vector_error(const u64 config)
>> +{
>> +	unsigned int error;
>> +
>> +	error = config_counter(config);
>> +	if (error >= XE_GENL_GT_ERROR_FATAL_TLB &&
>> +	    error <= XE_GENL_GT_ERROR_FATAL_L3BANK)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_pvc_invalid_gt_errors(const u64 config)
>> +{
>> +	switch (config_counter(config)) {
>> +	case XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG:
>> +	case XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER:
>> +	case XE_GENL_GT_ERROR_FATAL_ARR_BIST:
>> +	case XE_GENL_GT_ERROR_FATAL_L3_DOUB:
>> +	case XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK:
>> +	case XE_GENL_GT_ERROR_FATAL_IDI_PAR:
>> +	case XE_GENL_GT_ERROR_FATAL_SQIDI:
>> +	case XE_GENL_GT_ERROR_FATAL_SAMPLER:
>> +	case XE_GENL_GT_ERROR_FATAL_EU_IC:
>> +		return true;
>> +	default:
>> +		return false;
>> +	}
>> +}
>> +
>> +static bool is_gsc_hw_error(const u64 config)
>> +{
>> +	if (config_counter(config) >= XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
>> +	    config_counter(config) <= XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static bool is_soc_error(const u64 config)
>> +{
>> +	if (config_counter(config) >= XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD &&
>> +	    config_counter(config) <= XE_GENL_SOC_ERROR_FATAL_HBM(1, 15))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static int
>> +config_status(struct xe_device *xe, u64 config)
>> +{
>> +	unsigned int gt_id = config_gt_id(config);
>> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>> +
>> +	if (!IS_DGFX(xe))
>> +		return -ENODEV;
>> +
>> +	if (gt->info.type == XE_GT_TYPE_UNINITIALIZED)
>> +		return -ENOENT;
>> +
>> +	/* GSC HW ERRORS are present on root tile of
>> +	 * platform supporting MEMORY SPARING only
>> +	 */
>> +	if (is_gsc_hw_error(config) && !(xe->info.platform == XE_PVC && !gt_id))
>> +		return -ENODEV;
>> +
>> +	/* GT vector errors are valid only on platforms supporting error vectors */
>> +	if (is_gt_vector_error(config) && xe->info.platform != XE_PVC)
>> +		return -ENODEV;
>> +
>> +	/* Skip gt errors not supported on pvc */
>> +	if (is_pvc_invalid_gt_errors(config) && xe->info.platform == XE_PVC)
>> +		return -ENODEV;
>> +
>> +	/* FATAL FPU error is valid on PVC only */
>> +	if (config_counter(config) == XE_GENL_GT_ERROR_FATAL_FPU &&
>> +	    !(xe->info.platform == XE_PVC))
>> +		return -ENODEV;
>> +
>> +	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
>> +		return -ENODEV;
>> +
>> +	return (config_counter(config) >=
>> +			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
>> +}
>> +
>> +static u64 get_counter_value(struct xe_device *xe, u64 config)
>> +{
>> +	const unsigned int gt_id = config_gt_id(config);
>> +	struct xe_gt *gt = xe_device_get_gt(xe, gt_id);
>> +	unsigned int id = config_counter(config);
>> +
>> +	if (is_gt_error(config) || is_gt_vector_error(config))
>> +		return xa_to_value(xa_load(&gt->errors.hw_error, xe_hw_error_map[id]));
>> +
>> +	return xa_to_value(xa_load(&gt->tile->errors.hw_error, xe_hw_error_map[id]));
>> +}
>> +
>> +int fill_error_details(struct xe_device *xe, struct genl_info *info, struct sk_buff *new_msg)
> Should it be static?
Yes, it can be static. I will change it.
>
>> +{
>> +	struct nlattr *entry_attr;
>> +	bool counter = false;
>> +	struct xe_gt *gt;
>> +	int i, j;
>> +
>> +	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
>> +		     ARRAY_SIZE(xe_hw_error_map));
>> +
>> +	if (info->genlhdr->cmd == DRM_RAS_CMD_READ_ALL)
>> +		counter = true;
>> +
>> +	entry_attr = nla_nest_start(new_msg, DRM_RAS_ATTR_QUERY_REPLY);
>> +	if (!entry_attr)
>> +		return -EMSGSIZE;
>> +
>> +	for_each_gt(gt, xe, j) {
>> +		char str[MAX_ERROR_NAME];
>> +		u64 val;
>> +
>> +		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
>> +			u64 config = XE_HW_ERROR(j, i);
>> +
>> +			if (config_status(xe, config))
>> +				continue;
>> +
>> +			/* should this be cleared everytime */
>> +			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
>> +
>> +			if (nla_put_string(new_msg, DRM_RAS_ATTR_ERROR_NAME, str))
>> +				goto err;
>> +			if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
>> +				goto err;
>> +			if (counter) {
>> +				val = get_counter_value(xe, config);
>> +				if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD))
>> +					goto err;
>> +			}
>> +		}
>> +	}
>> +
>> +	nla_nest_end(new_msg, entry_attr);
>> +
>>   	return 0;
>> +err:
>> +	drm_dbg_driver(&xe->drm, "msg buff is small\n");
>> +	nla_nest_cancel(new_msg, entry_attr);
>> +	nlmsg_free(new_msg);
>> +
>> +	return -EMSGSIZE;
>> +}
>> +
>> +static int xe_genl_list_errors(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	struct xe_device *xe = to_xe_device(drm);
>> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
>> +	struct sk_buff *new_msg;
>> +	int retries = 2;
>> +	void *usrhdr;
>> +	int ret = 0;
>> +
>> +	if (!IS_DGFX(xe))
>> +		return -ENODEV;
>> +
>> +	do {
>> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>> +		if (!new_msg)
>> +			return -ENOMEM;
>> +
>> +		ret = fill_error_details(xe, info, new_msg);
>> +		if (!ret)
>> +			break;
>> +
>> +		msg_size += NLMSG_DEFAULT_SIZE;
>> +	} while (retries--);
>> +
>> +	if (!ret)
>> +		ret = drm_genl_reply(new_msg, info, usrhdr);
>> +
>> +	return ret;
>>   }
>>   
>>   static int xe_genl_read_error(struct drm_device *drm, struct sk_buff *msg, struct genl_info *info)
>>   {
>> -	return 0;
>> +	struct xe_device *xe = to_xe_device(drm);
>> +	size_t msg_size = NLMSG_DEFAULT_SIZE;
>> +	struct sk_buff *new_msg;
>> +	void *usrhdr;
>> +	int ret = 0;
>> +	int retries = 2;
>> +	u64 config, val;
>> +
>> +	config = nla_get_u64(info->attrs[DRM_RAS_ATTR_ERROR_ID]);
>> +	ret = config_status(xe, config);
>> +	if (ret)
>> +		return ret;
>> +	do {
>> +		new_msg = drm_genl_alloc_msg(drm, info, msg_size, &usrhdr);
>> +		if (!new_msg)
>> +			return -ENOMEM;
>> +
>> +		val = get_counter_value(xe, config);
>> +		if (nla_put_u64_64bit(new_msg, DRM_RAS_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
>> +			msg_size += NLMSG_DEFAULT_SIZE;
>> +			continue;
>> +		}
> Here ERROR_ID is provided and ERROR_VALUE is returned, but maybe we can
> also return ERROR_NAME for the "full picture"?
> Or do you think that a regular flow would be first listing all errors, 
> grep the name of the required error, and use its id to get the value, so 
> userspace already has the name?
Yes, that is the flow I imagine userspace would use: to get the error ID,
one first has to query, and the response carries both the name and the ID.
This would also be the flow with the new design suggested by Lijo.
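
The query-then-read flow described above can be sketched from the userspace
side. This is only an illustration under stated assumptions: the names
prefixed with sketch_ are hypothetical, SKETCH_GT_SHIFT stands in for
__XE_PMU_GT_SHIFT (whose real value comes from xe_drm.h), and the table is a
hand-built stand-in for the name/ID pairs a DRM_RAS_CMD_QUERY reply would
carry. No netlink traffic is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: stand-in for __XE_PMU_GT_SHIFT; the real value is in xe_drm.h. */
#define SKETCH_GT_SHIFT 56

/* Same packing as XE_HW_ERROR(gt, id) in the quoted patch. */
static uint64_t sketch_hw_error(unsigned int gt, uint64_t id)
{
	assert(id < (1ULL << SKETCH_GT_SHIFT));
	return id | ((uint64_t)gt << SKETCH_GT_SHIFT);
}

/* Same unpacking as config_gt_id() in the quoted patch. */
static unsigned int sketch_gt_id(uint64_t config)
{
	return (unsigned int)(config >> SKETCH_GT_SHIFT);
}

/* Same unpacking as config_counter() in the quoted patch. */
static uint64_t sketch_counter(uint64_t config)
{
	return config & ~(~0ULL << SKETCH_GT_SHIFT);
}

/* Hypothetical userspace view of one entry from a query reply. */
struct sketch_entry {
	const char *name;
	uint64_t id;
};

/* Look up the config ID for a name; returns 0 on success, -1 if absent. */
static int sketch_find_id(const struct sketch_entry *tbl, size_t n,
			  const char *name, uint64_t *id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(tbl[i].name, name) == 0) {
			*id = tbl[i].id;
			return 0;
		}
	}
	return -1;
}
```

For a table that maps "error-gt1-fatal-guc" to XE_HW_ERROR(1, 9), the lookup
returns an ID that splits back into GT 1 and counter 9
(XE_GENL_GT_ERROR_FATAL_GUC). That ID is what userspace would then pass in
DRM_RAS_ATTR_ERROR_ID with DRM_RAS_CMD_READ_ONE.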
>
>> +
>> +		break;
>> +	} while (retries--);
> Is it really possible that NLMSG_DEFAULT_SIZE won't be enough for a
> single counter read?

It should be enough. I added the fallback just in case it fails, but I do
not think that is a real possibility.
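
For reference, the grow-and-retry fallback can be sketched in plain C,
outside the kernel. Everything here is an assumption made for illustration:
malloc()/free() stand in for the message allocation, SKETCH_STEP stands in
for NLMSG_DEFAULT_SIZE, and sketch_fill() stands in for the nla_put call.
The sketch frees the undersized buffer before each retry and returns NULL
once the attempts run out, so the caller never replies with an unfilled
buffer. Whether the loop in the patch needs the same handling is not
something this sketch settles.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_STEP 64     /* hypothetical stand-in for NLMSG_DEFAULT_SIZE */
#define SKETCH_ATTEMPTS 3  /* retries = 2 in a do/while gives three passes */

/* Stand-in for filling the message: fails when the payload does not fit. */
static int sketch_fill(size_t cap, size_t need)
{
	return need <= cap ? 0 : -1;
}

/*
 * Allocate a buffer, try to fill it, and grow it by one step on failure.
 * Returns the filled buffer (caller frees) or NULL if allocation fails
 * or the payload still does not fit after SKETCH_ATTEMPTS passes.
 */
static void *sketch_build_reply(size_t need, size_t *size_out)
{
	size_t size = SKETCH_STEP;
	int attempt;

	for (attempt = 0; attempt < SKETCH_ATTEMPTS; attempt++) {
		void *buf = malloc(size);

		if (!buf)
			return NULL;
		if (sketch_fill(size, need) == 0) {
			*size_out = size;
			return buf;
		}
		/* Too small: release it before trying a larger one. */
		free(buf);
		size += SKETCH_STEP;
	}
	return NULL;
}
```

With these stand-in values, a 100-byte payload fits on the second pass
(128 bytes), while a 1000-byte payload fails all three passes and yields
NULL.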

Thanks,
Aravind.
>
> Thanks,
> Tomer
>
>> +
>> +	ret = drm_genl_reply(new_msg, info, usrhdr);
>> +
>> +	return ret;
>>   }
>>   
>>   /* driver callbacks to DRM netlink commands*/
>> diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
>> index 60cc6418d9a7..dbb3f1afba5f 100644
>> --- a/include/uapi/drm/xe_drm.h
>> +++ b/include/uapi/drm/xe_drm.h
>> @@ -1087,6 +1087,87 @@ struct drm_xe_vm_madvise {
>>   #define XE_PMU_MEDIA_GROUP_BUSY(gt)		___XE_PMU_OTHER(gt, 3)
>>   #define XE_PMU_ANY_ENGINE_GROUP_BUSY(gt)	___XE_PMU_OTHER(gt, 4)
>>   
>> +/**
>> + * DOC: XE GENL netlink event IDs
>> + * TODO: Add more details
>> + */
>> +#define XE_HW_ERROR(gt, id) \
>> +	((id) | ((__u64)(gt) << __XE_PMU_GT_SHIFT))
>> +
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_L3_SNG		(0)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_GUC		(1)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_SAMPLER		(2)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_SLM		(3)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_EU_IC		(4)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_EU_GRF		(5)
>> +#define XE_GENL_GT_ERROR_FATAL_ARR_BIST			(6)
>> +#define XE_GENL_GT_ERROR_FATAL_L3_DOUB			(7)
>> +#define XE_GENL_GT_ERROR_FATAL_L3_ECC_CHK		(8)
>> +#define XE_GENL_GT_ERROR_FATAL_GUC			(9)
>> +#define XE_GENL_GT_ERROR_FATAL_IDI_PAR			(10)
>> +#define XE_GENL_GT_ERROR_FATAL_SQIDI			(11)
>> +#define XE_GENL_GT_ERROR_FATAL_SAMPLER			(12)
>> +#define XE_GENL_GT_ERROR_FATAL_SLM			(13)
>> +#define XE_GENL_GT_ERROR_FATAL_EU_IC			(14)
>> +#define XE_GENL_GT_ERROR_FATAL_EU_GRF			(15)
>> +#define XE_GENL_GT_ERROR_FATAL_FPU			(16)
>> +#define XE_GENL_GT_ERROR_FATAL_TLB			(17)
>> +#define XE_GENL_GT_ERROR_FATAL_L3_FABRIC		(18)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_SUBSLICE		(19)
>> +#define XE_GENL_GT_ERROR_CORRECTABLE_L3BANK		(20)
>> +#define XE_GENL_GT_ERROR_FATAL_SUBSLICE			(21)
>> +#define XE_GENL_GT_ERROR_FATAL_L3BANK			(22)
>> +#define XE_GENL_SGUNIT_ERROR_CORRECTABLE		(23)
>> +#define XE_GENL_SGUNIT_ERROR_NONFATAL			(24)
>> +#define XE_GENL_SGUNIT_ERROR_FATAL			(25)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMD		(26)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_CMP		(27)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_CSC_PSF_REQ		(28)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_ANR_MDFI		(29)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2T		(30)
>> +#define XE_GENL_SOC_ERROR_NONFATAL_MDFI_T2C		(31)
>> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMD		(32)
>> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_CMP		(33)
>> +#define XE_GENL_SOC_ERROR_FATAL_CSC_PSF_REQ		(34)
>> +#define XE_GENL_SOC_ERROR_FATAL_PUNIT			(35)
>> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMD			(36)
>> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_CMP			(37)
>> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_PSF_REQ			(38)
>> +#define XE_GENL_SOC_ERROR_FATAL_ANR_MDFI		(39)
>> +#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2T		(40)
>> +#define XE_GENL_SOC_ERROR_FATAL_MDFI_T2C		(41)
>> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_AER		(42)
>> +#define XE_GENL_SOC_ERROR_FATAL_PCIE_ERR		(43)
>> +#define XE_GENL_SOC_ERROR_FATAL_UR_COND			(44)
>> +#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS		(45)
>> +
>> +#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n)\
>> +		(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
>> +#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n)\
>> +		(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))
>> +
>> +/* 109 is the last ID used by SOC errors */
>> +#define XE_GENL_GSC_ERROR_CORRECTABLE_SRAM_ECC		(110)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_MIA_SHUTDOWN		(111)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_MIA_INTERNAL		(112)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_SRAM_ECC		(113)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_WDG_TIMEOUT		(114)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_ROM_PARITY		(115)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_UCODE_PARITY		(116)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_VLT_GLITCH		(117)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_PULL		(118)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(119)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_SELF_MBIST		(120)
>> +#define XE_GENL_GSC_ERROR_NONFATAL_AON_RF_PARITY	(121)
>> +#define XE_GENL_SGGI_ERROR_NONFATAL			(122)
>> +#define XE_GENL_SGLI_ERROR_NONFATAL			(123)
>> +#define XE_GENL_SGCI_ERROR_NONFATAL			(124)
>> +#define XE_GENL_MERT_ERROR_NONFATAL			(125)
>> +#define XE_GENL_SGGI_ERROR_FATAL			(126)
>> +#define XE_GENL_SGLI_ERROR_FATAL			(127)
>> +#define XE_GENL_SGCI_ERROR_FATAL			(128)
>> +#define XE_GENL_MERT_ERROR_FATAL			(129)
>> +
>>   #if defined(__cplusplus)
>>   }
>>   #endif
>
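
As a worked check of the ID arithmetic in the quoted header, the sketch below re-derives the HBM ranges and confirms the "109 is the last ID used by SOC errors" comment. It assumes `__XE_PMU_GT_SHIFT` is 56, which is not shown in this hunk, and the two decode helpers (`xe_error_gt`, `xe_error_id`) are hypothetical names added only for illustration.

```c
#include <stdint.h>

/* Assumption: shift value taken from xe_drm.h of that period; not part of this hunk. */
#define __XE_PMU_GT_SHIFT 56

/* Same shape as the quoted macro, with uint64_t standing in for __u64. */
#define XE_HW_ERROR(gt, id) \
	((id) | ((uint64_t)(gt) << __XE_PMU_GT_SHIFT))

#define XE_GENL_SOC_ERROR_FATAL_SERR_SRCS 45

/* Copied from the quoted header: 16 IDs per value of ss. */
#define XE_GENL_SOC_ERROR_NONFATAL_HBM(ss, n) \
	(XE_GENL_SOC_ERROR_FATAL_SERR_SRCS + 0x1 + (ss) * 0x10 + (n))
#define XE_GENL_SOC_ERROR_FATAL_HBM(ss, n) \
	(XE_GENL_SOC_ERROR_NONFATAL_HBM(1, 15) + 0x1 + (ss) * 0x10 + (n))

/* Hypothetical helper: recover the GT index from a composed event ID. */
static inline unsigned int xe_error_gt(uint64_t config)
{
	return (unsigned int)(config >> __XE_PMU_GT_SHIFT);
}

/* Hypothetical helper: recover the per-GT error ID from a composed event ID. */
static inline uint64_t xe_error_id(uint64_t config)
{
	return config & ((UINT64_C(1) << __XE_PMU_GT_SHIFT) - 1);
}
```

With these definitions, NONFATAL_HBM spans 46..77 and FATAL_HBM spans 78..109, which suggests ss ranges over 0..1 and n over 0..15, and leaves 110 free for the first GSC error ID.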


* Re: [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-11-12 15:28     ` Tomer Tayar
@ 2023-11-22 14:34       ` Aravind Iddamsetty
  0 siblings, 0 replies; 31+ messages in thread
From: Aravind Iddamsetty @ 2023-11-22 14:34 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel, alexander.deucher, airlied,
	daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J


On 11/12/23 20:58, Tomer Tayar wrote:
> On 10/11/2023 14:27, Tomer Tayar wrote:
>> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>>> Whenever a correctable or an uncorrectable error happens an event is sent
>>> to the corresponding listeners of these groups.
>>>
>>> v2: Rebase
>>>
>>> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/xe/xe_hw_error.c | 33 ++++++++++++++++++++++++++++++++
>>>    1 file changed, 33 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_hw_error.c b/drivers/gpu/drm/xe/xe_hw_error.c
>>> index bab6d4cf0b69..b0befb5e01cb 100644
>>> --- a/drivers/gpu/drm/xe/xe_hw_error.c
>>> +++ b/drivers/gpu/drm/xe/xe_hw_error.c
>>> @@ -786,6 +786,37 @@ xe_soc_hw_error_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>    				(HARDWARE_ERROR_MAX << 1) + 1);
>>>    }
>>>    
>>> +static void
>>> +generate_netlink_event(struct xe_device *xe, const enum hardware_error hw_err)
>>> +{
>>> +	struct sk_buff *msg;
>>> +	void *hdr;
>>> +
>>> +	if (!xe->drm.drm_genl_family.module)
>>> +		return;
>>> +
>>> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
>>> +	if (!msg) {
>>> +		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
>>> +		return;
>>> +	}
>>> +
>>> +	hdr = genlmsg_put(msg, 0, 0, &xe->drm.drm_genl_family, 0, DRM_RAS_CMD_ERROR_EVENT);
>>> +	if (!hdr) {
>>> +		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
>>> +		nlmsg_free(msg);
>>> +		return;
>>> +	}
>>> +
>>> +	genlmsg_end(msg, hdr);
>>> +
>>> +	genlmsg_multicast(&xe->drm.drm_genl_family, msg, 0,
>>> +			  hw_err ?
>>> +			  DRM_GENL_MCAST_UNCORR_ERR
>>> +			  : DRM_GENL_MCAST_CORR_ERR,
>>> +			  GFP_ATOMIC);
>> I agree that hiding/wrapping any netlink/genetlink API/macro with a DRM
>> helper would be sometimes redundant,
>> and that in some cases the specific DRM driver would have to "dirty its
>> hands" and deal with netlink (e.g. fill_error_details() in patch #3).
>> However maybe here a DRM helper would have been useful, so we won't see
>> a copy of this sequence in other DRM drivers?
>>
>> Thanks,
>> Tomer
> After rethinking, it is possible that different DRM drivers will need 
> some flexibility when it comes to calling genlmsg_put(), as they might 
> want to have more of this call in order to attach some data related to 
> the error indication.
> In that case, adding a DRM function that wraps it may be redundant.
> What do you think?
I think we can expose this base-level call to every DRM driver, and a driver
that wants to add a custom message can define its own helper. That should be OK, I believe.


Thanks,
Aravind.
>
>>> +}
>>> +
>>>    static void
>>>    xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_err)
>>>    {
>>> @@ -849,6 +880,8 @@ xe_hw_error_source_handler(struct xe_tile *tile, const enum hardware_error hw_er
>>>    	}
>>>    
>>>    	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err), errsrc);
>>> +
>>> +	generate_netlink_event(tile_to_xe(tile), hw_err);
>>>    unlock:
>>>    	spin_unlock_irqrestore(&tile_to_xe(tile)->irq.lock, flags);
>>>    }
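
The layering discussed in this reply (a common base-level call, plus an optional per-driver step that attaches extra data) can be modelled in plain userspace C. This is only a sketch of the control flow, not kernel code: every `model_*` name is hypothetical, the payload buffer stands in for the netlink message, and the `enum hardware_error` values are an assumption about the xe driver in which 0 means correctable, consistent with the `hw_err ?` test in the patch.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: mirrors xe's enum hardware_error, where 0 is the correctable class. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Hypothetical stand-ins for DRM_GENL_MCAST_CORR_ERR / DRM_GENL_MCAST_UNCORR_ERR. */
enum model_mcast_group {
	MODEL_MCAST_CORR_ERR,
	MODEL_MCAST_UNCORR_ERR,
};

/* Hypothetical stand-in for the multicast message being built. */
struct model_event {
	enum model_mcast_group group;
	size_t payload_len;
	unsigned char payload[64];
};

/* Optional driver hook that appends driver-specific data to the event. */
typedef int (*model_fill_fn)(struct model_event *ev, void *driver_data);

/* Base-level call: selects the group, then lets the driver extend the event. */
static int model_send_error_event(struct model_event *ev,
				  enum hardware_error hw_err,
				  model_fill_fn fill, void *driver_data)
{
	memset(ev, 0, sizeof(*ev));
	ev->group = hw_err == HARDWARE_ERROR_CORRECTABLE ?
		    MODEL_MCAST_CORR_ERR : MODEL_MCAST_UNCORR_ERR;

	if (fill)
		return fill(ev, driver_data);

	return 0;
}

/* Example driver hook: attaches a 64-bit error ID as the payload. */
static int model_fill_error_id(struct model_event *ev, void *driver_data)
{
	unsigned long long id = *(unsigned long long *)driver_data;

	memcpy(ev->payload, &id, sizeof(id));
	ev->payload_len = sizeof(id);
	return 0;
}
```

A driver with nothing to add passes a NULL hook and gets the bare event, which corresponds to what generate_netlink_event() sends in this patch. A driver that wants to attach error details supplies its own hook instead of copying the whole sequence.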


* Re: [RFC v4 1/5] drm/netlink: Add netlink infrastructure
  2023-11-22 14:32     ` Aravind Iddamsetty
@ 2023-11-23  7:26       ` Tomer Tayar
  0 siblings, 0 replies; 31+ messages in thread
From: Tomer Tayar @ 2023-11-23  7:26 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov, Ruhl,
	Michael J

On 22/11/2023 16:32, Aravind Iddamsetty wrote:
> On 11/10/23 17:54, Tomer Tayar wrote:
>> On 20/10/2023 18:58, Aravind Iddamsetty wrote:
>>> Define the netlink registration interface and commands, attributes that
>>> can be commonly used across by drm drivers. This patch intends to use
>>> the generic netlink family to expose various stats of device. At present
>>> it defines some commands that shall be used to expose RAS error counters.
>>>
>>> v2:
>>> define common interfaces to genl netlink subsystem that all drm drivers
>>> can leverage.(Tomer Tayar)
>>>
>>> v3: drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
>>> register to netlink subsystem (Daniel Vetter)
>>>
>>> v4:(Michael J. Ruhl)
>>> 1. rename drm_genl_send to drm_genl_reply
>>> 2. catch error from xa_store and handle appropriately
>>>
>>> Cc: Tomer Tayar<ttayar@habana.ai>
>>> Cc: Daniel Vetter<daniel@ffwll.ch>
>>> Cc: Michael J. Ruhl<michael.j.ruhl@intel.com>
>>>
>>> Signed-off-by: Aravind Iddamsetty<aravind.iddamsetty@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/Makefile       |   1 +
>>>    drivers/gpu/drm/drm_drv.c      |   7 ++
>>>    drivers/gpu/drm/drm_netlink.c  | 188 +++++++++++++++++++++++++++++++++
>>>    include/drm/drm_device.h       |   8 ++
>>>    include/drm/drm_drv.h          |   7 ++
>>>    include/drm/drm_netlink.h      |  30 ++++++
>>>    include/uapi/drm/drm_netlink.h |  83 +++++++++++++++
>>>    7 files changed, 324 insertions(+)
>>>    create mode 100644 drivers/gpu/drm/drm_netlink.c
>>>    create mode 100644 include/drm/drm_netlink.h
>>>    create mode 100644 include/uapi/drm/drm_netlink.h
>>>
>>> diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
>>> index ee64c51274ad..60864369adaa 100644
>>> --- a/drivers/gpu/drm/Makefile
>>> +++ b/drivers/gpu/drm/Makefile
>>> @@ -35,6 +35,7 @@ drm-y := \
>>>    	drm_mode_object.o \
>>>    	drm_modes.o \
>>>    	drm_modeset_lock.o \
>>> +	drm_netlink.o \
>>>    	drm_plane.o \
>>>    	drm_prime.o \
>>>    	drm_print.o \
>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 535f16e7882e..31f55c1c7524 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -937,6 +937,12 @@ int drm_dev_register(struct drm_device *dev, unsigned long flags)
>>>    	if (ret)
>>>    		goto err_minors;
>>>    
>>> +	if (driver->genl_ops) {
>>> +		ret = drm_genl_register(dev);
>>> +		if (ret)
>>> +			goto err_minors;
>>> +	}
>>> +
>>>    	ret = create_compat_control_link(dev);
>>>    	if (ret)
>>>    		goto err_minors;
>>> @@ -1074,6 +1080,7 @@ static void drm_core_exit(void)
>>>    {
>>>    	drm_privacy_screen_lookup_exit();
>>>    	accel_core_exit();
>>> +	drm_genl_exit();
>>>    	unregister_chrdev(DRM_MAJOR, "drm");
>>>    	debugfs_remove(drm_debugfs_root);
>>>    	drm_sysfs_destroy();
>>> diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>>> new file mode 100644
>>> index 000000000000..8add249c1da3
>>> --- /dev/null
>>> +++ b/drivers/gpu/drm/drm_netlink.c
>>> @@ -0,0 +1,188 @@
>>> +// SPDX-License-Identifier: MIT
>>> +/*
>>> + * Copyright © 2023 Intel Corporation
>>> + */
>>> +
>>> +#include <drm/drm_device.h>
>>> +#include <drm/drm_drv.h>
>>> +#include <drm/drm_file.h>
>>> +#include <drm/drm_managed.h>
>>> +#include <drm/drm_netlink.h>
>>> +#include <drm/drm_print.h>
>>> +
>>> +DEFINE_XARRAY(drm_dev_xarray);
>>> +
>>> +/**
>>> + * drm_genl_reply - response to a request
>>> + * @msg: socket buffer
>>> + * @info: receiver information
>>> + * @usrhdr: pointer to user specific header in the message buffer
>>> + *
>>> + * RETURNS:
>>> + * 0 on success and negative error code on failure
>>> + */
>>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
>>> +{
>>> +	int ret;
>>> +
>>> +	genlmsg_end(msg, usrhdr);
>>> +
>>> +	ret = genlmsg_reply(msg, info);
>>> +	if (ret)
>>> +		nlmsg_free(msg);
>>> +
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL(drm_genl_reply);
>>> +
>>> +/**
>>> + * drm_genl_alloc_msg - allocate genl message buffer
>>> + * @dev: drm_device for which the message is being allocated
>>> + * @info: receiver information
>> a description for msg_size is missing
> Thanks for catching it, will add.
>>> + * @usrhdr: pointer to user specific header in the message buffer
>>> + *
>>> + * RETURNS:
>>> + * pointer to new allocated buffer on success, NULL on failure
>>> + */
>>> +struct sk_buff *
>>> +drm_genl_alloc_msg(struct drm_device *dev,
>>> +		   struct genl_info *info,
>>> +		   size_t msg_size, void **usrhdr)
>>> +{
>>> +	struct sk_buff *new_msg;
>>> +
>>> +	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
>>> +	if (!new_msg)
>>> +		return new_msg;
>>> +
>>> +	*usrhdr = genlmsg_put_reply(new_msg, info, &dev->drm_genl_family, 0, info->genlhdr->cmd);
>>> +	if (!*usrhdr) {
>>> +		nlmsg_free(new_msg);
>>> +		new_msg = NULL;
>>> +	}
>>> +
>>> +	return new_msg;
>>> +}
>>> +EXPORT_SYMBOL(drm_genl_alloc_msg);
>>> +
>>> +static struct drm_device *genl_to_dev(struct genl_info *info)
>>> +{
>>> +	return xa_load(&drm_dev_xarray, info->nlhdr->nlmsg_type);
>>> +}
>>> +
>>> +static int drm_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>>> +{
>>> +	struct drm_device *dev = genl_to_dev(info);
>>> +
>>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_REQUEST))
>>> +		return -EINVAL;
>>> +
>>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>>> +}
>>> +
>>> +static int drm_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>>> +{
>>> +	struct drm_device *dev = genl_to_dev(info);
>>> +
>>> +	if (GENL_REQ_ATTR_CHECK(info, DRM_RAS_ATTR_ERROR_ID))
>>> +		return -EINVAL;
>>> +
>>> +	if (WARN_ON(!dev->driver->genl_ops[info->genlhdr->cmd].doit))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	return dev->driver->genl_ops[info->genlhdr->cmd].doit(dev, msg, info);
>>> +}
>>> +
>>> +/* attribute policies */
>>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_RAS_ATTR_REQUEST] = { .type = NLA_U8 },
>>> +};
>>> +
>>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_RAS_ATTR_ERROR_ID] = { .type = NLA_U64 },
>>> +};
>>> +
>>> +/* drm genl operations definition */
>>> +const struct genl_ops drm_genl_ops[] = {
>>> +	{
>>> +		.cmd = DRM_RAS_CMD_QUERY,
>>> +		.doit = drm_genl_list_errors,
>>> +		.policy = drm_attr_policy_query,
>>> +	},
>>> +	{
>>> +		.cmd = DRM_RAS_CMD_READ_ONE,
>>> +		.doit = drm_genl_read_error,
>>> +		.policy = drm_attr_policy_read_one,
>>> +	},
>>> +	{
>>> +		.cmd = DRM_RAS_CMD_READ_ALL,
>>> +		.doit = drm_genl_list_errors,
>>> +		.policy = drm_attr_policy_query,
>>> +	},
>>> +};
>>> +
>>> +static void drm_genl_family_init(struct drm_device *dev)
>>> +{
>>> +	/* Use drm primary node name eg: card0 to name the genl family */
>>> +	snprintf(dev->drm_genl_family.name, sizeof(dev->drm_genl_family.name), "%s", dev->primary->kdev->kobj.name);
>> dev_name() can be used.
>> Also, what about accel? Maybe check dev->primary and use primary/accel
>> accordingly?
> The present series adds this feature for the primary device only, and I have
> no knowledge of how it will be used for an accel device. So whoever starts
> using this infra for an accel device should make that particular change. Or do
> you think it should be added as part of this series?

I think that accel is considered a part of the drm subsystem, so we can 
refer to all minor types when adding a general drm feature.
But I understand your argument and if you prefer to postpone it until it 
is used for some accel device then no problem.

Thanks,
Tomer

>
>>> +	dev->drm_genl_family.version = DRM_GENL_VERSION;
>>> +	dev->drm_genl_family.parallel_ops = true;
>>> +	dev->drm_genl_family.ops = drm_genl_ops;
>>> +	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
>>> +	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>>> +	dev->drm_genl_family.module = dev->dev->driver->owner;
>>> +}
>>> +
>>> +static void drm_genl_deregister(struct drm_device *dev,  void *arg)
>> Redundant space before "void *arg"
> will clean it.
>>> +{
>>> +	drm_dbg_driver(dev, "unregistering genl family %s\n", dev->drm_genl_family.name);
>>> +
>>> +	xa_erase(&drm_dev_xarray, dev->drm_genl_family.id);
>>> +
>>> +	genl_unregister_family(&dev->drm_genl_family);
>>> +}
>>> +
>>> +/**
>>> + * drm_genl_register - Register genl family
>>> + * @dev: drm_device for which genl family needs to be registered
>>> + *
>>> + * RETURNS:
>>> + * 0 on success and negative error code on failure
>>> + */
>>> +int drm_genl_register(struct drm_device *dev)
>>> +{
>>> +	int ret;
>>> +
>>> +	drm_genl_family_init(dev);
>>> +
>>> +	ret = genl_register_family(&dev->drm_genl_family);
>>> +	if (ret < 0) {
>>> +		drm_warn(dev, "genl family registration failed\n");
>>> +		return ret;
>>> +	}
>>> +
>>> +	drm_dbg_driver(dev, "genl family id %d and name %s\n", dev->drm_genl_family.id, dev->drm_genl_family.name);
>>> +
>>> +	ret = xa_err(xa_store(&drm_dev_xarray, dev->drm_genl_family.id, dev, GFP_KERNEL));
>>> +	if (ret)
>>> +		goto genl_unregister;
>>> +
>>> +	ret = drmm_add_action_or_reset(dev, drm_genl_deregister, NULL);
>>> +
>>> +	return ret;
>>> +
>>> +genl_unregister:
>>> +	genl_unregister_family(&dev->drm_genl_family);
>>> +	return ret;
>>> +}
>>> +
>>> +/**
>>> + * drm_genl_exit: destroy drm_dev_xarray
>>> + */
>>> +void drm_genl_exit(void)
>>> +{
>>> +	xa_destroy(&drm_dev_xarray);
>>> +}
>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>>> index c490977ee250..d3ae91b7714d 100644
>>> --- a/include/drm/drm_device.h
>>> +++ b/include/drm/drm_device.h
>>> @@ -8,6 +8,7 @@
>>>    
>>>    #include <drm/drm_legacy.h>
>>>    #include <drm/drm_mode_config.h>
>>> +#include <drm/drm_netlink.h>
>>>    
>>>    struct drm_driver;
>>>    struct drm_minor;
>>> @@ -318,6 +319,13 @@ struct drm_device {
>>>    	 */
>>>    	struct dentry *debugfs_root;
>>>    
>>> +	/**
>>> +	 * @drm_genl_family:
>>> +	 *
>>> +	 * Generic netlink family registration structure.
>>> +	 */
>>> +	struct genl_family drm_genl_family;
>>> +
>>>    	/* Everything below here is for legacy driver, never use! */
>>>    	/* private: */
>>>    #if IS_ENABLED(CONFIG_DRM_LEGACY)
>>> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
>>> index e2640dc64e08..ebdb7850d235 100644
>>> --- a/include/drm/drm_drv.h
>>> +++ b/include/drm/drm_drv.h
>>> @@ -434,6 +434,13 @@ struct drm_driver {
>>>    	 */
>>>    	const struct file_operations *fops;
>>>    
>>> +	/**
>>> +	 * @genl_ops:
>>> +	 *
>>> +	 * Drivers private callback to genl commands
>>> +	 */
>>> +	const struct driver_genl_ops *genl_ops;
>>> +
>>>    #ifdef CONFIG_DRM_LEGACY
>>>    	/* Everything below here is for legacy driver, never use! */
>>>    	/* private: */
>>> diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>>> new file mode 100644
>>> index 000000000000..54527dae7847
>>> --- /dev/null
>>> +++ b/include/drm/drm_netlink.h
>>> @@ -0,0 +1,30 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2023 Intel Corporation
>>> + */
>>> +
>>> +#ifndef __DRM_NETLINK_H__
>>> +#define __DRM_NETLINK_H__
>>> +
>>> +#include <linux/netdevice.h>
>>> +#include <net/genetlink.h>
>>> +#include <net/sock.h>
>>> +#include <uapi/drm/drm_netlink.h>
>>> +
>>> +struct drm_device;
>>> +
>>> +struct driver_genl_ops {
>>> +	int		       (*doit)(struct drm_device *dev,
>>> +				       struct sk_buff *skb,
>> The skb parameter is currently not used (both xe_genl_list_errors() and
>> xe_genl_read_error() allocate a new skb).
>> Did you add because it might be needed for future ops?
> Well, I wanted to pass on the details that the netlink subsystem sends and leave it
> to the driver to decide whether it wants to use them.
>>> +				       struct genl_info *info);
>>> +};
>>> +
>>> +int drm_genl_register(struct drm_device *dev);
>>> +void drm_genl_exit(void);
>>> +int drm_genl_reply(struct sk_buff *msg, struct genl_info *info, void *usrhdr);
>>> +struct sk_buff *
>>> +drm_genl_alloc_msg(struct drm_device *dev,
>>> +		   struct genl_info *info,
>>> +		   size_t msg_size, void **usrhdr);
>>> +#endif
>>> +
>>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>>> new file mode 100644
>>> index 000000000000..aab42147a20e
>>> --- /dev/null
>>> +++ b/include/uapi/drm/drm_netlink.h
>>> @@ -0,0 +1,83 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright 2023 Intel Corporation
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>>> + * OTHER DEALINGS IN THE SOFTWARE.
>>> + */
>>> +
>>> +#ifndef _DRM_NETLINK_H_
>>> +#define _DRM_NETLINK_H_
>>> +
>>> +#define DRM_GENL_VERSION 1
>>> +
>>> +#if defined(__cplusplus)
>>> +extern "C" {
>>> +#endif
>>> +
>>> +/**
>>> + * enum drm_genl_error_cmds - Supported error commands
>>> + *
>>> + */
>>> +enum drm_genl_error_cmds {
>>> +	DRM_CMD_UNSPEC,
>>> +	/** @DRM_RAS_CMD_QUERY: Command to list all errors names with config-id */
>>> +	DRM_RAS_CMD_QUERY,
>>> +	/** @DRM_RAS_CMD_READ_ONE: Command to get a counter for a specific error */
>>> +	DRM_RAS_CMD_READ_ONE,
>>> +	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
>>> +	DRM_RAS_CMD_READ_ALL,
>>> +
>>> +	__DRM_CMD_MAX,
>>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>>> +};
>>> +
>>> +/**
>>> + * enum drm_error_attr - Attributes to use with drm_genl_error_cmds
>>> + *
>>> + */
>>> +enum drm_error_attr {
>>> +	DRM_ATTR_UNSPEC,
>>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>>> +	/**
>>> +	 * @DRM_RAS_ATTR_REQUEST: Should be used with DRM_RAS_CMD_QUERY,
>>> +	 * DRM_RAS_CMD_READ_ALL
>>> +	 */
>>> +	DRM_RAS_ATTR_REQUEST, /* NLA_U8 */
>>> +	/**
>>> +	 * @DRM_RAS_ATTR_QUERY_REPLY: First nested attribute sent as a
>>> +	 * response to DRM_RAS_CMD_QUERY, DRM_RAS_CMD_READ_ALL commands.
>>> +	 */
>>> +	DRM_RAS_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>> Maybe a space before and after NLA_NESTED?
> right missed that.
>
> Thanks,
> Aravind.
>> Thanks,
>> Tomer
>>
>>> +	/** @DRM_RAS_ATTR_ERROR_NAME: Used to pass error name */
>>> +	DRM_RAS_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>>> +	/** @DRM_RAS_ATTR_ERROR_ID: Used to pass error id */
>>> +	DRM_RAS_ATTR_ERROR_ID, /* NLA_U64 */
>>> +	/** @DRM_RAS_ATTR_ERROR_VALUE: Used to pass error value */
>>> +	DRM_RAS_ATTR_ERROR_VALUE, /* NLA_U64 */
>>> +
>>> +	__DRM_ATTR_MAX,
>>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>>> +};
>>> +
>>> +#if defined(__cplusplus)
>>> +}
>>> +#endif
>>> +
>>> +#endif
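
To make the UAPI above concrete from the userspace side, here is a minimal sketch of how a DRM_RAS_CMD_READ_ONE request could be laid out in memory: a netlink header, a generic netlink header, and one DRM_RAS_ATTR_ERROR_ID attribute carrying a u64. The numeric command and attribute values are derived from the enum order in the quoted header. The function name `drm_ras_build_read_one` is hypothetical, and the family ID is assumed to have been resolved beforehand (for example through the generic netlink controller by family name such as "card0").

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values follow the enum order in the quoted include/uapi/drm/drm_netlink.h. */
enum { DRM_RAS_CMD_READ_ONE = 2 };
enum { DRM_RAS_ATTR_ERROR_ID = 4 };
#define DRM_GENL_VERSION 1

/*
 * Hypothetical helper: writes a READ_ONE request into buf, which must be
 * 4-byte aligned. Returns the total message length, or 0 if buf_len is
 * too small. family_id is the runtime-assigned genl family ID.
 */
static size_t drm_ras_build_read_one(void *buf, size_t buf_len,
				     uint16_t family_id, uint32_t seq,
				     uint64_t error_id)
{
	size_t total = NLMSG_LENGTH(GENL_HDRLEN) +
		       NLA_ALIGN(NLA_HDRLEN + sizeof(error_id));
	struct nlmsghdr *nlh = buf;
	struct genlmsghdr *genl;
	struct nlattr *attr;

	if (buf_len < total)
		return 0;

	memset(buf, 0, total);

	/* Generic netlink families are addressed through nlmsg_type. */
	nlh->nlmsg_len = total;
	nlh->nlmsg_type = family_id;
	nlh->nlmsg_flags = NLM_F_REQUEST;
	nlh->nlmsg_seq = seq;

	genl = NLMSG_DATA(nlh);
	genl->cmd = DRM_RAS_CMD_READ_ONE;
	genl->version = DRM_GENL_VERSION;

	/* Single attribute: the 64-bit error ID to read. */
	attr = (struct nlattr *)((char *)genl + GENL_HDRLEN);
	attr->nla_type = DRM_RAS_ATTR_ERROR_ID;
	attr->nla_len = NLA_HDRLEN + sizeof(error_id);
	memcpy((char *)attr + NLA_HDRLEN, &error_id, sizeof(error_id));

	return total;
}
```

The resulting buffer could then be sent over an AF_NETLINK socket opened with NETLINK_GENERIC. On the kernel side, genl_to_dev() in the quoted patch uses the same nlmsg_type value to look up the drm_device in drm_dev_xarray.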



end of thread

Thread overview: 31+ messages
2023-10-20 15:58 [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
2023-10-20 15:58 ` [RFC v4 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
2023-10-20 20:36   ` Ruhl, Michael J
2023-10-21  1:10     ` Aravind Iddamsetty
2023-11-10 12:24   ` Tomer Tayar
2023-11-22 14:32     ` Aravind Iddamsetty
2023-11-23  7:26       ` Tomer Tayar
2023-10-20 15:58 ` [RFC v2 2/5] drm/xe/RAS: Register netlink capability Aravind Iddamsetty
2023-10-20 20:37   ` Ruhl, Michael J
2023-10-20 15:58 ` [RFC v3 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
2023-10-20 20:39   ` Ruhl, Michael J
2023-11-10 12:27   ` Tomer Tayar
2023-11-22 14:33     ` Aravind Iddamsetty
2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
2023-10-20 20:39   ` Ruhl, Michael J
2023-10-20 15:58 ` [RFC v2 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
2023-10-20 20:40   ` Ruhl, Michael J
2023-11-10 12:27   ` Tomer Tayar
2023-11-12 15:28     ` Tomer Tayar
2023-11-22 14:34       ` Aravind Iddamsetty
2023-10-23 15:29 ` [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Alex Deucher
2023-10-24  8:59   ` Zhang, Hawking
2023-10-26  9:27     ` Aravind Iddamsetty
2023-10-26 10:04   ` Lazar, Lijo
2023-10-30  6:19     ` Aravind Iddamsetty
2023-10-30 15:11       ` Lazar, Lijo
2023-11-01  8:06         ` Aravind Iddamsetty
2023-11-07  5:30           ` Lazar, Lijo
2023-11-08  9:24             ` Aravind Iddamsetty
2023-11-10 12:23 ` Tomer Tayar
2023-11-22 14:28   ` Aravind Iddamsetty
