dri-devel.lists.freedesktop.org archive mirror
* [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
@ 2023-05-26 16:20 Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Our hardware supports RAS (Reliability, Availability, Serviceability) by
exposing a set of error counters which observability tools can use to
take corrective actions or trigger repairs. Traditionally these were
exposed via PMU (for relative counters) and a sysfs interface (for
absolute values) in our internal branch. That approach needs two
interfaces and offers neither event-based reporting nor
configurability, so the community suggested trying netlink as a drm
subsystem wide UAPI for RAS and telemetry, as discussed in [1].

[1] is the inspiration for this series. It uses the generic netlink
(genl) subsystem and exposes a set of commands that every drm driver
can use; the framework also provides a means to add custom commands.
Each drm driver instance (in this example, an xe driver instance)
registers a family and operations with the genl subsystem, through
which it enumerates and reports the error counters. Event-based
notification is also supported: userspace can subscribe and be
notified when an error occurs, then read the error counter, which
avoids continuous polling of the counters. This can also be extended
to threshold-based notification.

[1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

This series is on top of https://patchwork.freedesktop.org/series/116181/

Below is an example tool drm_ras which demonstrates the use of the
supported commands. The tool will be sent to ML with the subject
"[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name                                                    config-id               counter

error-gt0-correctable-guc                               0x0000000000000001      0
error-gt0-correctable-slm                               0x0000000000000003      0
error-gt0-correctable-eu-ic                             0x0000000000000004      0
error-gt0-correctable-eu-grf                            0x0000000000000005      0
error-gt0-fatal-guc                                     0x0000000000000009      0
error-gt0-fatal-slm                                     0x000000000000000d      0
error-gt0-fatal-eu-grf                                  0x000000000000000f      0
error-gt0-fatal-fpu                                     0x0000000000000010      0
error-gt0-fatal-tlb                                     0x0000000000000011      0
error-gt0-fatal-l3-fabric                               0x0000000000000012      0
error-gt0-correctable-subslice                          0x0000000000000013      0
error-gt0-correctable-l3bank                            0x0000000000000014      0
error-gt0-fatal-subslice                                0x0000000000000015      0
error-gt0-fatal-l3bank                                  0x0000000000000016      0
error-gt0-sgunit-correctable                            0x0000000000000017      0
error-gt0-sgunit-nonfatal                               0x0000000000000018      0
error-gt0-sgunit-fatal                                  0x0000000000000019      0
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
error-gt0-soc-fatal-punit                               0x000000000000001d      0
error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
error-gt1-correctable-guc                               0x1000000000000001      0
error-gt1-correctable-slm                               0x1000000000000003      0
error-gt1-correctable-eu-ic                             0x1000000000000004      0
error-gt1-correctable-eu-grf                            0x1000000000000005      0
error-gt1-fatal-guc                                     0x1000000000000009      0
error-gt1-fatal-slm                                     0x100000000000000d      0
error-gt1-fatal-eu-grf                                  0x100000000000000f      0
error-gt1-fatal-fpu                                     0x1000000000000010      0
error-gt1-fatal-tlb                                     0x1000000000000011      0
error-gt1-fatal-l3-fabric                               0x1000000000000012      0
error-gt1-correctable-subslice                          0x1000000000000013      0
error-gt1-correctable-l3bank                            0x1000000000000014      0
error-gt1-fatal-subslice                                0x1000000000000015      0
error-gt1-fatal-l3bank                                  0x1000000000000016      0
error-gt1-sgunit-correctable                            0x1000000000000017      0
error-gt1-sgunit-nonfatal                               0x1000000000000018      0
error-gt1-sgunit-fatal                                  0x1000000000000019      0
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
error-gt1-soc-fatal-punit                               0x100000000000001d      0
error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
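
Judging only from the listing above, the config-id appears to carry the
GT index in its top hex digit (0x0... for gt0, 0x1... for gt1) and a
per-GT error index in the low bits. The sketch below splits and
rebuilds an id under that assumption. The macro and function names are
hypothetical, and the 4-bit width of the GT field is a guess from the
output; the authoritative layout is whatever the xe_drm.h part of this
series defines.

```c
#include <stdint.h>

/* Assumption inferred from the drm_ras output: bits 63:60 hold the GT
 * index, bits 59:0 hold the per-GT error index. */
#define DRM_RAS_GT_SHIFT	60
#define DRM_RAS_ERR_MASK	((UINT64_C(1) << DRM_RAS_GT_SHIFT) - 1)

/* Extract the GT index from a config-id. */
static inline unsigned int drm_ras_config_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> DRM_RAS_GT_SHIFT);
}

/* Extract the per-GT error index from a config-id. */
static inline uint64_t drm_ras_config_error(uint64_t config_id)
{
	return config_id & DRM_RAS_ERR_MASK;
}

/* Rebuild a config-id from a GT index and a per-GT error index. */
static inline uint64_t drm_ras_config_id(unsigned int gt, uint64_t error)
{
	return ((uint64_t)gt << DRM_RAS_GT_SHIFT) | (error & DRM_RAS_ERR_MASK);
}
```

With this split, error-gt1-correctable-eu-grf (0x1000000000000005) and
error-gt0-correctable-eu-grf (0x0000000000000005) share error index 0x5
and differ only in the GT field.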

wait on an error event:

$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0

list all errors:

$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name                                                    config-id

error-gt0-correctable-guc                               0x0000000000000001
error-gt0-correctable-slm                               0x0000000000000003
error-gt0-correctable-eu-ic                             0x0000000000000004
error-gt0-correctable-eu-grf                            0x0000000000000005
error-gt0-fatal-guc                                     0x0000000000000009
error-gt0-fatal-slm                                     0x000000000000000d
error-gt0-fatal-eu-grf                                  0x000000000000000f
error-gt0-fatal-fpu                                     0x0000000000000010
error-gt0-fatal-tlb                                     0x0000000000000011
error-gt0-fatal-l3-fabric                               0x0000000000000012
error-gt0-correctable-subslice                          0x0000000000000013
error-gt0-correctable-l3bank                            0x0000000000000014
error-gt0-fatal-subslice                                0x0000000000000015
error-gt0-fatal-l3bank                                  0x0000000000000016
error-gt0-sgunit-correctable                            0x0000000000000017
error-gt0-sgunit-nonfatal                               0x0000000000000018
error-gt0-sgunit-fatal                                  0x0000000000000019
error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
error-gt0-soc-fatal-punit                               0x000000000000001d
error-gt0-soc-fatal-psf-0                               0x000000000000001e
error-gt0-soc-fatal-psf-1                               0x000000000000001f
error-gt0-soc-fatal-psf-2                               0x0000000000000020
error-gt0-soc-fatal-cd0                                 0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
error-gt1-correctable-guc                               0x1000000000000001
error-gt1-correctable-slm                               0x1000000000000003
error-gt1-correctable-eu-ic                             0x1000000000000004
error-gt1-correctable-eu-grf                            0x1000000000000005
error-gt1-fatal-guc                                     0x1000000000000009
error-gt1-fatal-slm                                     0x100000000000000d
error-gt1-fatal-eu-grf                                  0x100000000000000f
error-gt1-fatal-fpu                                     0x1000000000000010
error-gt1-fatal-tlb                                     0x1000000000000011
error-gt1-fatal-l3-fabric                               0x1000000000000012
error-gt1-correctable-subslice                          0x1000000000000013
error-gt1-correctable-l3bank                            0x1000000000000014
error-gt1-fatal-subslice                                0x1000000000000015
error-gt1-fatal-l3bank                                  0x1000000000000016
error-gt1-sgunit-correctable                            0x1000000000000017
error-gt1-sgunit-nonfatal                               0x1000000000000018
error-gt1-sgunit-fatal                                  0x1000000000000019
error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
error-gt1-soc-fatal-punit                               0x100000000000001d
error-gt1-soc-fatal-psf-0                               0x100000000000001e
error-gt1-soc-fatal-psf-1                               0x100000000000001f
error-gt1-soc-fatal-psf-2                               0x1000000000000020
error-gt1-soc-fatal-cd0                                 0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>


Aravind Iddamsetty (5):
  drm/netlink: Add netlink infrastructure
  drm/xe/RAS: Register a genl netlink family
  drm/xe/RAS: Expose the error counters
  drm/netlink: define multicast groups
  drm/xe/RAS: send multicast event on occurrence of an error

 drivers/gpu/drm/xe/Makefile          |   1 +
 drivers/gpu/drm/xe/xe_device.c       |   3 +
 drivers/gpu/drm/xe/xe_device_types.h |   2 +
 drivers/gpu/drm/xe/xe_irq.c          |  32 ++
 drivers/gpu/drm/xe/xe_module.c       |   2 +
 drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_netlink.h      |  14 +
 include/uapi/drm/drm_netlink.h       |  81 +++++
 include/uapi/drm/xe_drm.h            |  64 ++++
 9 files changed, 725 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

-- 
2.25.1



* [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-06-04 17:07   ` [Intel-xe] " Tomer Tayar
  2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Define the netlink commands and attributes that can be used in common
across drm drivers.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 include/uapi/drm/drm_netlink.h

diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
new file mode 100644
index 000000000000..28e7a334d0c7
--- /dev/null
+++ b/include/uapi/drm/drm_netlink.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright 2023 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _DRM_NETLINK_H_
+#define _DRM_NETLINK_H_
+
+#include <linux/netdevice.h>
+#include <net/genetlink.h>
+#include <net/sock.h>
+
+#define DRM_GENL_VERSION 1
+
+enum error_cmds {
+	DRM_CMD_UNSPEC,
+	/* command to list all error names with config-id */
+	DRM_CMD_QUERY,
+	/* command to get a counter for a specific error */
+	DRM_CMD_READ_ONE,
+	/* command to get counters of all errors */
+	DRM_CMD_READ_ALL,
+
+	__DRM_CMD_MAX,
+	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
+};
+
+enum error_attr {
+	DRM_ATTR_UNSPEC,
+	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
+	DRM_ATTR_REQUEST, /* NLA_U8 */
+	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
+	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
+	DRM_ATTR_ERROR_ID, /* NLA_U64 */
+	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
+
+	__DRM_ATTR_MAX,
+	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
+};
+
+/* attribute policies */
+static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
+	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
+	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
+};
+
+#endif
-- 
2.25.1



* [RFC 2/5] drm/xe/RAS: Register a genl netlink family
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-06-04 17:09   ` [Intel-xe] " Tomer Tayar
  2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Use the generic netlink (genl) subsystem to expose the RAS counters to
userspace. We define a family structure and operations and register
them with the genl subsystem, which invokes the operation callbacks
when userspace sends a registered command with the family identifier.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 drivers/gpu/drm/xe/Makefile          |  1 +
 drivers/gpu/drm/xe/xe_device.c       |  3 +
 drivers/gpu/drm/xe/xe_device_types.h |  2 +
 drivers/gpu/drm/xe/xe_module.c       |  2 +
 drivers/gpu/drm/xe/xe_netlink.c      | 89 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_netlink.h      | 14 +++++
 6 files changed, 111 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index b84e191ba14f..2b42165bc824 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -67,6 +67,7 @@ xe-y += xe_bb.o \
 	xe_mmio.o \
 	xe_mocs.o \
 	xe_module.o \
+	xe_netlink.o \
 	xe_pat.o \
 	xe_pci.o \
 	xe_pcode.o \
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 323356a44e7f..aa12ef12d9dc 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -24,6 +24,7 @@
 #include "xe_irq.h"
 #include "xe_mmio.h"
 #include "xe_module.h"
+#include "xe_netlink.h"
 #include "xe_pcode.h"
 #include "xe_pm.h"
 #include "xe_query.h"
@@ -317,6 +318,8 @@ int xe_device_probe(struct xe_device *xe)
 
 	xe_display_register(xe);
 
+	xe_genl_register(xe);
+
 	xe_debugfs_register(xe);
 
 	err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 682ebdd1c09e..c9612a54c48f 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -10,6 +10,7 @@
 
 #include <drm/drm_device.h>
 #include <drm/drm_file.h>
+#include <drm/drm_netlink.h>
 #include <drm/ttm/ttm_device.h>
 
 #include "xe_gt_types.h"
@@ -347,6 +348,7 @@ struct xe_device {
 		u32 lvds_channel_mode;
 	} params;
 #endif
+	struct genl_family xe_genl_family;
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index 6860586ce7f8..1eb73eb9a9a5 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -11,6 +11,7 @@
 #include "xe_drv.h"
 #include "xe_hw_fence.h"
 #include "xe_module.h"
+#include "xe_netlink.h"
 #include "xe_pci.h"
 #include "xe_sched_job.h"
 
@@ -67,6 +68,7 @@ static void __exit xe_exit(void)
 {
 	int i;
 
+	xe_genl_cleanup();
 	xe_unregister_pci_driver();
 
 	for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--)
diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
new file mode 100644
index 000000000000..63ef238ebc27
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2023 Intel Corporation
+ */
+
+#include <drm/drm_managed.h>
+
+#include "xe_device.h"
+
+DEFINE_XARRAY(xe_xarray);
+
+static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info)
+{
+	return 0;
+}
+
+/* operations definition */
+static const struct genl_ops xe_genl_ops[] = {
+	{
+		.cmd = DRM_CMD_QUERY,
+		.doit = xe_genl_list_errors,
+		.policy = drm_attr_policy_query,
+	},
+	{
+		.cmd = DRM_CMD_READ_ONE,
+		.doit = xe_genl_read_error,
+		.policy = drm_attr_policy_read_one,
+	},
+	{
+		.cmd = DRM_CMD_READ_ALL,
+		.doit = xe_genl_list_errors,
+		.policy = drm_attr_policy_query,
+	},
+};
+
+static void xe_genl_deregister(struct drm_device *dev, void *arg)
+{
+	struct xe_device *xe = arg;
+
+	xa_erase(&xe_xarray, xe->xe_genl_family.id);
+
+	drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name);
+
+	genl_unregister_family(&xe->xe_genl_family);
+}
+
+static void xe_genl_family_init(struct xe_device *xe)
+{
+	/* Use the drm primary node name, e.g. card0, to name the genl family */
+	snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name);
+	xe->xe_genl_family.version = DRM_GENL_VERSION;
+	xe->xe_genl_family.parallel_ops = true;
+	xe->xe_genl_family.ops = xe_genl_ops;
+	xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops);
+	xe->xe_genl_family.maxattr = DRM_ATTR_MAX;
+	xe->xe_genl_family.module = THIS_MODULE;
+}
+
+int xe_genl_register(struct xe_device *xe)
+{
+	int ret;
+
+	xe_genl_family_init(xe);
+
+	ret = genl_register_family(&xe->xe_genl_family);
+	if (ret < 0) {
+		drm_warn(&xe->drm, "xe genl family registration failed\n");
+		return ret;
+	}
+
+	drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name);
+
+	xa_store(&xe_xarray, xe->xe_genl_family.id, xe, GFP_KERNEL);
+
+	ret = drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe);
+
+	return ret;
+}
+
+void xe_genl_cleanup(void)
+{
+	/* destroy xarray */
+	xa_destroy(&xe_xarray);
+}
diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h
new file mode 100644
index 000000000000..3bbddb620539
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_netlink.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2021 Intel Corporation
+ */
+
+#ifndef _XE_GENL_H_
+#define _XE_GENL_H_
+
+#include "xe_device.h"
+
+int xe_genl_register(struct xe_device *xe);
+void xe_genl_cleanup(void);
+
+#endif
-- 
2.25.1



* [RFC 3/5] drm/xe/RAS: Expose the error counters
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Expose the error counters supported by the hardware to userspace through
the commands registered with the genl subsystem. DRM_CMD_QUERY lists the
error names with their config IDs, DRM_CMD_READ_ONE returns the counter
value for the requested config ID, and DRM_CMD_READ_ALL lists the counter
values for all errors along with their names and config IDs.
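
As a rough illustration of the config ID carried in DRM_ATTR_ERROR_ID, the
sketch below mirrors the XE_HW_ERROR() packing from the uapi header in this
patch: the error ID sits in the low 60 bits and the GT index in the top 4.
It is a userspace approximation, not part of the patch. make_config() is a
hypothetical helper, uint64_t stands in for __u64, and the two decode
helpers are only named after the kernel-side ones.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi macros added by this patch (xe_drm.h). */
#define __XE_GT_SHIFT	(60)
#define XE_HW_ERROR(gt, id) \
	((uint64_t)(id) | ((uint64_t)(gt) << __XE_GT_SHIFT))
#define XE_GT_ERROR_FATAL_GUC	(9)

/* Hypothetical helper: build a config ID, rejecting values that do not fit. */
static uint64_t make_config(unsigned int gt, uint64_t id)
{
	assert(gt < 16);			/* only 4 bits remain above bit 60 */
	assert(id < (1ULL << __XE_GT_SHIFT));	/* error ID uses the low 60 bits */
	return XE_HW_ERROR(gt, id);
}

/* Extract the GT index from a config ID. */
static unsigned int config_gt_id(uint64_t config)
{
	return (unsigned int)(config >> __XE_GT_SHIFT);
}

/* Extract the error ID from a config ID by masking off the GT bits. */
static uint64_t config_counter(uint64_t config)
{
	return config & ~(~0ULL << __XE_GT_SHIFT);
}
```

For example, make_config(1, XE_GT_ERROR_FATAL_GUC) yields the ID that a
DRM_CMD_READ_ONE request would pass to read the fatal GuC counter on gt1,
assuming gt1 exists on the device.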

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 drivers/gpu/drm/xe/xe_netlink.c | 439 +++++++++++++++++++++++++++++++-
 include/uapi/drm/xe_drm.h       |  64 +++++
 2 files changed, 501 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
index 63ef238ebc27..2a6965f5cde9 100644
--- a/drivers/gpu/drm/xe/xe_netlink.c
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -4,19 +4,451 @@
  */
 
 #include <drm/drm_managed.h>
+#include <drm/xe_drm.h>
 
 #include "xe_device.h"
 
+#define MAX_ERROR_NAME	50
+
+#define HAS_GT_ERROR_VECTORS(xe)	((xe)->info.has_gt_error_vectors)
+#define HAS_MEM_SPARING_SUPPORT(xe)	((xe)->info.has_mem_sparing)
+
 DEFINE_XARRAY(xe_xarray);
 
-static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
+static const char * const xe_hw_error_events[] = {
+		[XE_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng",
+		[XE_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc",
+		[XE_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler",
+		[XE_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm",
+		[XE_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic",
+		[XE_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf",
+		[XE_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist",
+		[XE_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double",
+		[XE_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker",
+		[XE_GT_ERROR_FATAL_GUC] = "fatal-guc",
+		[XE_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity",
+		[XE_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi",
+		[XE_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler",
+		[XE_GT_ERROR_FATAL_SLM] = "fatal-slm",
+		[XE_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic",
+		[XE_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf",
+		[XE_GT_ERROR_FATAL_FPU] = "fatal-fpu",
+		[XE_GT_ERROR_FATAL_TLB] = "fatal-tlb",
+		[XE_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric",
+		[XE_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice",
+		[XE_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank",
+		[XE_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice",
+		[XE_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank",
+		[XE_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable",
+		[XE_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal",
+		[XE_SGUNIT_ERROR_FATAL] = "sgunit-fatal",
+		[XE_SOC_ERROR_FATAL_PSF_CSC_0] = "soc-fatal-psf-csc-0",
+		[XE_SOC_ERROR_FATAL_PSF_CSC_1] = "soc-fatal-psf-csc-1",
+		[XE_SOC_ERROR_FATAL_PSF_CSC_2] = "soc-fatal-psf-csc-2",
+		[XE_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit",
+		[XE_PVC_SOC_ERROR_FATAL_PSF_0] = "soc-fatal-psf-0",
+		[XE_PVC_SOC_ERROR_FATAL_PSF_1] = "soc-fatal-psf-1",
+		[XE_PVC_SOC_ERROR_FATAL_PSF_2] = "soc-fatal-psf-2",
+		[XE_PVC_SOC_ERROR_FATAL_CD0] = "soc-fatal-cd0",
+		[XE_PVC_SOC_ERROR_FATAL_CD0_MDFI] = "soc-fatal-cd0-mdfi",
+		[XE_PVC_SOC_ERROR_FATAL_MDFI_EAST] = "soc-fatal-mdfi-east",
+		[XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH] = "soc-fatal-mdfi-south",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 2)] = "soc-fatal-hbm-ss2-2",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6",
+		[XE_PVC_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7",
+		[XE_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc",
+		[XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown",
+		[XE_GSC_ERROR_NONFATAL_MIA_INT] = "gsc-nonfatal-mia-int",
+		[XE_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc",
+		[XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout",
+		[XE_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity",
+		[XE_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity",
+		[XE_GSC_ERROR_NONFATAL_GLITCH_DET] = "gsc-nonfatal-glitch-det",
+		[XE_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull",
+		[XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check",
+		[XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST] = "gsc-nonfatal-selfmbist",
+		[XE_GSC_ERROR_NONFATAL_AON_PARITY] = "gsc-nonfatal-aon-parity",
+};
+
+static const unsigned long xe_hw_error_map[] = {
+	[XE_GT_ERROR_CORRECTABLE_L3_SNG] = INTEL_GT_HW_ERROR_COR_L3_SNG,
+	[XE_GT_ERROR_CORRECTABLE_GUC] = INTEL_GT_HW_ERROR_COR_GUC,
+	[XE_GT_ERROR_CORRECTABLE_SAMPLER] = INTEL_GT_HW_ERROR_COR_SAMPLER,
+	[XE_GT_ERROR_CORRECTABLE_SLM] = INTEL_GT_HW_ERROR_COR_SLM,
+	[XE_GT_ERROR_CORRECTABLE_EU_IC] = INTEL_GT_HW_ERROR_COR_EU_IC,
+	[XE_GT_ERROR_CORRECTABLE_EU_GRF] = INTEL_GT_HW_ERROR_COR_EU_GRF,
+	[XE_GT_ERROR_FATAL_ARR_BIST] = INTEL_GT_HW_ERROR_FAT_ARR_BIST,
+	[XE_GT_ERROR_FATAL_L3_DOUB] = INTEL_GT_HW_ERROR_FAT_L3_DOUB,
+	[XE_GT_ERROR_FATAL_L3_ECC_CHK] = INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK,
+	[XE_GT_ERROR_FATAL_GUC] = INTEL_GT_HW_ERROR_FAT_GUC,
+	[XE_GT_ERROR_FATAL_IDI_PAR] = INTEL_GT_HW_ERROR_FAT_IDI_PAR,
+	[XE_GT_ERROR_FATAL_SQIDI] = INTEL_GT_HW_ERROR_FAT_SQIDI,
+	[XE_GT_ERROR_FATAL_SAMPLER] = INTEL_GT_HW_ERROR_FAT_SAMPLER,
+	[XE_GT_ERROR_FATAL_SLM] = INTEL_GT_HW_ERROR_FAT_SLM,
+	[XE_GT_ERROR_FATAL_EU_IC] = INTEL_GT_HW_ERROR_FAT_EU_IC,
+	[XE_GT_ERROR_FATAL_EU_GRF] = INTEL_GT_HW_ERROR_FAT_EU_GRF,
+	[XE_GT_ERROR_FATAL_FPU] = INTEL_GT_HW_ERROR_FAT_FPU,
+	[XE_GT_ERROR_FATAL_TLB] = INTEL_GT_HW_ERROR_FAT_TLB,
+	[XE_GT_ERROR_FATAL_L3_FABRIC] = INTEL_GT_HW_ERROR_FAT_L3_FABRIC,
+	[XE_GT_ERROR_CORRECTABLE_SUBSLICE] = INTEL_GT_HW_ERROR_COR_SUBSLICE,
+	[XE_GT_ERROR_CORRECTABLE_L3BANK] = INTEL_GT_HW_ERROR_COR_L3BANK,
+	[XE_GT_ERROR_FATAL_SUBSLICE] = INTEL_GT_HW_ERROR_FAT_SUBSLICE,
+	[XE_GT_ERROR_FATAL_L3BANK] = INTEL_GT_HW_ERROR_FAT_L3BANK,
+	[XE_SGUNIT_ERROR_CORRECTABLE] = HARDWARE_ERROR_CORRECTABLE,
+	[XE_SGUNIT_ERROR_NONFATAL] = HARDWARE_ERROR_NONFATAL,
+	[XE_SGUNIT_ERROR_FATAL] = HARDWARE_ERROR_FATAL,
+	[XE_SOC_ERROR_FATAL_PSF_CSC_0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_0),
+	[XE_SOC_ERROR_FATAL_PSF_CSC_1] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_1),
+	[XE_SOC_ERROR_FATAL_PSF_CSC_2] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_2),
+	[XE_SOC_ERROR_FATAL_PUNIT] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_PUNIT),
+	[XE_PVC_SOC_ERROR_FATAL_PSF_0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_0),
+	[XE_PVC_SOC_ERROR_FATAL_PSF_1] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_1),
+	[XE_PVC_SOC_ERROR_FATAL_PSF_2] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_2),
+	[XE_PVC_SOC_ERROR_FATAL_CD0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0),
+	[XE_PVC_SOC_ERROR_FATAL_CD0_MDFI] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0_MDFI),
+	[XE_PVC_SOC_ERROR_FATAL_MDFI_EAST] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_EAST),
+	[XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_SOUTH),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 0)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_0),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 1)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_1),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 2)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_2),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 3)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_3),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 4)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_4),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 5)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_5),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 6)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_6),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 7)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_7),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 8)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_0),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 9)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_1),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 10)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_2),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 11)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_3),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 12)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_4),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 13)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_5),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 14)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_6),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(0, 15)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_7),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 0)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_0),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 1)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_1),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 2)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_2),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 3)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_3),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 4)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_4),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 5)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_5),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 6)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_6),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 7)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_7),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 8)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_0),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 9)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_1),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 10)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_2),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 11)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_3),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 12)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_4),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 13)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_5),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 14)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_6),
+	[XE_PVC_SOC_ERROR_FATAL_HBM(1, 15)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_7),
+	[XE_GSC_ERROR_CORRECTABLE_SRAM_ECC] = INTEL_GSC_HW_ERROR_COR_SRAM_ECC,
+	[XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = INTEL_GSC_HW_ERROR_UNCOR_MIA_SHUTDOWN,
+	[XE_GSC_ERROR_NONFATAL_MIA_INT] = INTEL_GSC_HW_ERROR_UNCOR_MIA_INT,
+	[XE_GSC_ERROR_NONFATAL_SRAM_ECC] = INTEL_GSC_HW_ERROR_UNCOR_SRAM_ECC,
+	[XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = INTEL_GSC_HW_ERROR_UNCOR_WDG_TIMEOUT,
+	[XE_GSC_ERROR_NONFATAL_ROM_PARITY] = INTEL_GSC_HW_ERROR_UNCOR_ROM_PARITY,
+	[XE_GSC_ERROR_NONFATAL_UCODE_PARITY] = INTEL_GSC_HW_ERROR_UNCOR_UCODE_PARITY,
+	[XE_GSC_ERROR_NONFATAL_GLITCH_DET] = INTEL_GSC_HW_ERROR_UNCOR_GLITCH_DET,
+	[XE_GSC_ERROR_NONFATAL_FUSE_PULL] = INTEL_GSC_HW_ERROR_UNCOR_FUSE_PULL,
+	[XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = INTEL_GSC_HW_ERROR_UNCOR_FUSE_CRC_CHECK,
+	[XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST] = INTEL_GSC_HW_ERROR_UNCOR_SELFMBIST,
+	[XE_GSC_ERROR_NONFATAL_AON_PARITY] = INTEL_GSC_HW_ERROR_UNCOR_AON_PARITY,
+};
+
+static unsigned int config_gt_id(const u64 config)
+{
+	return config >> __XE_GT_SHIFT;
+}
+
+static u64 config_counter(const u64 config)
+{
+	return config & ~(~0ULL << __XE_GT_SHIFT);
+}
+
+static bool is_gt_vector_error(const u64 config)
 {
+	unsigned int error;
+
+	error = config_counter(config);
+	if (error >= XE_GT_ERROR_FATAL_TLB &&
+	    error <= XE_GT_ERROR_FATAL_L3BANK)
+		return true;
+
+	return false;
+}
+
+static bool is_pvc_invalid_gt_errors(const u64 config)
+{
+	switch (config_counter(config)) {
+	case XE_GT_ERROR_CORRECTABLE_L3_SNG:
+	case XE_GT_ERROR_CORRECTABLE_SAMPLER:
+	case XE_GT_ERROR_FATAL_ARR_BIST:
+	case XE_GT_ERROR_FATAL_L3_DOUB:
+	case XE_GT_ERROR_FATAL_L3_ECC_CHK:
+	case XE_GT_ERROR_FATAL_IDI_PAR:
+	case XE_GT_ERROR_FATAL_SQIDI:
+	case XE_GT_ERROR_FATAL_SAMPLER:
+	case XE_GT_ERROR_FATAL_EU_IC:
+		return true;
+	default:
+		return false;
+	}
+}
+
+static bool is_gsc_hw_error(const u64 config)
+{
+	if (config_counter(config) >= XE_GSC_ERROR_CORRECTABLE_SRAM_ECC &&
+	    config_counter(config) <= XE_GSC_ERROR_NONFATAL_AON_PARITY)
+		return true;
+
+	return false;
+}
+
+static bool is_soc_error(const u64 config)
+{
+	if (config_counter(config) >= XE_SOC_ERROR_FATAL_PSF_CSC_0 &&
+	    config_counter(config) <= XE_PVC_SOC_ERROR_FATAL_HBM(1, 15))
+		return true;
+
+	return false;
+}
+
+static int
+config_status(struct xe_device *xe, u64 config)
+{
+	unsigned int gt_id = config_gt_id(config);
+
+	if (!IS_DGFX(xe))
+		return -ENODEV;
+
+	if (xe->gt[gt_id].info.type == XE_GT_TYPE_UNINITIALIZED)
+		return -ENOENT;
+
+	/* GSC HW ERRORS are present on root tile of
+	 * platform supporting MEMORY SPARING only
+	 */
+	if (is_gsc_hw_error(config) && !(HAS_MEM_SPARING_SUPPORT(xe) && gt_id == 0))
+		return -ENODEV;
+
+	/* GT vector errors are valid only on platforms supporting error vectors */
+	if (is_gt_vector_error(config) && !HAS_GT_ERROR_VECTORS(xe))
+		return -ENODEV;
+
+	/* Skip gt errors not supported on pvc */
+	if (is_pvc_invalid_gt_errors(config) && (xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	/* FATAL FPU error is valid on PVC only */
+	if (config_counter(config) == XE_GT_ERROR_FATAL_FPU &&
+	    !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	if (is_soc_error(config) && !(xe->info.platform == XE_PVC))
+		return -ENODEV;
+
+	return (config_counter(config) >=
+			ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0;
+}
+
+static u64 get_counter_value(struct xe_device *xe, u64 config)
+{
+	const unsigned int gt_id = config_gt_id(config);
+	unsigned int id = config_counter(config);
+
+	if (is_soc_error(config))
+		return xa_to_value(xa_load(&xe->gt[gt_id].errors.soc, xe_hw_error_map[id]));
+	else if (is_gsc_hw_error(config))
+		return xe->gt[gt_id].errors.gsc_hw[xe_hw_error_map[id]];
+	else if (id >= XE_SGUNIT_ERROR_CORRECTABLE &&
+		 id <= XE_SGUNIT_ERROR_FATAL)
+		return xe->gt[gt_id].errors.sgunit[xe_hw_error_map[id]];
+	else
+		return xe->gt[gt_id].errors.hw[xe_hw_error_map[id]];
+
 	return 0;
 }
 
-static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info)
+static struct xe_device *genl_to_xe(struct genl_info *info)
+{
+	return xa_load(&xe_xarray, info->nlhdr->nlmsg_type);
+}
+
+static int xe_genl_send(struct sk_buff *msg, struct genl_info *info, void *usrhdr)
 {
+	int ret;
+
+	genlmsg_end(msg, usrhdr);
+
+	ret = genlmsg_reply(msg, info);
+	if (ret)
+		nlmsg_free(msg);
+
+	return ret;
+}
+
+static struct sk_buff *
+xe_genl_alloc_msg(struct xe_device *xe,
+		  struct genl_info *info,
+		  size_t msg_size, void **usrhdr)
+{
+	struct sk_buff *new_msg;
+
+	new_msg = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!new_msg)
+		return new_msg;
+
+	*usrhdr = genlmsg_put_reply(new_msg, info, &xe->xe_genl_family, 0, info->genlhdr->cmd);
+	if (!*usrhdr) {
+		nlmsg_free(new_msg);
+		new_msg = NULL;
+	}
+
+	return new_msg;
+}
+
+int fill_error_details(struct genl_info *info, struct sk_buff *new_msg)
+{
+	struct xe_device *xe = genl_to_xe(info);
+	struct nlattr *entry_attr;
+	struct xe_gt *gt;
+	int i, j;
+	bool counter = false;
+
+	if (info->genlhdr->cmd == DRM_CMD_READ_ALL)
+		counter = true;
+
+	entry_attr = nla_nest_start(new_msg, DRM_ATTR_QUERY_REPLY);
+	if (!entry_attr)
+		return -EMSGSIZE;
+
+	for_each_gt(gt, xe, j) {
+		char str[MAX_ERROR_NAME];
+		u64 val;
+
+		for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) {
+			u64 config = XE_HW_ERROR(j, i);
+
+			if (config_status(xe, config))
+				continue;
+
+			/* should this be cleared every time? */
+			snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]);
+
+			if (nla_put_string(new_msg, DRM_ATTR_ERROR_NAME, str))
+				goto err;
+			if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_ID, config, DRM_ATTR_PAD))
+				goto err;
+			if (counter) {
+				val = get_counter_value(xe, config);
+				if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD))
+					goto err;
+			}
+		}
+	}
+
+	nla_nest_end(new_msg, entry_attr);
+
 	return 0;
+err:
+	drm_dbg_driver(&xe->drm, "msg buff is small\n");
+	nla_nest_cancel(new_msg, entry_attr);
+	nlmsg_free(new_msg);
+
+	return -EMSGSIZE;
+}
+
+static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
+{
+	struct xe_device *xe = genl_to_xe(info);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	struct sk_buff *new_msg;
+	void *usrhdr;
+	int ret = 0;
+	int retries = 2;
+
+	if (GENL_REQ_ATTR_CHECK(info, DRM_ATTR_REQUEST))
+		return -EINVAL;
+
+	do {
+		new_msg = xe_genl_alloc_msg(xe, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		ret = fill_error_details(info, new_msg);
+		if (!ret)
+			break;
+
+		msg_size += NLMSG_DEFAULT_SIZE;
+	} while (retries--);
+
+	if (!ret)
+		ret = xe_genl_send(new_msg, info, usrhdr);
+
+	return ret;
+}
+
+static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info)
+{
+	struct xe_device *xe = genl_to_xe(info);
+	size_t msg_size = NLMSG_DEFAULT_SIZE;
+	struct sk_buff *new_msg;
+	void *usrhdr;
+	int ret = 0;
+	int retries = 2;
+	u64 config, val;
+
+	if (GENL_REQ_ATTR_CHECK(info, DRM_ATTR_ERROR_ID))
+		return -EINVAL;
+
+	config = nla_get_u64(info->attrs[DRM_ATTR_ERROR_ID]);
+	ret = config_status(xe, config);
+	if (ret)
+		return ret;
+	do {
+		new_msg = xe_genl_alloc_msg(xe, info, msg_size, &usrhdr);
+		if (!new_msg)
+			return -ENOMEM;
+
+		val = get_counter_value(xe, config);
+		if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) {
+			msg_size += NLMSG_DEFAULT_SIZE;
+			continue;
+		}
+
+		break;
+	} while (retries--);
+
+	ret = xe_genl_send(new_msg, info, usrhdr);
+
+	return ret;
 }
 
 /* operations definition */
@@ -65,6 +497,9 @@ int xe_genl_register(struct xe_device *xe)
 {
 	int ret;
 
+	BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) !=
+		     ARRAY_SIZE(xe_hw_error_map));
+
 	xe_genl_family_init(xe);
 
 	ret = genl_register_family(&xe->xe_genl_family);
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index b0b80aae3ee8..a2ea238096df 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -801,6 +801,70 @@ struct drm_xe_vm_madvise {
 	__u64 reserved[2];
 };
 
+/*
+ * HW error IDs
+ */
+
+#define __XE_GT_SHIFT	(60)
+
+#define XE_HW_ERROR(gt, id) \
+	((id) | ((__u64)(gt) << __XE_GT_SHIFT))
+
+#define XE_GT_ERROR_CORRECTABLE_L3_SNG		(0)
+#define XE_GT_ERROR_CORRECTABLE_GUC		(1)
+#define XE_GT_ERROR_CORRECTABLE_SAMPLER		(2)
+#define XE_GT_ERROR_CORRECTABLE_SLM		(3)
+#define XE_GT_ERROR_CORRECTABLE_EU_IC		(4)
+#define XE_GT_ERROR_CORRECTABLE_EU_GRF		(5)
+#define XE_GT_ERROR_FATAL_ARR_BIST		(6)
+#define XE_GT_ERROR_FATAL_L3_DOUB		(7)
+#define XE_GT_ERROR_FATAL_L3_ECC_CHK		(8)
+#define XE_GT_ERROR_FATAL_GUC			(9)
+#define XE_GT_ERROR_FATAL_IDI_PAR		(10)
+#define XE_GT_ERROR_FATAL_SQIDI			(11)
+#define XE_GT_ERROR_FATAL_SAMPLER		(12)
+#define XE_GT_ERROR_FATAL_SLM			(13)
+#define XE_GT_ERROR_FATAL_EU_IC			(14)
+#define XE_GT_ERROR_FATAL_EU_GRF		(15)
+#define XE_GT_ERROR_FATAL_FPU			(16)
+#define XE_GT_ERROR_FATAL_TLB			(17)
+#define XE_GT_ERROR_FATAL_L3_FABRIC		(18)
+#define XE_GT_ERROR_CORRECTABLE_SUBSLICE	(19)
+#define XE_GT_ERROR_CORRECTABLE_L3BANK		(20)
+#define XE_GT_ERROR_FATAL_SUBSLICE		(21)
+#define XE_GT_ERROR_FATAL_L3BANK		(22)
+#define XE_SGUNIT_ERROR_CORRECTABLE		(23)
+#define XE_SGUNIT_ERROR_NONFATAL		(24)
+#define XE_SGUNIT_ERROR_FATAL			(25)
+#define XE_SOC_ERROR_FATAL_PSF_CSC_0		(26)
+#define XE_SOC_ERROR_FATAL_PSF_CSC_1		(27)
+#define XE_SOC_ERROR_FATAL_PSF_CSC_2		(28)
+#define XE_SOC_ERROR_FATAL_PUNIT		(29)
+#define XE_PVC_SOC_ERROR_FATAL_PSF_0		(30)
+#define XE_PVC_SOC_ERROR_FATAL_PSF_1		(31)
+#define XE_PVC_SOC_ERROR_FATAL_PSF_2		(32)
+#define XE_PVC_SOC_ERROR_FATAL_CD0		(33)
+#define XE_PVC_SOC_ERROR_FATAL_CD0_MDFI		(34)
+#define XE_PVC_SOC_ERROR_FATAL_MDFI_EAST	(35)
+#define XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH	(36)
+
+#define XE_PVC_SOC_ERROR_FATAL_HBM(ss, n)\
+		(XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH + 0x1 + (ss) * 0x10 + (n))
+
+/* 68 is the last ID used by SOC errors */
+#define XE_GSC_ERROR_CORRECTABLE_SRAM_ECC	(69)
+#define XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN	(70)
+#define XE_GSC_ERROR_NONFATAL_MIA_INT		(71)
+#define XE_GSC_ERROR_NONFATAL_SRAM_ECC		(72)
+#define XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT	(73)
+#define XE_GSC_ERROR_NONFATAL_ROM_PARITY	(74)
+#define XE_GSC_ERROR_NONFATAL_UCODE_PARITY	(75)
+#define XE_GSC_ERROR_NONFATAL_GLITCH_DET	(76)
+#define XE_GSC_ERROR_NONFATAL_FUSE_PULL		(77)
+#define XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK	(78)
+#define XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST	(79)
+#define XE_GSC_ERROR_NONFATAL_AON_PARITY	(80)
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.25.1



* [RFC 4/5] drm/netlink: define multicast groups
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (2 preceding siblings ...)
  2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

The netlink subsystem supports event notifications to userspace. Define
two multicast groups, one for correctable and one for uncorrectable errors,
to which userspace can subscribe and be notified when any of those errors
occur. The group names are local to the driver's genl netlink family.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 include/uapi/drm/drm_netlink.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
index 28e7a334d0c7..bd3a8b293979 100644
--- a/include/uapi/drm/drm_netlink.h
+++ b/include/uapi/drm/drm_netlink.h
@@ -29,6 +29,8 @@
 #include <net/sock.h>
 
 #define DRM_GENL_VERSION 1
+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
 
 enum error_cmds {
 	DRM_CMD_UNSPEC,
@@ -38,6 +40,7 @@ enum error_cmds {
 	DRM_CMD_READ_ONE,
 	/* command to get counters of all errors */
 	DRM_CMD_READ_ALL,
+	DRM_CMD_ERROR_EVENT,
 
 	__DRM_CMD_MAX,
 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
@@ -65,4 +68,14 @@ static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
 	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
 };
 
+enum mcgrps_events {
+	DRM_GENL_MCAST_CORR_ERR,
+	DRM_GENL_MCAST_UNCORR_ERR,
+};
+
+static const struct genl_multicast_group drm_event_mcgrps[] = {
+	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
+};
+
 #endif
-- 
2.25.1



* [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (3 preceding siblings ...)
  2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Whenever a correctable or an uncorrectable error occurs, send an event to
the listeners subscribed to the corresponding multicast group.
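
The group selection can be sketched in isolation as below. The group names
and the mcgrps_events enum are copied from patch 4/5. The numeric values of
enum hardware_error are an assumption (the series does not show them); the
only property relied on is that the correctable severity is 0, which the
`hw_err ? ... : ...` test in this patch implies. mcgrp_for_error() and
mcgrp_name_for_error() are hypothetical helpers for illustration only.

```c
/* Group names as defined in patch 4/5. */
#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"

enum mcgrps_events {
	DRM_GENL_MCAST_CORR_ERR,
	DRM_GENL_MCAST_UNCORR_ERR,
};

/* Assumed values: only HARDWARE_ERROR_CORRECTABLE == 0 matters here. */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

static const char * const mcgrp_names[] = {
	[DRM_GENL_MCAST_CORR_ERR] = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR,
	[DRM_GENL_MCAST_UNCORR_ERR] = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR,
};

/* Any non-zero severity (non-fatal or fatal) goes to the uncorrectable group. */
static enum mcgrps_events mcgrp_for_error(enum hardware_error hw_err)
{
	return hw_err ? DRM_GENL_MCAST_UNCORR_ERR : DRM_GENL_MCAST_CORR_ERR;
}

/* Name of the group a userspace listener would subscribe to for hw_err. */
static const char *mcgrp_name_for_error(enum hardware_error hw_err)
{
	return mcgrp_names[mcgrp_for_error(hw_err)];
}
```

Under these assumptions both non-fatal and fatal errors are delivered to
"drm_uncorr_err", so a listener that needs to tell them apart would still
have to read the counters after being notified.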

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 drivers/gpu/drm/xe/xe_irq.c     | 32 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_netlink.c |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c
index 226be96e341a..1b415c8585a4 100644
--- a/drivers/gpu/drm/xe/xe_irq.c
+++ b/drivers/gpu/drm/xe/xe_irq.c
@@ -1073,6 +1073,37 @@ xe_gsc_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 	xe_mmio_write32(gt, GSC_HEC_CORR_UNCORR_ERR_STATUS(base, hw_err).reg, err_status);
 }
 
+static void generate_netlink_event(struct xe_gt *gt, const enum hardware_error hw_err)
+{
+	struct xe_device *xe = gt->xe;
+	struct sk_buff *msg;
+	void *hdr;
+
+	if (!xe->xe_genl_family.module)
+		return;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC);
+	if (!msg) {
+		drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n");
+		return;
+	}
+
+	hdr = genlmsg_put(msg, 0, 0, &xe->xe_genl_family, 0, DRM_CMD_ERROR_EVENT);
+	if (!hdr) {
+		drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n");
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	genlmsg_multicast(&xe->xe_genl_family, msg, 0,
+			  hw_err ?
+			  DRM_GENL_MCAST_UNCORR_ERR
+			  : DRM_GENL_MCAST_CORR_ERR,
+			  GFP_ATOMIC);
+}
+
 static void
 xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 {
@@ -1103,6 +1134,7 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err)
 
 	xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc);
 
+	generate_netlink_event(gt, hw_err);
 out_unlock:
 	spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags);
 }
diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
index 2a6965f5cde9..0c1d51e1a9a5 100644
--- a/drivers/gpu/drm/xe/xe_netlink.c
+++ b/drivers/gpu/drm/xe/xe_netlink.c
@@ -490,6 +490,8 @@ static void xe_genl_family_init(struct xe_device *xe)
 	xe->xe_genl_family.ops = xe_genl_ops;
 	xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops);
 	xe->xe_genl_family.maxattr = DRM_ATTR_MAX;
+	xe->xe_genl_family.mcgrps = drm_event_mcgrps;
+	xe->xe_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
 	xe->xe_genl_family.module = THIS_MODULE;
 }
 
-- 
2.25.1



* Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (4 preceding siblings ...)
  2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
@ 2023-06-04 17:07 ` Tomer Tayar
  2023-06-05 17:17   ` Iddamsetty, Aravind
  2023-06-05 16:47 ` Alex Deucher
  2023-06-21 17:24 ` Sebastian Wick
  7 siblings, 1 reply; 18+ messages in thread
From: Tomer Tayar @ 2023-06-04 17:07 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 26/05/2023 19:20, Aravind Iddamsetty wrote:
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally there were
> being exposed via PMU (for relative counters) and sysfs interface (for
> absolute value) in our internal branch. But, due to the limitations in
> this approach to use two interfaces and also not able to have an event
> based reporting or configurability, an alternative approach to try
> netlink was suggested by community for drm subsystem wide UAPI for RAS
> and telemetry as discussed in [1].
>
> This [1] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too. Each drm driver instance in this example xe driver
> instance registers a family and operations to the genl subsystem through
> which it enumerates and reports the error counters. An event based
> notification is also supported to which userpace can subscribe to and
> be notified when any error occurs and read the error counter this avoids
> continuous polling on error counter. This can also be extended to
> threshold based notification.

Hi Aravind,

The habanalabs driver is another candidate to use this netlink-based drm 
framework.
As a single-user device, we have an additional "control" device that 
allows multiple applications to query for information and to monitor the 
"compute" device.
And while we are about to move the compute device to the accel nodes, we 
don't have a real replacement there for the control device.

Another possible use of this framework for habanalabs is event
notification.
Currently we have an eventfd-based mechanism: after being notified
about an event, the user starts querying for the event and the relevant
info, usually in several requests.
With this framework it should supposedly be possible to deliver all
relevant info together with the event itself.

The current implementation seems aimed more at errors (and quite
"tailored" to Xe needs ...), while in habanalabs we would also need it
for non-error static/dynamic info.
Maybe we should revise the existing commands/attributes to be more generic?

Moreover, the drm part is very small, while most of the netlink "mess" 
is still done by the specific driver.
So what is the added value in making it a "drm framework"? Do we enforce 
something here for drm drivers that use it? Do we help them with simpler 
APIs and hiding the internals of netlink?
Maybe it would be worth moving some functionality from the Xe driver 
into drm helpers?

Thanks,
Tomer




* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2023-06-04 17:07   ` Tomer Tayar
  2023-06-05 17:18     ` Iddamsetty, Aravind
  0 siblings, 1 reply; 18+ messages in thread
From: Tomer Tayar @ 2023-06-04 17:07 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 26/05/2023 19:20, Aravind Iddamsetty wrote:
> Define the netlink commands and attributes that can be commonly used
> across by drm drivers.
>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> ---
>   include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>   1 file changed, 68 insertions(+)
>   create mode 100644 include/uapi/drm/drm_netlink.h
>
> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..28e7a334d0c7
> --- /dev/null
> +++ b/include/uapi/drm/drm_netlink.h
> @@ -0,0 +1,68 @@
> +/*
> + * Copyright 2023 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + */
> +
> +#ifndef _DRM_NETLINK_H_
> +#define _DRM_NETLINK_H_
> +
> +#include <linux/netdevice.h>
> +#include <net/genetlink.h>
> +#include <net/sock.h>

This is a uapi header.
Are all of the header files included here available to userspace?
Also, should we add the '#if defined(__cplusplus) extern "C" { ...' guard here?
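
For illustration only, a minimal sketch of how the header could look with
just the userspace-visible definitions and the C++ guard. The enum names
and values are taken from the patch; the guard placement is an assumption
modelled on existing drm uapi headers, and the kernel-only includes and
nla_policy tables are assumed to move to a kernel-internal header.

```c
/*
 * Sketch of a userspace-safe include/uapi/drm/drm_netlink.h.
 * Assumption: kernel-only includes (<net/genetlink.h>, <net/sock.h>) and
 * the nla_policy tables live in a separate, non-uapi header.
 */
#ifndef _DRM_NETLINK_H_
#define _DRM_NETLINK_H_

#if defined(__cplusplus)
extern "C" {
#endif

#define DRM_GENL_VERSION 1

enum error_cmds {
	DRM_CMD_UNSPEC,
	/* command to list all error names with config-id */
	DRM_CMD_QUERY,
	/* command to get a counter for a specific error */
	DRM_CMD_READ_ONE,
	/* command to get counters of all errors */
	DRM_CMD_READ_ALL,

	__DRM_CMD_MAX,
	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST, /* NLA_U8 */
	DRM_ATTR_QUERY_REPLY, /* NLA_NESTED */
	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
	DRM_ATTR_ERROR_ID, /* NLA_U64 */
	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

#if defined(__cplusplus)
}
#endif

#endif /* _DRM_NETLINK_H_ */
```

With this layout the header depends on nothing outside itself, so a
userspace tool can include it without any kernel headers.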

> +
> +#define DRM_GENL_VERSION 1
> +
> +enum error_cmds {
> +	DRM_CMD_UNSPEC,
> +	/* command to list all errors names with config-id */
> +	DRM_CMD_QUERY,
> +	/* command to get a counter for a specific error */
> +	DRM_CMD_READ_ONE,
> +	/* command to get counters of all errors */
> +	DRM_CMD_READ_ALL,
> +
> +	__DRM_CMD_MAX,
> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
> +};
> +
> +enum error_attr {
> +	DRM_ATTR_UNSPEC,
> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
> +	DRM_ATTR_REQUEST, /* NLA_U8 */
> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/

Missing spaces in /*NLA_NESTED*/

> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
> +
> +	__DRM_ATTR_MAX,
> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +/* attribute policies */
> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
> +};

Should these policy structures be in uapi?

> +
> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
> +};

I might be missing something here, but why is it not a single policy
structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?
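
To make the question concrete, a merged table might look like the sketch
below. The name drm_attr_policy is hypothetical. So that the fragment
compiles outside the kernel, struct nla_policy and the NLA_* types are
replaced by small stand-ins that mirror the start of the kernel's enum;
in the driver they would come from <net/netlink.h>.

```c
/* Userspace stand-ins (assumption: only needed to build this sketch). */
enum { NLA_UNSPEC, NLA_U8, NLA_U16, NLA_U32, NLA_U64 };
struct nla_policy {
	unsigned char type;
};

/* Attribute IDs as defined in the patch. */
enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/*
 * Hypothetical single policy shared by all commands. Entries that are not
 * listed stay zero-initialized, i.e. NLA_UNSPEC.
 */
static const struct nla_policy drm_attr_policy[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_REQUEST]  = { .type = NLA_U8 },
	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
};
```

One trade-off: with a single table, both attributes pass validation for
every command, so each handler still has to check that the attribute it
needs is actually present.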

Thanks,
Tomer

> +
> +#endif




* Re: [Intel-xe] [RFC 2/5] drm/xe/RAS: Register a genl netlink family
  2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
@ 2023-06-04 17:09   ` Tomer Tayar
  2023-06-05 17:21     ` Iddamsetty, Aravind
  0 siblings, 1 reply; 18+ messages in thread
From: Tomer Tayar @ 2023-06-04 17:09 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 26/05/2023 19:20, Aravind Iddamsetty wrote:
> Use the generic netlink(genl) subsystem to expose the RAS counters to
> userspace. We define a family structure and operations and register to
> genl subsystem and these callbacks will be invoked by genl subsystem when
> userspace sends a registered command with a family identifier to genl
> subsystem.
>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> ---
>   drivers/gpu/drm/xe/Makefile          |  1 +
>   drivers/gpu/drm/xe/xe_device.c       |  3 +
>   drivers/gpu/drm/xe/xe_device_types.h |  2 +
>   drivers/gpu/drm/xe/xe_module.c       |  2 +
>   drivers/gpu/drm/xe/xe_netlink.c      | 89 ++++++++++++++++++++++++++++
>   drivers/gpu/drm/xe/xe_netlink.h      | 14 +++++
>   6 files changed, 111 insertions(+)
>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index b84e191ba14f..2b42165bc824 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -67,6 +67,7 @@ xe-y += xe_bb.o \
>   	xe_mmio.o \
>   	xe_mocs.o \
>   	xe_module.o \
> +	xe_netlink.o \
>   	xe_pat.o \
>   	xe_pci.o \
>   	xe_pcode.o \
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 323356a44e7f..aa12ef12d9dc 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -24,6 +24,7 @@
>   #include "xe_irq.h"
>   #include "xe_mmio.h"
>   #include "xe_module.h"
> +#include "xe_netlink.h"
>   #include "xe_pcode.h"
>   #include "xe_pm.h"
>   #include "xe_query.h"
> @@ -317,6 +318,8 @@ int xe_device_probe(struct xe_device *xe)
>   
>   	xe_display_register(xe);
>   
> +	xe_genl_register(xe);

xe_genl_register() can fail, but its return value is ignored here.
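
A minimal sketch of the kind of check meant here, written as a standalone
mock so that it compiles outside the kernel. The struct xe_device below,
the two stub functions, and xe_device_probe_tail() are stand-ins invented
for illustration; the real probe path would also have to unwind earlier
steps through its existing error handling, and whether a genl failure
should abort probe at all is a decision for the author.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the driver's device structure; only what the sketch needs. */
struct xe_device {
	bool genl_should_fail;		/* hypothetical knob to simulate a failure */
	bool debugfs_registered;
};

/* Stub standing in for the xe_genl_register() added by this patch. */
int xe_genl_register(struct xe_device *xe)
{
	return xe->genl_should_fail ? -ENOMEM : 0;
}

/* Stub standing in for xe_debugfs_register(). */
void xe_debugfs_register(struct xe_device *xe)
{
	xe->debugfs_registered = true;
}

/* Hypothetical tail of the probe routine: propagate the error instead of
 * silently continuing when the genl family could not be registered. */
int xe_device_probe_tail(struct xe_device *xe)
{
	int err;

	err = xe_genl_register(xe);
	if (err)
		return err;

	xe_debugfs_register(xe);

	return 0;
}
```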

> +
>   	xe_debugfs_register(xe);
>   
>   	err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 682ebdd1c09e..c9612a54c48f 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -10,6 +10,7 @@
>   
>   #include <drm/drm_device.h>
>   #include <drm/drm_file.h>
> +#include <drm/drm_netlink.h>
>   #include <drm/ttm/ttm_device.h>
>   
>   #include "xe_gt_types.h"
> @@ -347,6 +348,7 @@ struct xe_device {
>   		u32 lvds_channel_mode;
>   	} params;
>   #endif
> +	struct genl_family xe_genl_family;

Should it be added above, before the "private" section?
Maybe add a kernel-doc comment for it?

>   };
>   
>   /**
> diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
> index 6860586ce7f8..1eb73eb9a9a5 100644
> --- a/drivers/gpu/drm/xe/xe_module.c
> +++ b/drivers/gpu/drm/xe/xe_module.c
> @@ -11,6 +11,7 @@
>   #include "xe_drv.h"
>   #include "xe_hw_fence.h"
>   #include "xe_module.h"
> +#include "xe_netlink.h"
>   #include "xe_pci.h"
>   #include "xe_sched_job.h"
>   
> @@ -67,6 +68,7 @@ static void __exit xe_exit(void)
>   {
>   	int i;
>   
> +	xe_genl_cleanup();
>   	xe_unregister_pci_driver();
>   
>   	for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--)
> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
> new file mode 100644
> index 000000000000..63ef238ebc27
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_netlink.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2023 Intel Corporation
> + */
> +
> +#include <drm/drm_managed.h>
> +
> +#include "xe_device.h"
> +
> +DEFINE_XARRAY(xe_xarray);

xe_xarray sounds too generic. Maybe it should be more specific, like
xe_genl_xarray?
In addition, it should probably be static.

Thanks,
Tomer

> +
> +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
> +{
> +	return 0;
> +}
> +
> +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info)
> +{
> +	return 0;
> +}
> +
> +/* operations definition */
> +static const struct genl_ops xe_genl_ops[] = {
> +	{
> +		.cmd = DRM_CMD_QUERY,
> +		.doit = xe_genl_list_errors,
> +		.policy = drm_attr_policy_query,
> +	},
> +	{
> +		.cmd = DRM_CMD_READ_ONE,
> +		.doit = xe_genl_read_error,
> +		.policy = drm_attr_policy_read_one,
> +	},
> +	{
> +		.cmd = DRM_CMD_READ_ALL,
> +		.doit = xe_genl_list_errors,
> +		.policy = drm_attr_policy_query,
> +	},
> +};
> +
> +static void xe_genl_deregister(struct drm_device *dev,  void *arg)
> +{
> +	struct xe_device *xe = arg;
> +
> +	xa_erase(&xe_xarray, xe->xe_genl_family.id);
> +
> +	drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name);
> +
> +	genl_unregister_family(&xe->xe_genl_family);
> +}
> +
> +static void xe_genl_family_init(struct xe_device *xe)
> +{
> +	/* Use drm primary node name eg: card0 to name the genl family */
> +	snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name);
> +	xe->xe_genl_family.version = DRM_GENL_VERSION;
> +	xe->xe_genl_family.parallel_ops = true;
> +	xe->xe_genl_family.ops = xe_genl_ops;
> +	xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops);
> +	xe->xe_genl_family.maxattr = DRM_ATTR_MAX;
> +	xe->xe_genl_family.module = THIS_MODULE;
> +}
> +
> +int xe_genl_register(struct xe_device *xe)
> +{
> +	int ret;
> +
> +	xe_genl_family_init(xe);
> +
> +	ret = genl_register_family(&xe->xe_genl_family);
> +	if (ret < 0) {
> +		drm_warn(&xe->drm, "xe genl family registration failed\n");
> +		return ret;
> +	}
> +
> +	drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name);
> +
> +	xa_store(&xe_xarray, xe->xe_genl_family.id, xe, GFP_KERNEL);
> +
> +	ret = drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe);
> +
> +	return ret;
> +}
> +
> +void xe_genl_cleanup(void)
> +{
> +	/* destroy xarray */
> +	xa_destroy(&xe_xarray);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h
> new file mode 100644
> index 000000000000..3bbddb620539
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_netlink.h
> @@ -0,0 +1,14 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2021 Intel Corporation
> + */
> +
> +#ifndef _XE_GENL_H_
> +#define _XE_GENL_H_
> +
> +#include "xe_device.h"
> +
> +int xe_genl_register(struct xe_device *xe);
> +void xe_genl_cleanup(void);
> +
> +#endif




* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (5 preceding siblings ...)
  2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
@ 2023-06-05 16:47 ` Alex Deucher
  2023-06-06 11:56   ` Iddamsetty, Aravind
  2023-06-21 17:24 ` Sebastian Wick
  7 siblings, 1 reply; 18+ messages in thread
From: Alex Deucher @ 2023-06-05 16:47 UTC (permalink / raw)
  To: Aravind Iddamsetty, Hawking Zhang, Harish Kasiviswanathan,
	Kuehling, Felix, Tuikov, Luben
  Cc: alexander.deucher, ogabbay, intel-xe, dri-devel

Adding the relevant AMD folks for RAS.  We currently expose RAS via
sysfs, but also have an event interface in KFD which may be somewhat
similar to this.

If we were to converge on a common RAS interface, would we want to
look at any commonality in bad page storage/reporting for device
memory?

Alex

On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty
<aravind.iddamsetty@intel.com> wrote:
>
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally there were
> being exposed via PMU (for relative counters) and sysfs interface (for
> absolute value) in our internal branch. But, due to the limitations in
> this approach to use two interfaces and also not able to have an event
> based reporting or configurability, an alternative approach to try
> netlink was suggested by community for drm subsystem wide UAPI for RAS
> and telemetry as discussed in [1].
>
> This [1] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too. Each drm driver instance in this example xe driver
> instance registers a family and operations to the genl subsystem through
> which it enumerates and reports the error counters. An event based
> notification is also supported to which userpace can subscribe to and
> be notified when any error occurs and read the error counter this avoids
> continuous polling on error counter. This can also be extended to
> threshold based notification.
>
> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/116181/
>
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject
> "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
>
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>
> wait on a error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
>
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
>
>
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register a genl netlink family
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
>
>  drivers/gpu/drm/xe/Makefile          |   1 +
>  drivers/gpu/drm/xe/xe_device.c       |   3 +
>  drivers/gpu/drm/xe/xe_device_types.h |   2 +
>  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
>  drivers/gpu/drm/xe/xe_module.c       |   2 +
>  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
>  include/uapi/drm/drm_netlink.h       |  81 +++++
>  include/uapi/drm/xe_drm.h            |  64 ++++
>  9 files changed, 725 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
> --
> 2.25.1
>


* Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
@ 2023-06-05 17:17   ` Iddamsetty, Aravind
  0 siblings, 0 replies; 18+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-05 17:17 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay



On 04-06-2023 22:37, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally there were
>> being exposed via PMU (for relative counters) and sysfs interface (for
>> absolute value) in our internal branch. But, due to the limitations in
>> this approach to use two interfaces and also not able to have an event
>> based reporting or configurability, an alternative approach to try
>> netlink was suggested by community for drm subsystem wide UAPI for RAS
>> and telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported to which userpace can subscribe to and
>> be notified when any error occurs and read the error counter this avoids
>> continuous polling on error counter. This can also be extended to
>> threshold based notification.
> 
> Hi Aravind,

Hi Tomer,

Thanks a lot for your review.
> 
> The habanalabs driver is another candidate to use this netlink-based drm 
> framework.
> As a single-user device, we have an additional "control" device that 
> allows multiple applications to query for information and to monitor the 
> "compute" device.
> And while we are about to move the compute device to the accel nodes, we 
> don't have a real replacement there for the control device.
> 
> Another possible usage of this framework for habanalabs is the events 
> notification.
> Currently we have an eventfd-based mechanism, and after being notified 
> about an event, user starts querying about the event and the relevant 
> info, usually in several requests.
> With this framework we should be allegedly possible to gather all 
> relevant info together with the event itself.

That is right; with the multicast event we can pack data too.
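
To illustrate packing data into the event itself, here is a minimal
userspace sketch, not the code from this series. It lays out two u64
attributes in the same length/type/padded-payload form that netlink
attributes use, so an event can carry the error id and the counter
value in one message. The names ev_put_u64, ev_get_u64 and ev_build,
and the attribute numbers, are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ids, mirroring DRM_ATTR_ERROR_ID / DRM_ATTR_ERROR_VALUE. */
enum { EV_ATTR_ERROR_ID = 4, EV_ATTR_ERROR_VALUE = 5 };

/* Same shape as a netlink attribute header: length, then type. */
struct ev_attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define EV_ALIGN(n) (((n) + 3u) & ~3u)

/* Append one u64 attribute; returns the new offset, or 0 if it does not fit. */
static size_t ev_put_u64(uint8_t *buf, size_t cap, size_t off,
			 uint16_t type, uint64_t val)
{
	struct ev_attr_hdr hdr;
	size_t need;

	hdr.len = sizeof(struct ev_attr_hdr) + sizeof(val);
	hdr.type = type;
	need = EV_ALIGN(hdr.len);

	if (off + need > cap)
		return 0;
	memcpy(buf + off, &hdr, sizeof(hdr));
	memcpy(buf + off + sizeof(hdr), &val, sizeof(val));
	return off + need;
}

/* Find a u64 attribute by type; returns 1 and fills *val on success. */
static int ev_get_u64(const uint8_t *buf, size_t len, uint16_t type,
		      uint64_t *val)
{
	size_t off = 0;

	while (off + sizeof(struct ev_attr_hdr) <= len) {
		struct ev_attr_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || off + hdr.len > len)
			return 0;	/* malformed attribute */
		if (hdr.type == type && hdr.len == sizeof(hdr) + sizeof(*val)) {
			memcpy(val, buf + off + sizeof(hdr), sizeof(*val));
			return 1;
		}
		off += EV_ALIGN(hdr.len);
	}
	return 0;
}

/* Build an event payload carrying the error id and its counter together. */
static size_t ev_build(uint8_t *buf, size_t cap, uint64_t id, uint64_t value)
{
	size_t off = ev_put_u64(buf, cap, 0, EV_ATTR_ERROR_ID, id);

	if (!off)
		return 0;
	return ev_put_u64(buf, cap, off, EV_ATTR_ERROR_VALUE, value);
}
```

A listener that receives such a payload can call ev_get_u64() for each
attribute it cares about and would not need a follow-up READ_ONE request.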
> 
> The current implementation seems intended more to errors (and quite 
> "tailored" to Xe needs ...), while in habanalabs we would need it also 
> for non-error static/dynamic info.
> Maybe we should revise the existing commands/attributes to be more generic?

Correct, at present that is the use case the xe driver has, and at least
the error part I believe is generic; if not, we can make it so, since
the framework is extensible. The idea I had was that generic commands
which every driver can use will be part of the drm framework, and any
driver-specific commands or attributes will be part of the driver. But
some thought is needed here, as the MAX attribute value is needed by
userspace, and we need to decide how to define the attribute policy, etc.
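
One possible numbering scheme, sketched here only to make the MAX
question concrete: driver-private attributes continue after the generic
ones, so the size of a userspace attribute table depends on which
driver's header it was built against. The example_drv names and the
temperature/power attributes are hypothetical and are not part of this
series.

```c
#include <assert.h>

/* Generic attributes, as in the proposed drm_netlink.h. */
enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/* Hypothetical driver-private attributes, numbered after the generic ones. */
enum example_drv_attr {
	EXAMPLE_DRV_ATTR_TEMPERATURE = DRM_ATTR_MAX + 1,
	EXAMPLE_DRV_ATTR_POWER,

	__EXAMPLE_DRV_ATTR_MAX,
	EXAMPLE_DRV_ATTR_MAX = __EXAMPLE_DRV_ATTR_MAX - 1,
};

/* Returns 1 for a generic attribute, 2 for a driver-private one, 0 otherwise. */
static int attr_owner(int attr)
{
	if (attr > DRM_ATTR_UNSPEC && attr <= DRM_ATTR_MAX)
		return 1;
	if (attr > DRM_ATTR_MAX && attr <= EXAMPLE_DRV_ATTR_MAX)
		return 2;
	return 0;
}
```

With this layout a generic tool can size its table with DRM_ATTR_MAX + 1,
while a driver-aware tool needs EXAMPLE_DRV_ATTR_MAX + 1, which is the
point that needs to be settled.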

> 
> Moreover, the drm part is very small, while most of the netlink "mess" 
> is still done by the specific driver.
> So what is the added value in making it a "drm framework"? Do we enforce 
> something here for drm drivers that use it? Do we help them with simpler 
> APIs and hiding the internals of netlink?
> Maybe it would be worth moving some functionality from the Xe driver
> into drm helpers?

Your suggestion sounds good and interesting, but it needs some analysis:
for example, if we move the registration parts to the drm framework, how
would we register driver-private commands and attributes, if there are
any? But yes, having most of it at the drm level helps all the drivers.
I'll do some analysis and come back on this.

Thanks,
Aravind.

> 
> Thanks,
> Tomer
> 
> 


* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-06-04 17:07   ` [Intel-xe] " Tomer Tayar
@ 2023-06-05 17:18     ` Iddamsetty, Aravind
  2023-06-06 14:04       ` Tomer Tayar
  0 siblings, 1 reply; 18+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-05 17:18 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay



On 04-06-2023 22:37, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Define the netlink commands and attributes that can be commonly used
>> across by drm drivers.
>>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> ---
>>   include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 68 insertions(+)
>>   create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..28e7a334d0c7
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_netlink.h
>> @@ -0,0 +1,68 @@
>> +/*
>> + * Copyright 2023 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
>> + */
>> +
>> +#ifndef _DRM_NETLINK_H_
>> +#define _DRM_NETLINK_H_
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/genetlink.h>
>> +#include <net/sock.h>
> 
> This is a uapi header.
> Are all header files here available for user?

No, they are not. I later came to know that a uapi header should not
include anything that userspace can't use, so I will split the header
into two.
> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?

Yes, will add that.
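
A rough sketch of what the userspace-visible half could look like after
the split, assuming the kernel-only includes and the nla_policy tables
move to a separate kernel header. The enums are copied from the patch;
the drm_cmd_is_valid() helper at the end is a hypothetical consumer
added only to exercise the header.

```c
#ifndef _DRM_NETLINK_H_
#define _DRM_NETLINK_H_

#if defined(__cplusplus)
extern "C" {
#endif

#define DRM_GENL_VERSION 1

enum error_cmds {
	DRM_CMD_UNSPEC,
	/* command to list all error names with config-id */
	DRM_CMD_QUERY,
	/* command to get a counter for a specific error */
	DRM_CMD_READ_ONE,
	/* command to get counters of all errors */
	DRM_CMD_READ_ALL,

	__DRM_CMD_MAX,
	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST, /* NLA_U8 */
	DRM_ATTR_QUERY_REPLY, /* NLA_NESTED */
	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
	DRM_ATTR_ERROR_ID, /* NLA_U64 */
	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

#if defined(__cplusplus)
}
#endif

#endif /* _DRM_NETLINK_H_ */

/* Consumer side (not part of the header); assert.h is for checks only. */
#include <assert.h>

/* Returns 1 if cmd is one of the defined commands. */
static int drm_cmd_is_valid(int cmd)
{
	return cmd > DRM_CMD_UNSPEC && cmd <= DRM_CMD_MAX;
}
```

Nothing in the header part depends on kernel-internal headers, so it can
be compiled as-is by a C or C++ userspace tool.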
> 
>> +
>> +#define DRM_GENL_VERSION 1
>> +
>> +enum error_cmds {
>> +	DRM_CMD_UNSPEC,
>> +	/* command to list all errors names with config-id */
>> +	DRM_CMD_QUERY,
>> +	/* command to get a counter for a specific error */
>> +	DRM_CMD_READ_ONE,
>> +	/* command to get counters of all errors */
>> +	DRM_CMD_READ_ALL,
>> +
>> +	__DRM_CMD_MAX,
>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>> +};
>> +
>> +enum error_attr {
>> +	DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_REQUEST, /* NLA_U8 */
>> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
> 
> Missing spaces in /*NLA_NESTED*/
> 
>> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
>> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
>> +
>> +	__DRM_ATTR_MAX,
>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>> +};
>> +
>> +/* attribute policies */
>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
>> +};
> 
> Should these policies structures be in uapi?

Yes, these will also likely move into a separate drm header, as
userspace would define its own policy.
> 
>> +
>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
>> +};
> 
> I might be missing something here, but why is it not a single policy structure
> with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?

Each command can have its own policy, i.e. we define only the attributes
that command expects, which also gives us a validity check on incoming
messages. In the present implementation DRM_CMD_QUERY and DRM_CMD_READ_ALL
expect only DRM_ATTR_REQUEST, while DRM_CMD_READ_ONE expects only
DRM_ATTR_ERROR_ID as part of the incoming message from the user.
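
To illustrate the per-command idea, here is a simplified userspace model.
It does not use the kernel's struct nla_policy or its validation code; the
sketch_* names and SK_* types are hypothetical stand-ins, and only the enums
are taken from the patch:

```c
#include <stdbool.h>
#include <stddef.h>

enum error_cmds {
	DRM_CMD_UNSPEC,
	DRM_CMD_QUERY,
	DRM_CMD_READ_ONE,
	DRM_CMD_READ_ALL,
	__DRM_CMD_MAX,
	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,
	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/* Hypothetical stand-in for the kernel's NLA_* attribute types. */
enum sketch_type { SK_UNSPEC = 0, SK_U8, SK_U64 };

struct sketch_policy {
	enum sketch_type type;
};

/* One table per command; entries not listed stay SK_UNSPEC. */
const struct sketch_policy sketch_policy_query[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_REQUEST] = { .type = SK_U8 },
};

const struct sketch_policy sketch_policy_read_one[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_ERROR_ID] = { .type = SK_U64 },
};

/* Map a command to its policy table, mirroring the ops array in the patch. */
const struct sketch_policy *sketch_policy_for(int cmd)
{
	switch (cmd) {
	case DRM_CMD_QUERY:
	case DRM_CMD_READ_ALL:
		return sketch_policy_query;
	case DRM_CMD_READ_ONE:
		return sketch_policy_read_one;
	default:
		return NULL;
	}
}

/* An attribute is accepted only if this command's table lists it with a matching type. */
bool sketch_attr_allowed(int cmd, int attr, enum sketch_type type)
{
	const struct sketch_policy *p = sketch_policy_for(cmd);

	if (!p || attr <= DRM_ATTR_UNSPEC || attr > DRM_ATTR_MAX)
		return false;

	return p[attr].type != SK_UNSPEC && p[attr].type == type;
}
```

With separate tables, a DRM_ATTR_ERROR_ID sent along with DRM_CMD_QUERY is
rejected in this model, whereas a single merged table would accept it for
every command.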

Thanks,
Aravind.
> 
> Thanks,
> Tomer
> 
>> +
>> +#endif
> 
> 


* Re: [Intel-xe] [RFC 2/5] drm/xe/RAS: Register a genl netlink family
  2023-06-04 17:09   ` [Intel-xe] " Tomer Tayar
@ 2023-06-05 17:21     ` Iddamsetty, Aravind
  0 siblings, 0 replies; 18+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-05 17:21 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay



On 04-06-2023 22:39, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Use the generic netlink(genl) subsystem to expose the RAS counters to
>> userspace. We define a family structure and operations and register to
>> genl subsystem and these callbacks will be invoked by genl subsystem when
>> userspace sends a registered command with a family identifier to genl
>> subsystem.
>>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> ---
>>   drivers/gpu/drm/xe/Makefile          |  1 +
>>   drivers/gpu/drm/xe/xe_device.c       |  3 +
>>   drivers/gpu/drm/xe/xe_device_types.h |  2 +
>>   drivers/gpu/drm/xe/xe_module.c       |  2 +
>>   drivers/gpu/drm/xe/xe_netlink.c      | 89 ++++++++++++++++++++++++++++
>>   drivers/gpu/drm/xe/xe_netlink.h      | 14 +++++
>>   6 files changed, 111 insertions(+)
>>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>>
>> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
>> index b84e191ba14f..2b42165bc824 100644
>> --- a/drivers/gpu/drm/xe/Makefile
>> +++ b/drivers/gpu/drm/xe/Makefile
>> @@ -67,6 +67,7 @@ xe-y += xe_bb.o \
>>   	xe_mmio.o \
>>   	xe_mocs.o \
>>   	xe_module.o \
>> +	xe_netlink.o \
>>   	xe_pat.o \
>>   	xe_pci.o \
>>   	xe_pcode.o \
>> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
>> index 323356a44e7f..aa12ef12d9dc 100644
>> --- a/drivers/gpu/drm/xe/xe_device.c
>> +++ b/drivers/gpu/drm/xe/xe_device.c
>> @@ -24,6 +24,7 @@
>>   #include "xe_irq.h"
>>   #include "xe_mmio.h"
>>   #include "xe_module.h"
>> +#include "xe_netlink.h"
>>   #include "xe_pcode.h"
>>   #include "xe_pm.h"
>>   #include "xe_query.h"
>> @@ -317,6 +318,8 @@ int xe_device_probe(struct xe_device *xe)
>>   
>>   	xe_display_register(xe);
>>   
>> +	xe_genl_register(xe);
> 
> xe_genl_register() can fail

That is right, but I didn't want to fail the driver load: a registration
failure does not impact any device functionality, it only means the
observability interface is unavailable. Hence only a warning,
"xe genl family registration failed", is printed.
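
The intended behaviour can be modelled in plain userspace C as below. The
sketch_* names and the genl_fails switch are hypothetical and exist only for
illustration; this is not the driver code:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct sketch_device {
	bool genl_registered;
};

/* Hypothetical stand-in for xe_genl_register(); 'fail' simulates an error. */
int sketch_genl_register(struct sketch_device *dev, bool fail)
{
	if (fail) {
		/* Warn, but leave the decision about fatality to the caller. */
		fprintf(stderr, "xe genl family registration failed\n");
		return -ENOMEM;
	}

	dev->genl_registered = true;
	return 0;
}

/* Hypothetical probe path: the registration result is deliberately ignored. */
int sketch_device_probe(struct sketch_device *dev, bool genl_fails)
{
	dev->genl_registered = false;

	/* Observability is optional, so a failure here must not abort probe. */
	(void)sketch_genl_register(dev, genl_fails);

	return 0;
}
```

Probe succeeds either way in this model; only the genl_registered flag
records whether the RAS interface is available.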
> 
>> +
>>   	xe_debugfs_register(xe);
>>   
>>   	err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe);
>> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> index 682ebdd1c09e..c9612a54c48f 100644
>> --- a/drivers/gpu/drm/xe/xe_device_types.h
>> +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> @@ -10,6 +10,7 @@
>>   
>>   #include <drm/drm_device.h>
>>   #include <drm/drm_file.h>
>> +#include <drm/drm_netlink.h>
>>   #include <drm/ttm/ttm_device.h>
>>   
>>   #include "xe_gt_types.h"
>> @@ -347,6 +348,7 @@ struct xe_device {
>>   		u32 lvds_channel_mode;
>>   	} params;
>>   #endif
>> +	struct genl_family xe_genl_family;
> 
> Should it be added above, before the "private" section?
> Maybe add a kernel-doc comment for it?

Thanks for pointing that out, will move it there.

> 
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
>> index 6860586ce7f8..1eb73eb9a9a5 100644
>> --- a/drivers/gpu/drm/xe/xe_module.c
>> +++ b/drivers/gpu/drm/xe/xe_module.c
>> @@ -11,6 +11,7 @@
>>   #include "xe_drv.h"
>>   #include "xe_hw_fence.h"
>>   #include "xe_module.h"
>> +#include "xe_netlink.h"
>>   #include "xe_pci.h"
>>   #include "xe_sched_job.h"
>>   
>> @@ -67,6 +68,7 @@ static void __exit xe_exit(void)
>>   {
>>   	int i;
>>   
>> +	xe_genl_cleanup();
>>   	xe_unregister_pci_driver();
>>   
>>   	for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--)
>> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c
>> new file mode 100644
>> index 000000000000..63ef238ebc27
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_netlink.c
>> @@ -0,0 +1,89 @@
>> +// SPDX-License-Identifier: MIT
>> +/*
>> + * Copyright © 2023 Intel Corporation
>> + */
>> +
>> +#include <drm/drm_managed.h>
>> +
>> +#include "xe_device.h"
>> +
>> +DEFINE_XARRAY(xe_xarray);
> 
> xe_xarray sounds too generic. Maybe it should be more specific like 
> xe_genl_xarray?
> In addition, it should probably be static.

Ok.

Thanks,
Aravind.
> 
> Thanks,
> Tomer
> 
>> +
>> +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info)
>> +{
>> +	return 0;
>> +}
>> +
>> +/* operations definition */
>> +static const struct genl_ops xe_genl_ops[] = {
>> +	{
>> +		.cmd = DRM_CMD_QUERY,
>> +		.doit = xe_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +	{
>> +		.cmd = DRM_CMD_READ_ONE,
>> +		.doit = xe_genl_read_error,
>> +		.policy = drm_attr_policy_read_one,
>> +	},
>> +	{
>> +		.cmd = DRM_CMD_READ_ALL,
>> +		.doit = xe_genl_list_errors,
>> +		.policy = drm_attr_policy_query,
>> +	},
>> +};
>> +
>> +static void xe_genl_deregister(struct drm_device *dev,  void *arg)
>> +{
>> +	struct xe_device *xe = arg;
>> +
>> +	xa_erase(&xe_xarray, xe->xe_genl_family.id);
>> +
>> +	drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name);
>> +
>> +	genl_unregister_family(&xe->xe_genl_family);
>> +}
>> +
>> +static void xe_genl_family_init(struct xe_device *xe)
>> +{
>> +	/* Use drm primary node name eg: card0 to name the genl family */
>> +	snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name);
>> +	xe->xe_genl_family.version = DRM_GENL_VERSION;
>> +	xe->xe_genl_family.parallel_ops = true;
>> +	xe->xe_genl_family.ops = xe_genl_ops;
>> +	xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops);
>> +	xe->xe_genl_family.maxattr = DRM_ATTR_MAX;
>> +	xe->xe_genl_family.module = THIS_MODULE;
>> +}
>> +
>> +int xe_genl_register(struct xe_device *xe)
>> +{
>> +	int ret;
>> +
>> +	xe_genl_family_init(xe);
>> +
>> +	ret = genl_register_family(&xe->xe_genl_family);
>> +	if (ret < 0) {
>> +		drm_warn(&xe->drm, "xe genl family registration failed\n");
>> +		return ret;
>> +	}
>> +
>> +	drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name);
>> +
>> +	xa_store(&xe_xarray, xe->xe_genl_family.id, xe, GFP_KERNEL);
>> +
>> +	ret = drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe);
>> +
>> +	return ret;
>> +}
>> +
>> +void xe_genl_cleanup(void)
>> +{
>> +	/* destroy xarray */
>> +	xa_destroy(&xe_xarray);
>> +}
>> diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h
>> new file mode 100644
>> index 000000000000..3bbddb620539
>> --- /dev/null
>> +++ b/drivers/gpu/drm/xe/xe_netlink.h
>> @@ -0,0 +1,14 @@
>> +/* SPDX-License-Identifier: MIT */
>> +/*
>> + * Copyright © 2021 Intel Corporation
>> + */
>> +
>> +#ifndef _XE_GENL_H_
>> +#define _XE_GENL_H_
>> +
>> +#include "xe_device.h"
>> +
>> +int xe_genl_register(struct xe_device *xe);
>> +void xe_genl_cleanup(void);
>> +
>> +#endif
> 
> 


* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-06-05 16:47 ` Alex Deucher
@ 2023-06-06 11:56   ` Iddamsetty, Aravind
  0 siblings, 0 replies; 18+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-06 11:56 UTC (permalink / raw)
  To: Alex Deucher, Hawking Zhang, Harish Kasiviswanathan, Kuehling,
	Felix, Tuikov, Luben
  Cc: alexander.deucher, ogabbay, intel-xe, dri-devel



On 05-06-2023 22:17, Alex Deucher wrote:
> Adding the relevant AMD folks for RAS.  We currently expose RAS via
> sysfs, but also have an event interface in KFD which may be somewhat
> similar to this.
> 
> If we were to converge on a common RAS interface, would we want to
> look at any commonality in bad page storage/reporting for device
> memory?

Could you please elaborate a bit on this?

Thanks,
Aravind.
> 
> Alex
> 
> On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty
> <aravind.iddamsetty@intel.com> wrote:
>>
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally there were
>> being exposed via PMU (for relative counters) and sysfs interface (for
>> absolute value) in our internal branch. But, due to the limitations in
>> this approach to use two interfaces and also not able to have an event
>> based reporting or configurability, an alternative approach to try
>> netlink was suggested by community for drm subsystem wide UAPI for RAS
>> and telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported to which userpace can subscribe to and
>> be notified when any error occurs and read the error counter this avoids
>> continuous polling on error counter. This can also be extended to
>> threshold based notification.
>>
>> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of https://patchwork.freedesktop.org/series/116181/
>>
>> Below is an example tool drm_ras which demonstrates the use of the
>> supported commands. The tool will be sent to ML with the subject
>> "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>>
>> read single error counter:
>>
>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
>> counter value 0
>>
>> read all error counters:
>>
>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
>> name                                                    config-id               counter
>>
>> error-gt0-correctable-guc                               0x0000000000000001      0
>> error-gt0-correctable-slm                               0x0000000000000003      0
>> error-gt0-correctable-eu-ic                             0x0000000000000004      0
>> error-gt0-correctable-eu-grf                            0x0000000000000005      0
>> error-gt0-fatal-guc                                     0x0000000000000009      0
>> error-gt0-fatal-slm                                     0x000000000000000d      0
>> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
>> error-gt0-fatal-fpu                                     0x0000000000000010      0
>> error-gt0-fatal-tlb                                     0x0000000000000011      0
>> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
>> error-gt0-correctable-subslice                          0x0000000000000013      0
>> error-gt0-correctable-l3bank                            0x0000000000000014      0
>> error-gt0-fatal-subslice                                0x0000000000000015      0
>> error-gt0-fatal-l3bank                                  0x0000000000000016      0
>> error-gt0-sgunit-correctable                            0x0000000000000017      0
>> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
>> error-gt0-sgunit-fatal                                  0x0000000000000019      0
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
>> error-gt0-soc-fatal-punit                               0x000000000000001d      0
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
>> error-gt1-correctable-guc                               0x1000000000000001      0
>> error-gt1-correctable-slm                               0x1000000000000003      0
>> error-gt1-correctable-eu-ic                             0x1000000000000004      0
>> error-gt1-correctable-eu-grf                            0x1000000000000005      0
>> error-gt1-fatal-guc                                     0x1000000000000009      0
>> error-gt1-fatal-slm                                     0x100000000000000d      0
>> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
>> error-gt1-fatal-fpu                                     0x1000000000000010      0
>> error-gt1-fatal-tlb                                     0x1000000000000011      0
>> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
>> error-gt1-correctable-subslice                          0x1000000000000013      0
>> error-gt1-correctable-l3bank                            0x1000000000000014      0
>> error-gt1-fatal-subslice                                0x1000000000000015      0
>> error-gt1-fatal-l3bank                                  0x1000000000000016      0
>> error-gt1-sgunit-correctable                            0x1000000000000017      0
>> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
>> error-gt1-sgunit-fatal                                  0x1000000000000019      0
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
>> error-gt1-soc-fatal-punit                               0x100000000000001d      0
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
>>
>> wait on a error event:
>>
>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
>> waiting for error event
>> error event received
>> counter value 0
>>
>> list all errors:
>>
>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
>> name                                                    config-id
>>
>> error-gt0-correctable-guc                               0x0000000000000001
>> error-gt0-correctable-slm                               0x0000000000000003
>> error-gt0-correctable-eu-ic                             0x0000000000000004
>> error-gt0-correctable-eu-grf                            0x0000000000000005
>> error-gt0-fatal-guc                                     0x0000000000000009
>> error-gt0-fatal-slm                                     0x000000000000000d
>> error-gt0-fatal-eu-grf                                  0x000000000000000f
>> error-gt0-fatal-fpu                                     0x0000000000000010
>> error-gt0-fatal-tlb                                     0x0000000000000011
>> error-gt0-fatal-l3-fabric                               0x0000000000000012
>> error-gt0-correctable-subslice                          0x0000000000000013
>> error-gt0-correctable-l3bank                            0x0000000000000014
>> error-gt0-fatal-subslice                                0x0000000000000015
>> error-gt0-fatal-l3bank                                  0x0000000000000016
>> error-gt0-sgunit-correctable                            0x0000000000000017
>> error-gt0-sgunit-nonfatal                               0x0000000000000018
>> error-gt0-sgunit-fatal                                  0x0000000000000019
>> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
>> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
>> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
>> error-gt0-soc-fatal-punit                               0x000000000000001d
>> error-gt0-soc-fatal-psf-0                               0x000000000000001e
>> error-gt0-soc-fatal-psf-1                               0x000000000000001f
>> error-gt0-soc-fatal-psf-2                               0x0000000000000020
>> error-gt0-soc-fatal-cd0                                 0x0000000000000021
>> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
>> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
>> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
>> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
>> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
>> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
>> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
>> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
>> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
>> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
>> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
>> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
>> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
>> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
>> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
>> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
>> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
>> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
>> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
>> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
>> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
>> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
>> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
>> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
>> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
>> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
>> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
>> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
>> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
>> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
>> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
>> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
>> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
>> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
>> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
>> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
>> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
>> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
>> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
>> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
>> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
>> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
>> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
>> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
>> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
>> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
>> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
>> error-gt1-correctable-guc                               0x1000000000000001
>> error-gt1-correctable-slm                               0x1000000000000003
>> error-gt1-correctable-eu-ic                             0x1000000000000004
>> error-gt1-correctable-eu-grf                            0x1000000000000005
>> error-gt1-fatal-guc                                     0x1000000000000009
>> error-gt1-fatal-slm                                     0x100000000000000d
>> error-gt1-fatal-eu-grf                                  0x100000000000000f
>> error-gt1-fatal-fpu                                     0x1000000000000010
>> error-gt1-fatal-tlb                                     0x1000000000000011
>> error-gt1-fatal-l3-fabric                               0x1000000000000012
>> error-gt1-correctable-subslice                          0x1000000000000013
>> error-gt1-correctable-l3bank                            0x1000000000000014
>> error-gt1-fatal-subslice                                0x1000000000000015
>> error-gt1-fatal-l3bank                                  0x1000000000000016
>> error-gt1-sgunit-correctable                            0x1000000000000017
>> error-gt1-sgunit-nonfatal                               0x1000000000000018
>> error-gt1-sgunit-fatal                                  0x1000000000000019
>> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
>> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
>> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
>> error-gt1-soc-fatal-punit                               0x100000000000001d
>> error-gt1-soc-fatal-psf-0                               0x100000000000001e
>> error-gt1-soc-fatal-psf-1                               0x100000000000001f
>> error-gt1-soc-fatal-psf-2                               0x1000000000000020
>> error-gt1-soc-fatal-cd0                                 0x1000000000000021
>> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
>> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
>> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
>> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
>> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
>> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
>> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
>> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
>> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
>> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
>> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
>> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
>> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
>> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
>> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
>> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
>> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
>> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
>> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
>> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
>> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
>> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
>> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
>> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
>> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
>> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
>> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
>> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
>> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
>> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
>> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
>> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
>> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
>> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
>> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Daniel Vetter <daniel@ffwll.ch>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Oded Gabbay <ogabbay@kernel.org>
>>
>>
>> Aravind Iddamsetty (5):
>>   drm/netlink: Add netlink infrastructure
>>   drm/xe/RAS: Register a genl netlink family
>>   drm/xe/RAS: Expose the error counters
>>   drm/netlink: define multicast groups
>>   drm/xe/RAS: send multicast event on occurrence of an error
>>
>>  drivers/gpu/drm/xe/Makefile          |   1 +
>>  drivers/gpu/drm/xe/xe_device.c       |   3 +
>>  drivers/gpu/drm/xe/xe_device_types.h |   2 +
>>  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
>>  drivers/gpu/drm/xe/xe_module.c       |   2 +
>>  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
>>  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
>>  include/uapi/drm/drm_netlink.h       |  81 +++++
>>  include/uapi/drm/xe_drm.h            |  64 ++++
>>  9 files changed, 725 insertions(+)
>>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> --
>> 2.25.1
>>

* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-06-05 17:18     ` Iddamsetty, Aravind
@ 2023-06-06 14:04       ` Tomer Tayar
  2023-06-21  6:40         ` Iddamsetty, Aravind
  0 siblings, 1 reply; 18+ messages in thread
From: Tomer Tayar @ 2023-06-06 14:04 UTC (permalink / raw)
  To: Iddamsetty, Aravind, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 05/06/2023 20:18, Iddamsetty, Aravind wrote:
>
> On 04-06-2023 22:37, Tomer Tayar wrote:
>> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>>> Define the netlink commands and attributes that can be commonly used
>>> across by drm drivers.
>>>
>>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>> ---
>>>    include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>>>    1 file changed, 68 insertions(+)
>>>    create mode 100644 include/uapi/drm/drm_netlink.h
>>>
>>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>>> new file mode 100644
>>> index 000000000000..28e7a334d0c7
>>> --- /dev/null
>>> +++ b/include/uapi/drm/drm_netlink.h
>>> @@ -0,0 +1,68 @@
>>> +/*
>>> + * Copyright 2023 Intel Corporation
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>>> + * OTHER DEALINGS IN THE SOFTWARE.
>>> + */
>>> +
>>> +#ifndef _DRM_NETLINK_H_
>>> +#define _DRM_NETLINK_H_
>>> +
>>> +#include <linux/netdevice.h>
>>> +#include <net/genetlink.h>
>>> +#include <net/sock.h>
>> This is a uapi header.
>> Are all header files here available for user?
> No, they are not. I later learned that a uapi header should not include
> anything userspace can't use, so I will split the header into two.
>> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?
> ya will add that
>>> +
>>> +#define DRM_GENL_VERSION 1
>>> +
>>> +enum error_cmds {
>>> +	DRM_CMD_UNSPEC,
>>> +	/* command to list all errors names with config-id */
>>> +	DRM_CMD_QUERY,
>>> +	/* command to get a counter for a specific error */
>>> +	DRM_CMD_READ_ONE,
>>> +	/* command to get counters of all errors */
>>> +	DRM_CMD_READ_ALL,
>>> +
>>> +	__DRM_CMD_MAX,
>>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>>> +};
>>> +
>>> +enum error_attr {
>>> +	DRM_ATTR_UNSPEC,
>>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>>> +	DRM_ATTR_REQUEST, /* NLA_U8 */
>>> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>> Missing spaces in /*NLA_NESTED*/
>>
>>> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>>> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
>>> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
>>> +
>>> +	__DRM_ATTR_MAX,
>>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>>> +};
>>> +
>>> +/* attribute policies */
>>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
>>> +};
>> Should these policies structures be in uapi?
> Yes, these will also likely move into a separate drm header, as
> userspace would define its own policy.
>>> +
>>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
>>> +};
>> I might be missing something here, but why is it not a single policy
>> structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?
> Each command can have its own policy defined, i.e. we define only the
> attributes it expects, which also gives us a validity check. In the
> present implementation, DRM_CMD_QUERY and DRM_CMD_READ_ALL expect only
> DRM_ATTR_REQUEST, while DRM_CMD_READ_ONE expects only DRM_ATTR_ERROR_ID
> as part of the incoming message from the user.
>
> Thanks,
> Aravind.

But "struct genl_ops" expects a pointer to "struct nla_policy", and in
the definition of "xe_genl_ops", each entry is set with a pointer to
one of these arrays of "struct nla_policy".
Won't they use the first entry (DRM_ATTR_UNSPEC) of the arrays?
Shouldn't we use the array entries at indices DRM_ATTR_REQUEST and
DRM_ATTR_ERROR_ID there?
If so, can't we have a single policy array in which each entry defines
the policy for one attribute, and use the suitable policy entry for each
command?

Thanks,
Tomer

>> Thanks,
>> Tomer
>>
>>> +
>>> +#endif
>>


* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-06-06 14:04       ` Tomer Tayar
@ 2023-06-21  6:40         ` Iddamsetty, Aravind
  0 siblings, 0 replies; 18+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-21  6:40 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay



On 06-06-2023 19:34, Tomer Tayar wrote:
> On 05/06/2023 20:18, Iddamsetty, Aravind wrote:
>>
>> On 04-06-2023 22:37, Tomer Tayar wrote:
>>> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>>>> Define the netlink commands and attributes that can be commonly used
>>>> across by drm drivers.
>>>>
>>>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>>> ---
>>>>    include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 68 insertions(+)
>>>>    create mode 100644 include/uapi/drm/drm_netlink.h
>>>>
>>>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>>>> new file mode 100644
>>>> index 000000000000..28e7a334d0c7
>>>> --- /dev/null
>>>> +++ b/include/uapi/drm/drm_netlink.h
>>>> @@ -0,0 +1,68 @@
>>>> +/*
>>>> + * Copyright 2023 Intel Corporation
>>>> + *
>>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>>> + * copy of this software and associated documentation files (the "Software"),
>>>> + * to deal in the Software without restriction, including without limitation
>>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>>> + * Software is furnished to do so, subject to the following conditions:
>>>> + *
>>>> + * The above copyright notice and this permission notice (including the next
>>>> + * paragraph) shall be included in all copies or substantial portions of the
>>>> + * Software.
>>>> + *
>>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>>>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>>>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>>>> + * OTHER DEALINGS IN THE SOFTWARE.
>>>> + */
>>>> +
>>>> +#ifndef _DRM_NETLINK_H_
>>>> +#define _DRM_NETLINK_H_
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +#include <net/genetlink.h>
>>>> +#include <net/sock.h>
>>> This is a uapi header.
>>> Are all header files here available for user?
>> No, they are not. I later learned that a uapi header should not include
>> anything userspace can't use, so I will split the header into two.
>>> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?
>> ya will add that
>>>> +
>>>> +#define DRM_GENL_VERSION 1
>>>> +
>>>> +enum error_cmds {
>>>> +	DRM_CMD_UNSPEC,
>>>> +	/* command to list all errors names with config-id */
>>>> +	DRM_CMD_QUERY,
>>>> +	/* command to get a counter for a specific error */
>>>> +	DRM_CMD_READ_ONE,
>>>> +	/* command to get counters of all errors */
>>>> +	DRM_CMD_READ_ALL,
>>>> +
>>>> +	__DRM_CMD_MAX,
>>>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>>>> +};
>>>> +
>>>> +enum error_attr {
>>>> +	DRM_ATTR_UNSPEC,
>>>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>>>> +	DRM_ATTR_REQUEST, /* NLA_U8 */
>>>> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>>> Missing spaces in /*NLA_NESTED*/
>>>
>>>> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>>>> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
>>>> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
>>>> +
>>>> +	__DRM_ATTR_MAX,
>>>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>>>> +};
>>>> +
>>>> +/* attribute policies */
>>>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>>>> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
>>>> +};
>>> Should these policies structures be in uapi?
>> Yes, these will also likely move into a separate drm header, as
>> userspace would define its own policy.
>>>> +
>>>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>>>> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
>>>> +};
>>> I might be missing something here, but why is it not a single policy
>>> structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?
>> Each command can have its own policy defined, i.e. we define only the
>> attributes it expects, which also gives us a validity check. In the
>> present implementation, DRM_CMD_QUERY and DRM_CMD_READ_ALL expect only
>> DRM_ATTR_REQUEST, while DRM_CMD_READ_ONE expects only DRM_ATTR_ERROR_ID
>> as part of the incoming message from the user.
>>
>> Thanks,
>> Aravind.
> 
> But "struct genl_ops" expects a pointer to "struct nla_policy", and in
> the definition of "xe_genl_ops", each entry is set with a pointer to
> one of these arrays of "struct nla_policy".
> Won't they use the first entry (DRM_ATTR_UNSPEC) of the arrays?
> Shouldn't we use the array entries at indices DRM_ATTR_REQUEST and
> DRM_ATTR_ERROR_ID there?
> If so, can't we have a single policy array in which each entry defines
> the policy for one attribute, and use the suitable policy entry for each
> command?
Hi Tomer,

Apologies for the late reply.

A command can accept more than one attribute, so the genl netlink core
validates each attribute passed in the received message by checking it
against the policy array in the command definition.
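
To illustrate the point about the policy array, here is a minimal
userspace sketch of attribute validation. The two enums are copied from
the patch; everything prefixed with mock_ (the policy struct, the type
constants, the attribute struct and the validate function) is a
hypothetical stand-in for illustration, not the kernel API. The sketch
assumes strict validation, i.e. an attribute without a policy entry is
rejected. What it shows is that the pointer handed to the command is the
base of the array, and the entry is picked by indexing with the type of
each incoming attribute, not by always using entry 0.

```c
#include <assert.h>
#include <stddef.h>

/* Commands and attributes as defined in the patch under review. */
enum error_cmds {
	DRM_CMD_UNSPEC,
	DRM_CMD_QUERY,
	DRM_CMD_READ_ONE,
	DRM_CMD_READ_ALL,

	__DRM_CMD_MAX,
	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/* Hypothetical stand-ins for the kernel's NLA_* types and nla_policy. */
enum mock_nla_type { MOCK_NLA_UNSPEC, MOCK_NLA_U8, MOCK_NLA_U64 };

struct mock_nla_policy {
	enum mock_nla_type type;
};

/* Hypothetical parsed attribute: its type and payload length in bytes. */
struct mock_attr {
	int type;
	size_t len;
};

/* One policy array per command, as in the patch. */
static const struct mock_nla_policy policy_query[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_REQUEST] = { .type = MOCK_NLA_U8 },
};

static const struct mock_nla_policy policy_read_one[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_ERROR_ID] = { .type = MOCK_NLA_U64 },
};

/*
 * Validate every attribute of a message against a policy array.
 * The array is indexed by attribute type; returns 0 on success, -1 if
 * an attribute is out of range, has no policy entry, or has a payload
 * length that does not match the declared type.
 */
static int mock_validate(const struct mock_nla_policy *policy, int maxtype,
			 const struct mock_attr *attrs, size_t n_attrs)
{
	assert(policy != NULL);

	for (size_t i = 0; i < n_attrs; i++) {
		int type = attrs[i].type;

		if (type <= 0 || type > maxtype)
			return -1;

		switch (policy[type].type) {
		case MOCK_NLA_U8:
			if (attrs[i].len != 1)
				return -1;
			break;
		case MOCK_NLA_U64:
			if (attrs[i].len != 8)
				return -1;
			break;
		default:
			/* No policy entry for this attribute: reject it. */
			return -1;
		}
	}
	return 0;
}
```

In this model, a message carrying DRM_ATTR_ERROR_ID passes
policy_read_one but fails policy_query, which is the per-command check
described above. A single merged array would accept both attributes for
every command.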

Thanks,
Aravind.


> 
> Thanks,
> Tomer
> 
>>> Thanks,
>>> Tomer
>>>
>>>> +
>>>> +#endif
>>>
> 

* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (6 preceding siblings ...)
  2023-06-05 16:47 ` Alex Deucher
@ 2023-06-21 17:24 ` Sebastian Wick
  2023-07-17 12:02   ` Oded Gabbay
  7 siblings, 1 reply; 18+ messages in thread
From: Sebastian Wick @ 2023-06-21 17:24 UTC (permalink / raw)
  To: Aravind Iddamsetty; +Cc: alexander.deucher, ogabbay, intel-xe, dri-devel

On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty
<aravind.iddamsetty@intel.com> wrote:
>
> Our hardware supports RAS (Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally these were
> exposed via PMU (for relative counters) and a sysfs interface (for
> absolute values) in our internal branch. However, that approach needs
> two interfaces and supports neither event-based reporting nor
> configurability, so the community suggested trying netlink as a
> drm-subsystem-wide UAPI for RAS and telemetry, as discussed in [1].
>
> This [1] is the inspiration for this series. It uses the generic
> netlink (genl) family subsystem and exposes a set of commands that can
> be used by every drm driver; the framework also provides a means to
> add custom commands. Each drm driver instance (in this example, the xe
> driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe and
> be notified when any error occurs, then read the error counter, which
> avoids continuous polling of the error counters. This can also be
> extended to threshold-based notification.

Be aware that netlink can be quite awkward in user space: a netlink
socket is tied to the network namespace (netns), while the device node
lives in the mount namespace, and netlink has special rules regarding
namespacing.

> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/116181/
>
> Below is an example tool drm_ras which demonstrates the use of the
> supported commands. The tool will be sent to ML with the subject
> "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name                                                    config-id               counter
>
> error-gt0-correctable-guc                               0x0000000000000001      0
> error-gt0-correctable-slm                               0x0000000000000003      0
> error-gt0-correctable-eu-ic                             0x0000000000000004      0
> error-gt0-correctable-eu-grf                            0x0000000000000005      0
> error-gt0-fatal-guc                                     0x0000000000000009      0
> error-gt0-fatal-slm                                     0x000000000000000d      0
> error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> error-gt0-fatal-fpu                                     0x0000000000000010      0
> error-gt0-fatal-tlb                                     0x0000000000000011      0
> error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> error-gt0-correctable-subslice                          0x0000000000000013      0
> error-gt0-correctable-l3bank                            0x0000000000000014      0
> error-gt0-fatal-subslice                                0x0000000000000015      0
> error-gt0-fatal-l3bank                                  0x0000000000000016      0
> error-gt0-sgunit-correctable                            0x0000000000000017      0
> error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> error-gt0-sgunit-fatal                                  0x0000000000000019      0
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> error-gt0-soc-fatal-punit                               0x000000000000001d      0
> error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> error-gt1-correctable-guc                               0x1000000000000001      0
> error-gt1-correctable-slm                               0x1000000000000003      0
> error-gt1-correctable-eu-ic                             0x1000000000000004      0
> error-gt1-correctable-eu-grf                            0x1000000000000005      0
> error-gt1-fatal-guc                                     0x1000000000000009      0
> error-gt1-fatal-slm                                     0x100000000000000d      0
> error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> error-gt1-fatal-fpu                                     0x1000000000000010      0
> error-gt1-fatal-tlb                                     0x1000000000000011      0
> error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> error-gt1-correctable-subslice                          0x1000000000000013      0
> error-gt1-correctable-l3bank                            0x1000000000000014      0
> error-gt1-fatal-subslice                                0x1000000000000015      0
> error-gt1-fatal-l3bank                                  0x1000000000000016      0
> error-gt1-sgunit-correctable                            0x1000000000000017      0
> error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> error-gt1-sgunit-fatal                                  0x1000000000000019      0
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> error-gt1-soc-fatal-punit                               0x100000000000001d      0
> error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
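
In the listing above, each gt1 config-id equals the matching gt0
config-id plus 0x1000000000000000. A minimal sketch of how a tool might
split such an id follows. The bit layout is an assumption inferred only
from this output (GT index in bits 60-63, error index in the lower 60
bits), and the sketch_ names are hypothetical; the series may define the
encoding differently.

```c
#include <stdint.h>

/* Assumption: the GT index occupies the top 4 bits of the config-id. */
#define SKETCH_GT_SHIFT 60
#define SKETCH_ERROR_MASK ((UINT64_C(1) << SKETCH_GT_SHIFT) - 1)

/* Extract the (assumed) GT index from a config-id. */
static inline unsigned int sketch_config_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> SKETCH_GT_SHIFT);
}

/* Extract the (assumed) per-GT error index from a config-id. */
static inline uint64_t sketch_config_error(uint64_t config_id)
{
	return config_id & SKETCH_ERROR_MASK;
}

/* Build a config-id from a GT index and an error index. */
static inline uint64_t sketch_config_make(unsigned int gt, uint64_t error)
{
	return ((uint64_t)gt << SKETCH_GT_SHIFT) | (error & SKETCH_ERROR_MASK);
}
```

Under that assumption, 0x1000000000000023 (error-gt1-soc-fatal-mdfi-east)
splits into GT 1 and error index 0x23, the same index that
error-gt0-soc-fatal-mdfi-east has on GT 0.
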
>
> wait on an error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name                                                    config-id
>
> error-gt0-correctable-guc                               0x0000000000000001
> error-gt0-correctable-slm                               0x0000000000000003
> error-gt0-correctable-eu-ic                             0x0000000000000004
> error-gt0-correctable-eu-grf                            0x0000000000000005
> error-gt0-fatal-guc                                     0x0000000000000009
> error-gt0-fatal-slm                                     0x000000000000000d
> error-gt0-fatal-eu-grf                                  0x000000000000000f
> error-gt0-fatal-fpu                                     0x0000000000000010
> error-gt0-fatal-tlb                                     0x0000000000000011
> error-gt0-fatal-l3-fabric                               0x0000000000000012
> error-gt0-correctable-subslice                          0x0000000000000013
> error-gt0-correctable-l3bank                            0x0000000000000014
> error-gt0-fatal-subslice                                0x0000000000000015
> error-gt0-fatal-l3bank                                  0x0000000000000016
> error-gt0-sgunit-correctable                            0x0000000000000017
> error-gt0-sgunit-nonfatal                               0x0000000000000018
> error-gt0-sgunit-fatal                                  0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> error-gt0-soc-fatal-punit                               0x000000000000001d
> error-gt0-soc-fatal-psf-0                               0x000000000000001e
> error-gt0-soc-fatal-psf-1                               0x000000000000001f
> error-gt0-soc-fatal-psf-2                               0x0000000000000020
> error-gt0-soc-fatal-cd0                                 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> error-gt1-correctable-guc                               0x1000000000000001
> error-gt1-correctable-slm                               0x1000000000000003
> error-gt1-correctable-eu-ic                             0x1000000000000004
> error-gt1-correctable-eu-grf                            0x1000000000000005
> error-gt1-fatal-guc                                     0x1000000000000009
> error-gt1-fatal-slm                                     0x100000000000000d
> error-gt1-fatal-eu-grf                                  0x100000000000000f
> error-gt1-fatal-fpu                                     0x1000000000000010
> error-gt1-fatal-tlb                                     0x1000000000000011
> error-gt1-fatal-l3-fabric                               0x1000000000000012
> error-gt1-correctable-subslice                          0x1000000000000013
> error-gt1-correctable-l3bank                            0x1000000000000014
> error-gt1-fatal-subslice                                0x1000000000000015
> error-gt1-fatal-l3bank                                  0x1000000000000016
> error-gt1-sgunit-correctable                            0x1000000000000017
> error-gt1-sgunit-nonfatal                               0x1000000000000018
> error-gt1-sgunit-fatal                                  0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> error-gt1-soc-fatal-punit                               0x100000000000001d
> error-gt1-soc-fatal-psf-0                               0x100000000000001e
> error-gt1-soc-fatal-psf-1                               0x100000000000001f
> error-gt1-soc-fatal-psf-2                               0x1000000000000020
> error-gt1-soc-fatal-cd0                                 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
>
>
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register a genl netlink family
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
>
>  drivers/gpu/drm/xe/Makefile          |   1 +
>  drivers/gpu/drm/xe/xe_device.c       |   3 +
>  drivers/gpu/drm/xe/xe_device_types.h |   2 +
>  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
>  drivers/gpu/drm/xe/xe_module.c       |   2 +
>  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
>  include/uapi/drm/drm_netlink.h       |  81 +++++
>  include/uapi/drm/xe_drm.h            |  64 ++++
>  9 files changed, 725 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
> --
> 2.25.1
>



* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-06-21 17:24 ` Sebastian Wick
@ 2023-07-17 12:02   ` Oded Gabbay
  0 siblings, 0 replies; 18+ messages in thread
From: Oded Gabbay @ 2023-07-17 12:02 UTC (permalink / raw)
  To: Sebastian Wick; +Cc: alexander.deucher, dri-devel, intel-xe, Aravind Iddamsetty

On Wed, Jun 21, 2023 at 8:24 PM Sebastian Wick
<sebastian.wick@redhat.com> wrote:
>
> On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty
> <aravind.iddamsetty@intel.com> wrote:
> >
> > Our hardware supports RAS (Reliability, Availability, Serviceability) by
> > exposing a set of error counters which can be used by observability
> > tools to take corrective actions or repairs. Traditionally these were
> > exposed via PMU (for relative counters) and a sysfs interface (for
> > absolute values) in our internal branch. But that approach needs two
> > interfaces and offers neither event-based reporting nor
> > configurability, so the community suggested trying netlink as a
> > drm-subsystem-wide UAPI for RAS and telemetry, as discussed in [1].
> >
> > [1] is the inspiration for this series. It uses the generic
> > netlink (genl) family subsystem and exposes a set of commands that can
> > be used by every drm driver; the framework also provides a means to add
> > custom commands. Each drm driver instance (in this example, an xe driver
> > instance) registers a family and operations with the genl subsystem,
> > through which it enumerates and reports the error counters. Event-based
> > notification is also supported: userspace can subscribe and be notified
> > when any error occurs, then read the error counter, which avoids
> > continuously polling the error counters. This can also be extended to
> > threshold-based notification.
>
> Be aware that netlink can be quite awkward in user space because it's
> attached to the netns while the device is in the mount ns and there
> are special rules for netlink regarding namespacing.
I agree. We need to be sure this works in all common deployments,
mainly Docker and Kubernetes, before deciding to go down this path.
Oded
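
One way to check the namespace concern for a given deployment is to
compare namespace identities of two processes, for example a monitoring
tool on the host and a workload inside a container. The following is a
minimal sketch, not part of the series: it assumes a Linux /proc
filesystem, the helper names ns_id and same_ns are hypothetical, and
inspecting another user's process may need extra privileges.

```python
import os
import sys


def ns_id(pid="self", ns="net"):
    """Return the (device, inode) pair that identifies a namespace of `pid`."""
    st = os.stat(f"/proc/{pid}/ns/{ns}")
    return (st.st_dev, st.st_ino)


def same_ns(pid_a, pid_b, ns="net"):
    """True if both processes are in the same namespace of type `ns`."""
    return ns_id(pid_a, ns) == ns_id(pid_b, ns)


if __name__ == "__main__":
    # Usage (illustrative): python3 nscheck.py <pid>
    other = sys.argv[1] if len(sys.argv) > 1 else "self"
    for ns in ("net", "mnt"):
        try:
            shared = same_ns("self", other, ns)
        except OSError as exc:
            print(f"{ns}: cannot inspect pid {other} ({exc.strerror})")
        else:
            state = "shared" if shared else "separate"
            print(f"{ns} namespace: {state}")
```

If the net namespace is reported as separate while the mnt namespace
still exposes the device node, that is the awkward case described
above: the process can open /dev/dri/cardN but its netlink socket lives
in a different network namespace.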

>
> > [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> >
> > this series is on top of https://patchwork.freedesktop.org/series/116181/
> >
> > Below is an example tool drm_ras which demonstrates the use of the
> > supported commands. The tool will be sent to ML with the subject
> > "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"
> >
> > read single error counter:
> >
> > $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> > counter value 0
> >
> > read all error counters:
> >
> > $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> > name                                                    config-id               counter
> >
> > error-gt0-correctable-guc                               0x0000000000000001      0
> > error-gt0-correctable-slm                               0x0000000000000003      0
> > error-gt0-correctable-eu-ic                             0x0000000000000004      0
> > error-gt0-correctable-eu-grf                            0x0000000000000005      0
> > error-gt0-fatal-guc                                     0x0000000000000009      0
> > error-gt0-fatal-slm                                     0x000000000000000d      0
> > error-gt0-fatal-eu-grf                                  0x000000000000000f      0
> > error-gt0-fatal-fpu                                     0x0000000000000010      0
> > error-gt0-fatal-tlb                                     0x0000000000000011      0
> > error-gt0-fatal-l3-fabric                               0x0000000000000012      0
> > error-gt0-correctable-subslice                          0x0000000000000013      0
> > error-gt0-correctable-l3bank                            0x0000000000000014      0
> > error-gt0-fatal-subslice                                0x0000000000000015      0
> > error-gt0-fatal-l3bank                                  0x0000000000000016      0
> > error-gt0-sgunit-correctable                            0x0000000000000017      0
> > error-gt0-sgunit-nonfatal                               0x0000000000000018      0
> > error-gt0-sgunit-fatal                                  0x0000000000000019      0
> > error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a      0
> > error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b      0
> > error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c      0
> > error-gt0-soc-fatal-punit                               0x000000000000001d      0
> > error-gt0-soc-fatal-psf-0                               0x000000000000001e      0
> > error-gt0-soc-fatal-psf-1                               0x000000000000001f      0
> > error-gt0-soc-fatal-psf-2                               0x0000000000000020      0
> > error-gt0-soc-fatal-cd0                                 0x0000000000000021      0
> > error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022      0
> > error-gt0-soc-fatal-mdfi-east                           0x0000000000000023      0
> > error-gt0-soc-fatal-mdfi-south                          0x0000000000000024      0
> > error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025      0
> > error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026      0
> > error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027      0
> > error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028      0
> > error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029      0
> > error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a      0
> > error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b      0
> > error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c      0
> > error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d      0
> > error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e      0
> > error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f      0
> > error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030      0
> > error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031      0
> > error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032      0
> > error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033      0
> > error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034      0
> > error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035      0
> > error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036      0
> > error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037      0
> > error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038      0
> > error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039      0
> > error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a      0
> > error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b      0
> > error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c      0
> > error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d      0
> > error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e      0
> > error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f      0
> > error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040      0
> > error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041      0
> > error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042      0
> > error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043      0
> > error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044      0
> > error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045      0
> > error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046      0
> > error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047      0
> > error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048      0
> > error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049      0
> > error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a      0
> > error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b      0
> > error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c      0
> > error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d      0
> > error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e      0
> > error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f      0
> > error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050      0
> > error-gt1-correctable-guc                               0x1000000000000001      0
> > error-gt1-correctable-slm                               0x1000000000000003      0
> > error-gt1-correctable-eu-ic                             0x1000000000000004      0
> > error-gt1-correctable-eu-grf                            0x1000000000000005      0
> > error-gt1-fatal-guc                                     0x1000000000000009      0
> > error-gt1-fatal-slm                                     0x100000000000000d      0
> > error-gt1-fatal-eu-grf                                  0x100000000000000f      0
> > error-gt1-fatal-fpu                                     0x1000000000000010      0
> > error-gt1-fatal-tlb                                     0x1000000000000011      0
> > error-gt1-fatal-l3-fabric                               0x1000000000000012      0
> > error-gt1-correctable-subslice                          0x1000000000000013      0
> > error-gt1-correctable-l3bank                            0x1000000000000014      0
> > error-gt1-fatal-subslice                                0x1000000000000015      0
> > error-gt1-fatal-l3bank                                  0x1000000000000016      0
> > error-gt1-sgunit-correctable                            0x1000000000000017      0
> > error-gt1-sgunit-nonfatal                               0x1000000000000018      0
> > error-gt1-sgunit-fatal                                  0x1000000000000019      0
> > error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a      0
> > error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b      0
> > error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c      0
> > error-gt1-soc-fatal-punit                               0x100000000000001d      0
> > error-gt1-soc-fatal-psf-0                               0x100000000000001e      0
> > error-gt1-soc-fatal-psf-1                               0x100000000000001f      0
> > error-gt1-soc-fatal-psf-2                               0x1000000000000020      0
> > error-gt1-soc-fatal-cd0                                 0x1000000000000021      0
> > error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022      0
> > error-gt1-soc-fatal-mdfi-east                           0x1000000000000023      0
> > error-gt1-soc-fatal-mdfi-south                          0x1000000000000024      0
> > error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025      0
> > error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026      0
> > error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027      0
> > error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028      0
> > error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029      0
> > error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a      0
> > error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b      0
> > error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c      0
> > error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d      0
> > error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e      0
> > error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f      0
> > error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030      0
> > error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031      0
> > error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032      0
> > error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033      0
> > error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034      0
> > error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035      0
> > error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036      0
> > error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037      0
> > error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038      0
> > error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039      0
> > error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a      0
> > error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b      0
> > error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c      0
> > error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d      0
> > error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e      0
> > error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f      0
> > error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040      0
> > error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041      0
> > error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042      0
> > error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043      0
> > error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044      0
> >
> > wait on an error event:
> >
> > $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> > waiting for error event
> > error event received
> > counter value 0
> >
> > list all errors:
> >
> > $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> > name                                                    config-id
> >
> > error-gt0-correctable-guc                               0x0000000000000001
> > error-gt0-correctable-slm                               0x0000000000000003
> > error-gt0-correctable-eu-ic                             0x0000000000000004
> > error-gt0-correctable-eu-grf                            0x0000000000000005
> > error-gt0-fatal-guc                                     0x0000000000000009
> > error-gt0-fatal-slm                                     0x000000000000000d
> > error-gt0-fatal-eu-grf                                  0x000000000000000f
> > error-gt0-fatal-fpu                                     0x0000000000000010
> > error-gt0-fatal-tlb                                     0x0000000000000011
> > error-gt0-fatal-l3-fabric                               0x0000000000000012
> > error-gt0-correctable-subslice                          0x0000000000000013
> > error-gt0-correctable-l3bank                            0x0000000000000014
> > error-gt0-fatal-subslice                                0x0000000000000015
> > error-gt0-fatal-l3bank                                  0x0000000000000016
> > error-gt0-sgunit-correctable                            0x0000000000000017
> > error-gt0-sgunit-nonfatal                               0x0000000000000018
> > error-gt0-sgunit-fatal                                  0x0000000000000019
> > error-gt0-soc-fatal-psf-csc-0                           0x000000000000001a
> > error-gt0-soc-fatal-psf-csc-1                           0x000000000000001b
> > error-gt0-soc-fatal-psf-csc-2                           0x000000000000001c
> > error-gt0-soc-fatal-punit                               0x000000000000001d
> > error-gt0-soc-fatal-psf-0                               0x000000000000001e
> > error-gt0-soc-fatal-psf-1                               0x000000000000001f
> > error-gt0-soc-fatal-psf-2                               0x0000000000000020
> > error-gt0-soc-fatal-cd0                                 0x0000000000000021
> > error-gt0-soc-fatal-cd0-mdfi                            0x0000000000000022
> > error-gt0-soc-fatal-mdfi-east                           0x0000000000000023
> > error-gt0-soc-fatal-mdfi-south                          0x0000000000000024
> > error-gt0-soc-fatal-hbm-ss0-0                           0x0000000000000025
> > error-gt0-soc-fatal-hbm-ss0-1                           0x0000000000000026
> > error-gt0-soc-fatal-hbm-ss0-2                           0x0000000000000027
> > error-gt0-soc-fatal-hbm-ss0-3                           0x0000000000000028
> > error-gt0-soc-fatal-hbm-ss0-4                           0x0000000000000029
> > error-gt0-soc-fatal-hbm-ss0-5                           0x000000000000002a
> > error-gt0-soc-fatal-hbm-ss0-6                           0x000000000000002b
> > error-gt0-soc-fatal-hbm-ss0-7                           0x000000000000002c
> > error-gt0-soc-fatal-hbm-ss1-0                           0x000000000000002d
> > error-gt0-soc-fatal-hbm-ss1-1                           0x000000000000002e
> > error-gt0-soc-fatal-hbm-ss1-2                           0x000000000000002f
> > error-gt0-soc-fatal-hbm-ss1-3                           0x0000000000000030
> > error-gt0-soc-fatal-hbm-ss1-4                           0x0000000000000031
> > error-gt0-soc-fatal-hbm-ss1-5                           0x0000000000000032
> > error-gt0-soc-fatal-hbm-ss1-6                           0x0000000000000033
> > error-gt0-soc-fatal-hbm-ss1-7                           0x0000000000000034
> > error-gt0-soc-fatal-hbm-ss2-0                           0x0000000000000035
> > error-gt0-soc-fatal-hbm-ss2-1                           0x0000000000000036
> > error-gt0-soc-fatal-hbm-ss2-2                           0x0000000000000037
> > error-gt0-soc-fatal-hbm-ss2-3                           0x0000000000000038
> > error-gt0-soc-fatal-hbm-ss2-4                           0x0000000000000039
> > error-gt0-soc-fatal-hbm-ss2-5                           0x000000000000003a
> > error-gt0-soc-fatal-hbm-ss2-6                           0x000000000000003b
> > error-gt0-soc-fatal-hbm-ss2-7                           0x000000000000003c
> > error-gt0-soc-fatal-hbm-ss3-0                           0x000000000000003d
> > error-gt0-soc-fatal-hbm-ss3-1                           0x000000000000003e
> > error-gt0-soc-fatal-hbm-ss3-2                           0x000000000000003f
> > error-gt0-soc-fatal-hbm-ss3-3                           0x0000000000000040
> > error-gt0-soc-fatal-hbm-ss3-4                           0x0000000000000041
> > error-gt0-soc-fatal-hbm-ss3-5                           0x0000000000000042
> > error-gt0-soc-fatal-hbm-ss3-6                           0x0000000000000043
> > error-gt0-soc-fatal-hbm-ss3-7                           0x0000000000000044
> > error-gt0-gsc-correctable-sram-ecc                      0x0000000000000045
> > error-gt0-gsc-nonfatal-mia-shutdown                     0x0000000000000046
> > error-gt0-gsc-nonfatal-mia-int                          0x0000000000000047
> > error-gt0-gsc-nonfatal-sram-ecc                         0x0000000000000048
> > error-gt0-gsc-nonfatal-wdg-timeout                      0x0000000000000049
> > error-gt0-gsc-nonfatal-rom-parity                       0x000000000000004a
> > error-gt0-gsc-nonfatal-ucode-parity                     0x000000000000004b
> > error-gt0-gsc-nonfatal-glitch-det                       0x000000000000004c
> > error-gt0-gsc-nonfatal-fuse-pull                        0x000000000000004d
> > error-gt0-gsc-nonfatal-fuse-crc-check                   0x000000000000004e
> > error-gt0-gsc-nonfatal-selfmbist                        0x000000000000004f
> > error-gt0-gsc-nonfatal-aon-parity                       0x0000000000000050
> > error-gt1-correctable-guc                               0x1000000000000001
> > error-gt1-correctable-slm                               0x1000000000000003
> > error-gt1-correctable-eu-ic                             0x1000000000000004
> > error-gt1-correctable-eu-grf                            0x1000000000000005
> > error-gt1-fatal-guc                                     0x1000000000000009
> > error-gt1-fatal-slm                                     0x100000000000000d
> > error-gt1-fatal-eu-grf                                  0x100000000000000f
> > error-gt1-fatal-fpu                                     0x1000000000000010
> > error-gt1-fatal-tlb                                     0x1000000000000011
> > error-gt1-fatal-l3-fabric                               0x1000000000000012
> > error-gt1-correctable-subslice                          0x1000000000000013
> > error-gt1-correctable-l3bank                            0x1000000000000014
> > error-gt1-fatal-subslice                                0x1000000000000015
> > error-gt1-fatal-l3bank                                  0x1000000000000016
> > error-gt1-sgunit-correctable                            0x1000000000000017
> > error-gt1-sgunit-nonfatal                               0x1000000000000018
> > error-gt1-sgunit-fatal                                  0x1000000000000019
> > error-gt1-soc-fatal-psf-csc-0                           0x100000000000001a
> > error-gt1-soc-fatal-psf-csc-1                           0x100000000000001b
> > error-gt1-soc-fatal-psf-csc-2                           0x100000000000001c
> > error-gt1-soc-fatal-punit                               0x100000000000001d
> > error-gt1-soc-fatal-psf-0                               0x100000000000001e
> > error-gt1-soc-fatal-psf-1                               0x100000000000001f
> > error-gt1-soc-fatal-psf-2                               0x1000000000000020
> > error-gt1-soc-fatal-cd0                                 0x1000000000000021
> > error-gt1-soc-fatal-cd0-mdfi                            0x1000000000000022
> > error-gt1-soc-fatal-mdfi-east                           0x1000000000000023
> > error-gt1-soc-fatal-mdfi-south                          0x1000000000000024
> > error-gt1-soc-fatal-hbm-ss0-0                           0x1000000000000025
> > error-gt1-soc-fatal-hbm-ss0-1                           0x1000000000000026
> > error-gt1-soc-fatal-hbm-ss0-2                           0x1000000000000027
> > error-gt1-soc-fatal-hbm-ss0-3                           0x1000000000000028
> > error-gt1-soc-fatal-hbm-ss0-4                           0x1000000000000029
> > error-gt1-soc-fatal-hbm-ss0-5                           0x100000000000002a
> > error-gt1-soc-fatal-hbm-ss0-6                           0x100000000000002b
> > error-gt1-soc-fatal-hbm-ss0-7                           0x100000000000002c
> > error-gt1-soc-fatal-hbm-ss1-0                           0x100000000000002d
> > error-gt1-soc-fatal-hbm-ss1-1                           0x100000000000002e
> > error-gt1-soc-fatal-hbm-ss1-2                           0x100000000000002f
> > error-gt1-soc-fatal-hbm-ss1-3                           0x1000000000000030
> > error-gt1-soc-fatal-hbm-ss1-4                           0x1000000000000031
> > error-gt1-soc-fatal-hbm-ss1-5                           0x1000000000000032
> > error-gt1-soc-fatal-hbm-ss1-6                           0x1000000000000033
> > error-gt1-soc-fatal-hbm-ss1-7                           0x1000000000000034
> > error-gt1-soc-fatal-hbm-ss2-0                           0x1000000000000035
> > error-gt1-soc-fatal-hbm-ss2-1                           0x1000000000000036
> > error-gt1-soc-fatal-hbm-ss2-2                           0x1000000000000037
> > error-gt1-soc-fatal-hbm-ss2-3                           0x1000000000000038
> > error-gt1-soc-fatal-hbm-ss2-4                           0x1000000000000039
> > error-gt1-soc-fatal-hbm-ss2-5                           0x100000000000003a
> > error-gt1-soc-fatal-hbm-ss2-6                           0x100000000000003b
> > error-gt1-soc-fatal-hbm-ss2-7                           0x100000000000003c
> > error-gt1-soc-fatal-hbm-ss3-0                           0x100000000000003d
> > error-gt1-soc-fatal-hbm-ss3-1                           0x100000000000003e
> > error-gt1-soc-fatal-hbm-ss3-2                           0x100000000000003f
> > error-gt1-soc-fatal-hbm-ss3-3                           0x1000000000000040
> > error-gt1-soc-fatal-hbm-ss3-4                           0x1000000000000041
> > error-gt1-soc-fatal-hbm-ss3-5                           0x1000000000000042
> > error-gt1-soc-fatal-hbm-ss3-6                           0x1000000000000043
> > error-gt1-soc-fatal-hbm-ss3-7                           0x1000000000000044
> >
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Daniel Vetter <daniel@ffwll.ch>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > Cc: Oded Gabbay <ogabbay@kernel.org>
> >
> >
> > Aravind Iddamsetty (5):
> >   drm/netlink: Add netlink infrastructure
> >   drm/xe/RAS: Register a genl netlink family
> >   drm/xe/RAS: Expose the error counters
> >   drm/netlink: define multicast groups
> >   drm/xe/RAS: send multicast event on occurrence of an error
> >
> >  drivers/gpu/drm/xe/Makefile          |   1 +
> >  drivers/gpu/drm/xe/xe_device.c       |   3 +
> >  drivers/gpu/drm/xe/xe_device_types.h |   2 +
> >  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
> >  drivers/gpu/drm/xe/xe_module.c       |   2 +
> >  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
> >  include/uapi/drm/drm_netlink.h       |  81 +++++
> >  include/uapi/drm/xe_drm.h            |  64 ++++
> >  9 files changed, 725 insertions(+)
> >  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
> >  create mode 100644 include/uapi/drm/drm_netlink.h
> >
> > --
> > 2.25.1
> >
>



Thread overview: 18+ messages
2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
2023-06-04 17:07   ` [Intel-xe] " Tomer Tayar
2023-06-05 17:18     ` Iddamsetty, Aravind
2023-06-06 14:04       ` Tomer Tayar
2023-06-21  6:40         ` Iddamsetty, Aravind
2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
2023-06-04 17:09   ` [Intel-xe] " Tomer Tayar
2023-06-05 17:21     ` Iddamsetty, Aravind
2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
2023-06-05 17:17   ` Iddamsetty, Aravind
2023-06-05 16:47 ` Alex Deucher
2023-06-06 11:56   ` Iddamsetty, Aravind
2023-06-21 17:24 ` Sebastian Wick
2023-07-17 12:02   ` Oded Gabbay
