* [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
@ 2023-05-26 16:20 Aravind Iddamsetty
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Our hardware supports RAS (Reliability, Availability, Serviceability) by
exposing a set of error counters which can be used by observability tools
to take corrective actions or repairs. Traditionally these were exposed
via PMU (for relative counters) and a sysfs interface (for absolute
values) in our internal branch. However, that approach needs two
interfaces and offers neither event based reporting nor configurability,
so the community suggested trying netlink as a drm subsystem wide UAPI
for RAS and telemetry, as discussed in [1].

This [1] is the inspiration for this series. It uses the generic netlink
(genl) family subsystem and exposes a set of commands that can be used by
every drm driver; the framework also provides a means to add custom
commands. Each drm driver instance (in this example, an xe driver
instance) registers a family and operations with the genl subsystem,
through which it enumerates and reports the error counters. Event based
notification is also supported: userspace can subscribe to it, be
notified when any error occurs, and then read the error counter, which
avoids continuous polling of the error counters. This can also be
extended to threshold based notification.

[1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

This series is on top of https://patchwork.freedesktop.org/series/116181/

Below is an example tool, drm_ras, which demonstrates the use of the
supported commands.
The tool will be sent to ML with the subject
"[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters"

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name  config-id  counter

error-gt0-correctable-guc  0x0000000000000001  0
error-gt0-correctable-slm  0x0000000000000003  0
error-gt0-correctable-eu-ic  0x0000000000000004  0
error-gt0-correctable-eu-grf  0x0000000000000005  0
error-gt0-fatal-guc  0x0000000000000009  0
error-gt0-fatal-slm  0x000000000000000d  0
error-gt0-fatal-eu-grf  0x000000000000000f  0
error-gt0-fatal-fpu  0x0000000000000010  0
error-gt0-fatal-tlb  0x0000000000000011  0
error-gt0-fatal-l3-fabric  0x0000000000000012  0
error-gt0-correctable-subslice  0x0000000000000013  0
error-gt0-correctable-l3bank  0x0000000000000014  0
error-gt0-fatal-subslice  0x0000000000000015  0
error-gt0-fatal-l3bank  0x0000000000000016  0
error-gt0-sgunit-correctable  0x0000000000000017  0
error-gt0-sgunit-nonfatal  0x0000000000000018  0
error-gt0-sgunit-fatal  0x0000000000000019  0
error-gt0-soc-fatal-psf-csc-0  0x000000000000001a  0
error-gt0-soc-fatal-psf-csc-1  0x000000000000001b  0
error-gt0-soc-fatal-psf-csc-2  0x000000000000001c  0
error-gt0-soc-fatal-punit  0x000000000000001d  0
error-gt0-soc-fatal-psf-0  0x000000000000001e  0
error-gt0-soc-fatal-psf-1  0x000000000000001f  0
error-gt0-soc-fatal-psf-2  0x0000000000000020  0
error-gt0-soc-fatal-cd0  0x0000000000000021  0
error-gt0-soc-fatal-cd0-mdfi  0x0000000000000022  0
error-gt0-soc-fatal-mdfi-east  0x0000000000000023  0
error-gt0-soc-fatal-mdfi-south  0x0000000000000024  0
error-gt0-soc-fatal-hbm-ss0-0  0x0000000000000025  0
error-gt0-soc-fatal-hbm-ss0-1  0x0000000000000026  0
error-gt0-soc-fatal-hbm-ss0-2  0x0000000000000027  0
error-gt0-soc-fatal-hbm-ss0-3  0x0000000000000028  0
error-gt0-soc-fatal-hbm-ss0-4  0x0000000000000029  0
error-gt0-soc-fatal-hbm-ss0-5  0x000000000000002a  0
error-gt0-soc-fatal-hbm-ss0-6  0x000000000000002b  0
error-gt0-soc-fatal-hbm-ss0-7  0x000000000000002c  0
error-gt0-soc-fatal-hbm-ss1-0  0x000000000000002d  0
error-gt0-soc-fatal-hbm-ss1-1  0x000000000000002e  0
error-gt0-soc-fatal-hbm-ss1-2  0x000000000000002f  0
error-gt0-soc-fatal-hbm-ss1-3  0x0000000000000030  0
error-gt0-soc-fatal-hbm-ss1-4  0x0000000000000031  0
error-gt0-soc-fatal-hbm-ss1-5  0x0000000000000032  0
error-gt0-soc-fatal-hbm-ss1-6  0x0000000000000033  0
error-gt0-soc-fatal-hbm-ss1-7  0x0000000000000034  0
error-gt0-soc-fatal-hbm-ss2-0  0x0000000000000035  0
error-gt0-soc-fatal-hbm-ss2-1  0x0000000000000036  0
error-gt0-soc-fatal-hbm-ss2-2  0x0000000000000037  0
error-gt0-soc-fatal-hbm-ss2-3  0x0000000000000038  0
error-gt0-soc-fatal-hbm-ss2-4  0x0000000000000039  0
error-gt0-soc-fatal-hbm-ss2-5  0x000000000000003a  0
error-gt0-soc-fatal-hbm-ss2-6  0x000000000000003b  0
error-gt0-soc-fatal-hbm-ss2-7  0x000000000000003c  0
error-gt0-soc-fatal-hbm-ss3-0  0x000000000000003d  0
error-gt0-soc-fatal-hbm-ss3-1  0x000000000000003e  0
error-gt0-soc-fatal-hbm-ss3-2  0x000000000000003f  0
error-gt0-soc-fatal-hbm-ss3-3  0x0000000000000040  0
error-gt0-soc-fatal-hbm-ss3-4  0x0000000000000041  0
error-gt0-soc-fatal-hbm-ss3-5  0x0000000000000042  0
error-gt0-soc-fatal-hbm-ss3-6  0x0000000000000043  0
error-gt0-soc-fatal-hbm-ss3-7  0x0000000000000044  0
error-gt0-gsc-correctable-sram-ecc  0x0000000000000045  0
error-gt0-gsc-nonfatal-mia-shutdown  0x0000000000000046  0
error-gt0-gsc-nonfatal-mia-int  0x0000000000000047  0
error-gt0-gsc-nonfatal-sram-ecc  0x0000000000000048  0
error-gt0-gsc-nonfatal-wdg-timeout  0x0000000000000049  0
error-gt0-gsc-nonfatal-rom-parity  0x000000000000004a  0
error-gt0-gsc-nonfatal-ucode-parity  0x000000000000004b  0
error-gt0-gsc-nonfatal-glitch-det  0x000000000000004c  0
error-gt0-gsc-nonfatal-fuse-pull  0x000000000000004d  0
error-gt0-gsc-nonfatal-fuse-crc-check  0x000000000000004e  0
error-gt0-gsc-nonfatal-selfmbist  0x000000000000004f  0
error-gt0-gsc-nonfatal-aon-parity  0x0000000000000050  0
error-gt1-correctable-guc  0x1000000000000001  0
error-gt1-correctable-slm  0x1000000000000003  0
error-gt1-correctable-eu-ic  0x1000000000000004  0
error-gt1-correctable-eu-grf  0x1000000000000005  0
error-gt1-fatal-guc  0x1000000000000009  0
error-gt1-fatal-slm  0x100000000000000d  0
error-gt1-fatal-eu-grf  0x100000000000000f  0
error-gt1-fatal-fpu  0x1000000000000010  0
error-gt1-fatal-tlb  0x1000000000000011  0
error-gt1-fatal-l3-fabric  0x1000000000000012  0
error-gt1-correctable-subslice  0x1000000000000013  0
error-gt1-correctable-l3bank  0x1000000000000014  0
error-gt1-fatal-subslice  0x1000000000000015  0
error-gt1-fatal-l3bank  0x1000000000000016  0
error-gt1-sgunit-correctable  0x1000000000000017  0
error-gt1-sgunit-nonfatal  0x1000000000000018  0
error-gt1-sgunit-fatal  0x1000000000000019  0
error-gt1-soc-fatal-psf-csc-0  0x100000000000001a  0
error-gt1-soc-fatal-psf-csc-1  0x100000000000001b  0
error-gt1-soc-fatal-psf-csc-2  0x100000000000001c  0
error-gt1-soc-fatal-punit  0x100000000000001d  0
error-gt1-soc-fatal-psf-0  0x100000000000001e  0
error-gt1-soc-fatal-psf-1  0x100000000000001f  0
error-gt1-soc-fatal-psf-2  0x1000000000000020  0
error-gt1-soc-fatal-cd0  0x1000000000000021  0
error-gt1-soc-fatal-cd0-mdfi  0x1000000000000022  0
error-gt1-soc-fatal-mdfi-east  0x1000000000000023  0
error-gt1-soc-fatal-mdfi-south  0x1000000000000024  0
error-gt1-soc-fatal-hbm-ss0-0  0x1000000000000025  0
error-gt1-soc-fatal-hbm-ss0-1  0x1000000000000026  0
error-gt1-soc-fatal-hbm-ss0-2  0x1000000000000027  0
error-gt1-soc-fatal-hbm-ss0-3  0x1000000000000028  0
error-gt1-soc-fatal-hbm-ss0-4  0x1000000000000029  0
error-gt1-soc-fatal-hbm-ss0-5  0x100000000000002a  0
error-gt1-soc-fatal-hbm-ss0-6  0x100000000000002b  0
error-gt1-soc-fatal-hbm-ss0-7  0x100000000000002c  0
error-gt1-soc-fatal-hbm-ss1-0  0x100000000000002d  0
error-gt1-soc-fatal-hbm-ss1-1  0x100000000000002e  0
error-gt1-soc-fatal-hbm-ss1-2  0x100000000000002f  0
error-gt1-soc-fatal-hbm-ss1-3  0x1000000000000030  0
error-gt1-soc-fatal-hbm-ss1-4  0x1000000000000031  0
error-gt1-soc-fatal-hbm-ss1-5  0x1000000000000032  0
error-gt1-soc-fatal-hbm-ss1-6  0x1000000000000033  0
error-gt1-soc-fatal-hbm-ss1-7  0x1000000000000034  0
error-gt1-soc-fatal-hbm-ss2-0  0x1000000000000035  0
error-gt1-soc-fatal-hbm-ss2-1  0x1000000000000036  0
error-gt1-soc-fatal-hbm-ss2-2  0x1000000000000037  0
error-gt1-soc-fatal-hbm-ss2-3  0x1000000000000038  0
error-gt1-soc-fatal-hbm-ss2-4  0x1000000000000039  0
error-gt1-soc-fatal-hbm-ss2-5  0x100000000000003a  0
error-gt1-soc-fatal-hbm-ss2-6  0x100000000000003b  0
error-gt1-soc-fatal-hbm-ss2-7  0x100000000000003c  0
error-gt1-soc-fatal-hbm-ss3-0  0x100000000000003d  0
error-gt1-soc-fatal-hbm-ss3-1  0x100000000000003e  0
error-gt1-soc-fatal-hbm-ss3-2  0x100000000000003f  0
error-gt1-soc-fatal-hbm-ss3-3  0x1000000000000040  0
error-gt1-soc-fatal-hbm-ss3-4  0x1000000000000041  0
error-gt1-soc-fatal-hbm-ss3-5  0x1000000000000042  0
error-gt1-soc-fatal-hbm-ss3-6  0x1000000000000043  0
error-gt1-soc-fatal-hbm-ss3-7  0x1000000000000044  0

wait on an error event:

$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0

list all errors:

$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name  config-id

error-gt0-correctable-guc  0x0000000000000001
error-gt0-correctable-slm  0x0000000000000003
error-gt0-correctable-eu-ic  0x0000000000000004
error-gt0-correctable-eu-grf  0x0000000000000005
error-gt0-fatal-guc  0x0000000000000009
error-gt0-fatal-slm  0x000000000000000d
error-gt0-fatal-eu-grf  0x000000000000000f
error-gt0-fatal-fpu  0x0000000000000010
error-gt0-fatal-tlb  0x0000000000000011
error-gt0-fatal-l3-fabric  0x0000000000000012
error-gt0-correctable-subslice  0x0000000000000013
error-gt0-correctable-l3bank  0x0000000000000014
error-gt0-fatal-subslice  0x0000000000000015
error-gt0-fatal-l3bank  0x0000000000000016
error-gt0-sgunit-correctable  0x0000000000000017
error-gt0-sgunit-nonfatal  0x0000000000000018
error-gt0-sgunit-fatal  0x0000000000000019
error-gt0-soc-fatal-psf-csc-0  0x000000000000001a
error-gt0-soc-fatal-psf-csc-1  0x000000000000001b
error-gt0-soc-fatal-psf-csc-2  0x000000000000001c
error-gt0-soc-fatal-punit  0x000000000000001d
error-gt0-soc-fatal-psf-0  0x000000000000001e
error-gt0-soc-fatal-psf-1  0x000000000000001f
error-gt0-soc-fatal-psf-2  0x0000000000000020
error-gt0-soc-fatal-cd0  0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi  0x0000000000000022
error-gt0-soc-fatal-mdfi-east  0x0000000000000023
error-gt0-soc-fatal-mdfi-south  0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0  0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1  0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2  0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3  0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4  0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5  0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6  0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7  0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0  0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1  0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2  0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3  0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4  0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5  0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6  0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7  0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0  0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1  0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2  0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3  0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4  0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5  0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6  0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7  0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0  0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1  0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2  0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3  0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4  0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5  0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6  0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7  0x0000000000000044
error-gt0-gsc-correctable-sram-ecc  0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown  0x0000000000000046
error-gt0-gsc-nonfatal-mia-int  0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc  0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout  0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity  0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity  0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det  0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull  0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check  0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist  0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity  0x0000000000000050
error-gt1-correctable-guc  0x1000000000000001
error-gt1-correctable-slm  0x1000000000000003
error-gt1-correctable-eu-ic  0x1000000000000004
error-gt1-correctable-eu-grf  0x1000000000000005
error-gt1-fatal-guc  0x1000000000000009
error-gt1-fatal-slm  0x100000000000000d
error-gt1-fatal-eu-grf  0x100000000000000f
error-gt1-fatal-fpu  0x1000000000000010
error-gt1-fatal-tlb  0x1000000000000011
error-gt1-fatal-l3-fabric  0x1000000000000012
error-gt1-correctable-subslice  0x1000000000000013
error-gt1-correctable-l3bank  0x1000000000000014
error-gt1-fatal-subslice  0x1000000000000015
error-gt1-fatal-l3bank  0x1000000000000016
error-gt1-sgunit-correctable  0x1000000000000017
error-gt1-sgunit-nonfatal  0x1000000000000018
error-gt1-sgunit-fatal  0x1000000000000019
error-gt1-soc-fatal-psf-csc-0  0x100000000000001a
error-gt1-soc-fatal-psf-csc-1  0x100000000000001b
error-gt1-soc-fatal-psf-csc-2  0x100000000000001c
error-gt1-soc-fatal-punit  0x100000000000001d
error-gt1-soc-fatal-psf-0  0x100000000000001e
error-gt1-soc-fatal-psf-1  0x100000000000001f
error-gt1-soc-fatal-psf-2  0x1000000000000020
error-gt1-soc-fatal-cd0  0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi  0x1000000000000022
error-gt1-soc-fatal-mdfi-east  0x1000000000000023
error-gt1-soc-fatal-mdfi-south  0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0  0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1  0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2  0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3  0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4  0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5  0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6  0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7  0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0  0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1  0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2  0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3  0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4  0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5  0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6  0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7  0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0  0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1  0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2  0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3  0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4  0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5  0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6  0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7  0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0  0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1  0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2  0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3  0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4  0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5  0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6  0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7  0x1000000000000044

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>

Aravind Iddamsetty (5):
  drm/netlink: Add netlink infrastructure
  drm/xe/RAS: Register a genl netlink family
  drm/xe/RAS: Expose the error counters
  drm/netlink: define multicast groups
  drm/xe/RAS: send multicast event on occurrence of an error

 drivers/gpu/drm/xe/Makefile          |   1 +
 drivers/gpu/drm/xe/xe_device.c       |   3 +
 drivers/gpu/drm/xe/xe_device_types.h |   2 +
 drivers/gpu/drm/xe/xe_irq.c          |  32 ++
 drivers/gpu/drm/xe/xe_module.c       |   2 +
 drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_netlink.h      |  14 +
 include/uapi/drm/drm_netlink.h       |  81 +++++
 include/uapi/drm/xe_drm.h            |  64 ++++
 9 files changed, 725 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

--
2.25.1

^ permalink raw reply	[flat|nested] 20+ messages in thread
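
A note on reading the config-ids in the sample output above: the gt0
entries all start with 0x0... and the gt1 entries with 0x1..., so the GT
index appears to sit in the top bits of the 64-bit id, with the error
number in the low bits. The cover letter does not state this layout, so
the following is only a sketch under that assumption. The 4-bit field
width and every HYP_/hyp_ name are hypothetical and are not part of the
proposed UAPI.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION (inferred from the sample listing only): the top 4 bits of a
 * config-id hold the GT index and the remaining 60 bits hold the error
 * number. Neither the shift nor the names below come from the series.
 */
#define HYP_GT_SHIFT	60
#define HYP_ERR_MASK	((UINT64_C(1) << HYP_GT_SHIFT) - 1)

/* Extract the (assumed) GT index from a config-id. */
static unsigned int hyp_config_gt(uint64_t config_id)
{
	return (unsigned int)(config_id >> HYP_GT_SHIFT);
}

/* Extract the (assumed) per-GT error number from a config-id. */
static uint64_t hyp_config_error(uint64_t config_id)
{
	return config_id & HYP_ERR_MASK;
}

/* Build a config-id from a GT index and an error number. */
static uint64_t hyp_make_config(unsigned int gt, uint64_t error)
{
	assert(gt < 16);	/* only 4 bits are assumed for the GT index */
	return ((uint64_t)gt << HYP_GT_SHIFT) | (error & HYP_ERR_MASK);
}
```

Under this assumption, the id 0x1000000000000005 from the listing
(error-gt1-correctable-eu-grf) splits into GT 1 and error 0x5, the same
error number that error-gt0-correctable-eu-grf carries on GT 0.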
* [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
@ 2023-05-26 16:20 ` Aravind Iddamsetty
  2023-06-04 17:07   ` [Intel-xe] " Tomer Tayar
  2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw)
  To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay

Define the netlink commands and attributes that can be commonly used
across by drm drivers.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
---
 include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 include/uapi/drm/drm_netlink.h

diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
new file mode 100644
index 000000000000..28e7a334d0c7
--- /dev/null
+++ b/include/uapi/drm/drm_netlink.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright 2023 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _DRM_NETLINK_H_
+#define _DRM_NETLINK_H_
+
+#include <linux/netdevice.h>
+#include <net/genetlink.h>
+#include <net/sock.h>
+
+#define DRM_GENL_VERSION 1
+
+enum error_cmds {
+	DRM_CMD_UNSPEC,
+	/* command to list all errors names with config-id */
+	DRM_CMD_QUERY,
+	/* command to get a counter for a specific error */
+	DRM_CMD_READ_ONE,
+	/* command to get counters of all errors */
+	DRM_CMD_READ_ALL,
+
+	__DRM_CMD_MAX,
+	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
+};
+
+enum error_attr {
+	DRM_ATTR_UNSPEC,
+	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
+	DRM_ATTR_REQUEST, /* NLA_U8 */
+	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
+	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
+	DRM_ATTR_ERROR_ID, /* NLA_U64 */
+	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
+
+	__DRM_ATTR_MAX,
+	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
+};
+
+/* attribute policies */
+static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
+	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
+	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
+};
+
+#endif
--
2.25.1

^ permalink raw reply related	[flat|nested] 20+ messages in thread
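
For readers less familiar with the "__X_MAX, X_MAX = __X_MAX - 1" idiom
used by both enums in the patch above, here is a small standalone
sketch. The two enums are copied from the patch; the hyp_cmd_names table
and hyp_cmd_name() helper are hypothetical additions for illustration
and do not exist in the series. The point is that X_MAX is always the
highest valid value, so an array of X_MAX + 1 entries can be indexed
directly by command or attribute number.

```c
#include <assert.h>
#include <stddef.h>

enum error_cmds {
	DRM_CMD_UNSPEC,
	DRM_CMD_QUERY,
	DRM_CMD_READ_ONE,
	DRM_CMD_READ_ALL,

	__DRM_CMD_MAX,			/* one past the last real command */
	DRM_CMD_MAX = __DRM_CMD_MAX - 1,	/* highest valid command */
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,	/* alias, does not consume a value */
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/* The *_MAX constants track the last real enumerator automatically. */
static_assert(DRM_CMD_MAX == DRM_CMD_READ_ALL, "MAX is the last command");
static_assert(DRM_ATTR_MAX == DRM_ATTR_ERROR_VALUE, "MAX is the last attr");

/* Hypothetical lookup table, sized so every valid command is an index. */
static const char *const hyp_cmd_names[DRM_CMD_MAX + 1] = {
	[DRM_CMD_UNSPEC]   = "unspec",
	[DRM_CMD_QUERY]    = "query",
	[DRM_CMD_READ_ONE] = "read-one",
	[DRM_CMD_READ_ALL] = "read-all",
};

/* Return the name for a command, or NULL if it is out of range. */
static const char *hyp_cmd_name(int cmd)
{
	if (cmd < 0 || cmd > DRM_CMD_MAX)
		return NULL;
	return hyp_cmd_names[cmd];
}
```

Adding a new command before __DRM_CMD_MAX grows DRM_CMD_MAX and the
table size together, which is why the policy arrays in the patch are
declared with DRM_ATTR_MAX + 1 entries.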
* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
@ 2023-06-04 17:07   ` Tomer Tayar
  2023-06-05 17:18     ` Iddamsetty, Aravind
  0 siblings, 1 reply; 20+ messages in thread
From: Tomer Tayar @ 2023-06-04 17:07 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 26/05/2023 19:20, Aravind Iddamsetty wrote:
> Define the netlink commands and attributes that can be commonly used
> across by drm drivers.
>
> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> ---
>  include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
> new file mode 100644
> index 000000000000..28e7a334d0c7
> --- /dev/null
> +++ b/include/uapi/drm/drm_netlink.h
> @@ -0,0 +1,68 @@
> +/*
> + * Copyright 2023 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> + * OTHER DEALINGS IN THE SOFTWARE.
> + */
> +
> +#ifndef _DRM_NETLINK_H_
> +#define _DRM_NETLINK_H_
> +
> +#include <linux/netdevice.h>
> +#include <net/genetlink.h>
> +#include <net/sock.h>

This is a uapi header.
Are all header files here available for user?
Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?

> +
> +#define DRM_GENL_VERSION 1
> +
> +enum error_cmds {
> +	DRM_CMD_UNSPEC,
> +	/* command to list all errors names with config-id */
> +	DRM_CMD_QUERY,
> +	/* command to get a counter for a specific error */
> +	DRM_CMD_READ_ONE,
> +	/* command to get counters of all errors */
> +	DRM_CMD_READ_ALL,
> +
> +	__DRM_CMD_MAX,
> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
> +};
> +
> +enum error_attr {
> +	DRM_ATTR_UNSPEC,
> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
> +	DRM_ATTR_REQUEST, /* NLA_U8 */
> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/

Missing spaces in /*NLA_NESTED*/

> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
> +
> +	__DRM_ATTR_MAX,
> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
> +};
> +
> +/* attribute policies */
> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
> +};

Should these policy structures be in uapi?

> +
> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
> +};

I might be missing something here, but why is it not a single policy
structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?

Thanks,
Tomer

> +
> +#endif

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-06-04 17:07   ` [Intel-xe] " Tomer Tayar
@ 2023-06-05 17:18     ` Iddamsetty, Aravind
  2023-06-06 14:04       ` Tomer Tayar
  0 siblings, 1 reply; 20+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-05 17:18 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 04-06-2023 22:37, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Define the netlink commands and attributes that can be commonly used
>> across by drm drivers.
>>
>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> ---
>>  include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 68 insertions(+)
>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>
>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>> new file mode 100644
>> index 000000000000..28e7a334d0c7
>> --- /dev/null
>> +++ b/include/uapi/drm/drm_netlink.h
>> @@ -0,0 +1,68 @@
>> +/*
>> + * Copyright 2023 Intel Corporation
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a
>> + * copy of this software and associated documentation files (the "Software"),
>> + * to deal in the Software without restriction, including without limitation
>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>> + * and/or sell copies of the Software, and to permit persons to whom the
>> + * Software is furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice (including the next
>> + * paragraph) shall be included in all copies or substantial portions of the
>> + * Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>> + * OTHER DEALINGS IN THE SOFTWARE.
>> + */
>> +
>> +#ifndef _DRM_NETLINK_H_
>> +#define _DRM_NETLINK_H_
>> +
>> +#include <linux/netdevice.h>
>> +#include <net/genetlink.h>
>> +#include <net/sock.h>
>
> This is a uapi header.
> Are all header files here available for user?

No, they are not. I later came to know that a uapi header should not
include anything that userspace can't use, so I will split the header
into two.

> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?

Yes, will add that.

>
>> +
>> +#define DRM_GENL_VERSION 1
>> +
>> +enum error_cmds {
>> +	DRM_CMD_UNSPEC,
>> +	/* command to list all errors names with config-id */
>> +	DRM_CMD_QUERY,
>> +	/* command to get a counter for a specific error */
>> +	DRM_CMD_READ_ONE,
>> +	/* command to get counters of all errors */
>> +	DRM_CMD_READ_ALL,
>> +
>> +	__DRM_CMD_MAX,
>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>> +};
>> +
>> +enum error_attr {
>> +	DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>> +	DRM_ATTR_REQUEST, /* NLA_U8 */
>> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>
> Missing spaces in /*NLA_NESTED*/
>
>> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
>> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
>> +
>> +	__DRM_ATTR_MAX,
>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>> +};
>> +
>> +/* attribute policies */
>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
>> +};
>
> Should these policy structures be in uapi?

Yes, these will also likely move into a separate drm header, as
userspace would define its own policy.

>
>> +
>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
>> +};
>
> I might be missing something here, but why is it not a single policy
> structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?

Each command can have its own policy defined, i.e. we define only the
attributes that command expects, which also gives us a check on the
input. In the present implementation, DRM_CMD_QUERY and DRM_CMD_READ_ALL
expect only DRM_ATTR_REQUEST, while DRM_CMD_READ_ONE expects only
DRM_ATTR_ERROR_ID as part of the incoming message from the user.

Thanks,
Aravind.

>
> Thanks,
> Tomer
>
>> +
>> +#endif
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread
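
To illustrate the per-command policy idea described in the reply above,
here is a userspace mock, not kernel code. The mock_* types and
mock_attr_allowed() are hypothetical stand-ins for struct nla_policy and
the kernel's attribute validation; the real validator's handling of
undeclared attributes depends on the validation flags in use, so treat
the "reject anything not listed" rule below as an assumption of this
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

/* Mock stand-ins for the kernel's NLA_* types and struct nla_policy. */
enum mock_type { MOCK_UNSPEC, MOCK_U8, MOCK_U64 };
struct mock_policy { enum mock_type type; };

/* QUERY and READ_ALL: only DRM_ATTR_REQUEST is declared. */
static const struct mock_policy mock_policy_query[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_REQUEST] = { .type = MOCK_U8 },
};

/* READ_ONE: only DRM_ATTR_ERROR_ID is declared. */
static const struct mock_policy mock_policy_read_one[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_ERROR_ID] = { .type = MOCK_U64 },
};

/*
 * Return true if @attr is declared in @policy. Entries left out of the
 * designated initializer are zero, i.e. MOCK_UNSPEC, and are rejected
 * here (a strictness assumption made by this mock).
 */
static bool mock_attr_allowed(const struct mock_policy *policy,
			      int maxtype, int attr)
{
	assert(policy != NULL);
	if (attr <= 0 || attr > maxtype)
		return false;
	return policy[attr].type != MOCK_UNSPEC;
}
```

With separate tables, a READ_ONE request carrying DRM_ATTR_REQUEST fails
the check, and a QUERY request carrying DRM_ATTR_ERROR_ID fails it too,
which is the per-command check the reply refers to.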
* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure
  2023-06-05 17:18     ` Iddamsetty, Aravind
@ 2023-06-06 14:04       ` Tomer Tayar
  2023-06-21  6:40         ` Iddamsetty, Aravind
  0 siblings, 1 reply; 20+ messages in thread
From: Tomer Tayar @ 2023-06-06 14:04 UTC (permalink / raw)
  To: Iddamsetty, Aravind, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 05/06/2023 20:18, Iddamsetty, Aravind wrote:
>
> On 04-06-2023 22:37, Tomer Tayar wrote:
>> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>>> Define the netlink commands and attributes that can be commonly used
>>> across by drm drivers.
>>>
>>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>> ---
>>>  include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++
>>>  1 file changed, 68 insertions(+)
>>>  create mode 100644 include/uapi/drm/drm_netlink.h
>>>
>>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>>> new file mode 100644
>>> index 000000000000..28e7a334d0c7
>>> --- /dev/null
>>> +++ b/include/uapi/drm/drm_netlink.h
>>> @@ -0,0 +1,68 @@
>>> +/*
>>> + * Copyright 2023 Intel Corporation
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
>>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
>>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
>>> + * OTHER DEALINGS IN THE SOFTWARE.
>>> + */
>>> +
>>> +#ifndef _DRM_NETLINK_H_
>>> +#define _DRM_NETLINK_H_
>>> +
>>> +#include <linux/netdevice.h>
>>> +#include <net/genetlink.h>
>>> +#include <net/sock.h>
>> This is a uapi header.
>> Are all header files here available for user?
> No, they are not. I later came to know that a uapi header should not
> include anything that userspace can't use, so I will split the header
> into two.
>> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."?
> Yes, will add that.
>>> +
>>> +#define DRM_GENL_VERSION 1
>>> +
>>> +enum error_cmds {
>>> +	DRM_CMD_UNSPEC,
>>> +	/* command to list all errors names with config-id */
>>> +	DRM_CMD_QUERY,
>>> +	/* command to get a counter for a specific error */
>>> +	DRM_CMD_READ_ONE,
>>> +	/* command to get counters of all errors */
>>> +	DRM_CMD_READ_ALL,
>>> +
>>> +	__DRM_CMD_MAX,
>>> +	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>>> +};
>>> +
>>> +enum error_attr {
>>> +	DRM_ATTR_UNSPEC,
>>> +	DRM_ATTR_PAD = DRM_ATTR_UNSPEC,
>>> +	DRM_ATTR_REQUEST, /* NLA_U8 */
>>> +	DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/
>> Missing spaces in /*NLA_NESTED*/
>>
>>> +	DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */
>>> +	DRM_ATTR_ERROR_ID, /* NLA_U64 */
>>> +	DRM_ATTR_ERROR_VALUE, /* NLA_U64 */
>>> +
>>> +	__DRM_ATTR_MAX,
>>> +	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
>>> +};
>>> +
>>> +/* attribute policies */
>>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_ATTR_REQUEST] = { .type = NLA_U8 },
>>> +};
>> Should these policy structures be in uapi?
> Yes, these will also likely move into a separate drm header, as
> userspace would define its own policy.
>>> +
>>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = {
>>> +	[DRM_ATTR_ERROR_ID] = { .type = NLA_U64 },
>>> +};
>> I might be missing something here, but why is it not a single policy
>> structure with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID?
> Each command can have its own policy defined, i.e. we define only the
> attributes that command expects, which also gives us a check on the
> input. In the present implementation, DRM_CMD_QUERY and DRM_CMD_READ_ALL
> expect only DRM_ATTR_REQUEST, while DRM_CMD_READ_ONE expects only
> DRM_ATTR_ERROR_ID as part of the incoming message from the user.
>
> Thanks,
> Aravind.

But "struct genl_ops" expects a pointer to "struct nla_policy", and in
the definition of "xe_genl_ops", each entry is set with a pointer to
these arrays of "struct nla_policy".
Won't they use the first entry (DRM_ATTR_UNSPEC) of the arrays?
Shouldn't we use the array entries at indices DRM_ATTR_REQUEST and
DRM_ATTR_ERROR_ID there?
If so, can't we have a single policy array, in which each entry defines
the policy for one attribute, and use the suitable policy entry for
each command?

Thanks,
Tomer

>> Thanks,
>> Tomer
>>
>>> +
>>> +#endif
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread
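
On the pointer question raised above: in C, assigning an array to a
pointer field stores the address of element 0, but whoever consumes the
pointer can still index it by attribute number. As far as I know the
netlink attribute validator does exactly that (it looks up
policy[type] for each received attribute), so pointing an op at the
array base does not mean only the DRM_ATTR_UNSPEC entry is used. The
mock below is a userspace sketch of that lookup together with the
single shared table suggested in the question. struct mock_ops and all
mock_* names are hypothetical and deliberately simplified; they are not
the kernel's struct genl_ops or the series' xe_genl_ops.

```c
#include <assert.h>
#include <stddef.h>

enum error_cmds {
	DRM_CMD_UNSPEC,
	DRM_CMD_QUERY,
	DRM_CMD_READ_ONE,
	DRM_CMD_READ_ALL,
};

enum error_attr {
	DRM_ATTR_UNSPEC,
	DRM_ATTR_REQUEST,
	DRM_ATTR_QUERY_REPLY,
	DRM_ATTR_ERROR_NAME,
	DRM_ATTR_ERROR_ID,
	DRM_ATTR_ERROR_VALUE,

	__DRM_ATTR_MAX,
	DRM_ATTR_MAX = __DRM_ATTR_MAX - 1,
};

enum mock_type { MOCK_UNSPEC, MOCK_U8, MOCK_U64 };
struct mock_policy { enum mock_type type; };

/* Simplified op descriptor: a command plus a pointer to its policy table. */
struct mock_ops {
	int cmd;
	const struct mock_policy *policy;	/* points at element 0 */
	int maxattr;
};

/* One shared table: each entry describes one attribute, for all commands. */
static const struct mock_policy mock_policy_shared[DRM_ATTR_MAX + 1] = {
	[DRM_ATTR_REQUEST]  = { .type = MOCK_U8 },
	[DRM_ATTR_ERROR_ID] = { .type = MOCK_U64 },
};

static const struct mock_ops mock_ops_table[] = {
	{ .cmd = DRM_CMD_QUERY,    .policy = mock_policy_shared, .maxattr = DRM_ATTR_MAX },
	{ .cmd = DRM_CMD_READ_ONE, .policy = mock_policy_shared, .maxattr = DRM_ATTR_MAX },
	{ .cmd = DRM_CMD_READ_ALL, .policy = mock_policy_shared, .maxattr = DRM_ATTR_MAX },
};

/* Look up the type expected for @attr by indexing through the base pointer. */
static enum mock_type mock_expected_type(const struct mock_ops *ops, int attr)
{
	assert(ops != NULL && ops->policy != NULL);
	if (attr < 0 || attr > ops->maxattr)
		return MOCK_UNSPEC;
	return ops->policy[attr].type;
}
```

The trade-off in this sketch: a shared table type-checks every attribute
the same way for all commands, but it can no longer tell a command which
attributes it should refuse, which is what the per-command tables in the
patch are aiming for.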
* Re: [Intel-xe] [RFC 1/5] drm/netlink: Add netlink infrastructure 2023-06-06 14:04 ` Tomer Tayar @ 2023-06-21 6:40 ` Iddamsetty, Aravind 0 siblings, 0 replies; 20+ messages in thread From: Iddamsetty, Aravind @ 2023-06-21 6:40 UTC (permalink / raw) To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay On 06-06-2023 19:34, Tomer Tayar wrote: > On 05/06/2023 20:18, Iddamsetty, Aravind wrote: >> >> On 04-06-2023 22:37, Tomer Tayar wrote: >>> On 26/05/2023 19:20, Aravind Iddamsetty wrote: >>>> Define the netlink commands and attributes that can be commonly used >>>> across by drm drivers. >>>> >>>> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> >>>> --- >>>> include/uapi/drm/drm_netlink.h | 68 ++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 68 insertions(+) >>>> create mode 100644 include/uapi/drm/drm_netlink.h >>>> >>>> diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h >>>> new file mode 100644 >>>> index 000000000000..28e7a334d0c7 >>>> --- /dev/null >>>> +++ b/include/uapi/drm/drm_netlink.h >>>> @@ -0,0 +1,68 @@ >>>> +/* >>>> + * Copyright 2023 Intel Corporation >>>> + * >>>> + * Permission is hereby granted, free of charge, to any person obtaining a >>>> + * copy of this software and associated documentation files (the "Software"), >>>> + * to deal in the Software without restriction, including without limitation >>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense, >>>> + * and/or sell copies of the Software, and to permit persons to whom the >>>> + * Software is furnished to do so, subject to the following conditions: >>>> + * >>>> + * The above copyright notice and this permission notice (including the next >>>> + * paragraph) shall be included in all copies or substantial portions of the >>>> + * Software. 
>>>> + * >>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR >>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, >>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL >>>> + * VA LINUX SYSTEMS AND/OR ITS SUPPLIERS BE LIABLE FOR ANY CLAIM, DAMAGES OR >>>> + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, >>>> + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR >>>> + * OTHER DEALINGS IN THE SOFTWARE. >>>> + */ >>>> + >>>> +#ifndef _DRM_NETLINK_H_ >>>> +#define _DRM_NETLINK_H_ >>>> + >>>> +#include <linux/netdevice.h> >>>> +#include <net/genetlink.h> >>>> +#include <net/sock.h> >>> This is a uapi header. >>> Are all header files here available for user? >> no they are not, I later came to know that we should not have any of >> that user can't use so will split the header into 2. >>> Also, should we add here "#if defined(__cplusplus) extern "C" { ..."? >> ya will add that >>>> + >>>> +#define DRM_GENL_VERSION 1 >>>> + >>>> +enum error_cmds { >>>> + DRM_CMD_UNSPEC, >>>> + /* command to list all errors names with config-id */ >>>> + DRM_CMD_QUERY, >>>> + /* command to get a counter for a specific error */ >>>> + DRM_CMD_READ_ONE, >>>> + /* command to get counters of all errors */ >>>> + DRM_CMD_READ_ALL, >>>> + >>>> + __DRM_CMD_MAX, >>>> + DRM_CMD_MAX = __DRM_CMD_MAX - 1, >>>> +}; >>>> + >>>> +enum error_attr { >>>> + DRM_ATTR_UNSPEC, >>>> + DRM_ATTR_PAD = DRM_ATTR_UNSPEC, >>>> + DRM_ATTR_REQUEST, /* NLA_U8 */ >>>> + DRM_ATTR_QUERY_REPLY, /*NLA_NESTED*/ >>> Missing spaces in /*NLA_NESTED*/ >>> >>>> + DRM_ATTR_ERROR_NAME, /* NLA_NUL_STRING */ >>>> + DRM_ATTR_ERROR_ID, /* NLA_U64 */ >>>> + DRM_ATTR_ERROR_VALUE, /* NLA_U64 */ >>>> + >>>> + __DRM_ATTR_MAX, >>>> + DRM_ATTR_MAX = __DRM_ATTR_MAX - 1, >>>> +}; >>>> + >>>> +/* attribute policies */ >>>> +static const struct nla_policy drm_attr_policy_query[DRM_ATTR_MAX + 1] = { >>>> + 
[DRM_ATTR_REQUEST] = { .type = NLA_U8 }, >>>> +}; >>> Should these policies structures be in uapi? >> so ya these will also likely move into a separate drm header as >> userspace would define their own policy. >>>> + >>>> +static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = { >>>> + [DRM_ATTR_ERROR_ID] = { .type = NLA_U64 }, >>>> +}; >>> I might miss something here, but why is it not a single policy structure >>> with entries for DRM_ATTR_REQUEST and DRM_ATTR_ERROR_ID? >> so each command can have its own policy defined, i.e. what attributes it >> expects we could define only those, that way we can have a check as >> well. So, in the present implementation DRM_CMD_QUERY and >> DRM_CMD_READ_ALL expect only DRM_ATTR_REQUEST while DRM_CMD_READ_ONE >> expects only DRM_ATTR_ERROR_ID as part of the incoming message from user. >> >> Thanks, >> Aravind. > > But "struct genl_ops" expects a pointer to "struct nla_policy", and in > the definition of "xe_genl_ops", each entry is set with a pointer to > these arrays of "struct nla_policy". > Won't they use the first entry (DRM_ATTR_UNSPEC) of the arrays? > Shouldn't we use there the array entries at indices DRM_ATTR_REQUEST and > DRM_ATTR_ERROR_ID? > If yes, then can't we have a single policy array, where each entry defines a > policy per attribute, and use the suitable policy entry for each > command? Hi Tomer, apologies for my late reply. A command can accept more than one attribute, so the genl netlink core validates each attribute passed in the received message by checking it against the policy array given in the command definition. Thanks, Aravind. > > Thanks, > Tomer > >>> Thanks, >>> Tomer >>> >>>> + >>>> +#endif >>> >
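To make the exchange above concrete, here is a minimal userspace sketch (not kernel code) of how a per-command policy array is consulted. The array named in `.policy` decays to a pointer to element 0, but the validator indexes it by each attribute's type, so the DRM_ATTR_UNSPEC slot is not what ends up being applied. Every `demo_*` name and the type codes are assumptions made for illustration only; this model also rejects any attribute that has no policy entry, which is a simplification of what the kernel does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's NLA_* type codes. */
enum demo_nla_type { DEMO_NLA_UNSPEC, DEMO_NLA_U8, DEMO_NLA_U64 };

/* Same shape as the attribute enum in the patch, renamed to mark it as a demo. */
enum demo_attr {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_REQUEST,
	DEMO_ATTR_QUERY_REPLY,
	DEMO_ATTR_ERROR_NAME,
	DEMO_ATTR_ERROR_ID,
	DEMO_ATTR_ERROR_VALUE,
	__DEMO_ATTR_MAX,
	DEMO_ATTR_MAX = __DEMO_ATTR_MAX - 1,
};

struct demo_policy { enum demo_nla_type type; };

/* One attribute as it would appear in an incoming message. */
struct demo_attr_hdr { int attr; enum demo_nla_type type; };

/* Per-command policies: entries not listed stay zero, i.e. DEMO_NLA_UNSPEC. */
static const struct demo_policy demo_policy_query[DEMO_ATTR_MAX + 1] = {
	[DEMO_ATTR_REQUEST] = { .type = DEMO_NLA_U8 },
};

static const struct demo_policy demo_policy_read_one[DEMO_ATTR_MAX + 1] = {
	[DEMO_ATTR_ERROR_ID] = { .type = DEMO_NLA_U64 },
};

/*
 * Check every attribute of a message against one command's policy.
 * 'policy' points at element 0 of the array, and is indexed by the
 * attribute type of each incoming attribute.
 */
static bool demo_validate(const struct demo_policy *policy,
			  const struct demo_attr_hdr *attrs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int a = attrs[i].attr;

		if (a <= DEMO_ATTR_UNSPEC || a > DEMO_ATTR_MAX)
			return false;	/* unknown attribute type */
		if (policy[a].type == DEMO_NLA_UNSPEC)
			return false;	/* this command does not expect it */
		if (policy[a].type != attrs[i].type)
			return false;	/* wrong payload type */
	}
	return true;
}
```

With this model, a message carrying only a 64-bit DEMO_ATTR_ERROR_ID passes against `demo_policy_read_one` and fails against `demo_policy_query`. A single merged array, as Tomer suggests, would still type-check each attribute but could no longer tell which attributes belong to which command.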
* [RFC 2/5] drm/xe/RAS: Register a genl netlink family 2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty 2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty @ 2023-05-26 16:20 ` Aravind Iddamsetty 2023-06-04 17:09 ` [Intel-xe] " Tomer Tayar 2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty ` (5 subsequent siblings) 7 siblings, 1 reply; 20+ messages in thread From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw) To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay Use the generic netlink(genl) subsystem to expose the RAS counters to userspace. We define a family structure and operations and register to genl subsystem and these callbacks will be invoked by genl subsystem when userspace sends a registered command with a family identifier to genl subsystem. Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> --- drivers/gpu/drm/xe/Makefile | 1 + drivers/gpu/drm/xe/xe_device.c | 3 + drivers/gpu/drm/xe/xe_device_types.h | 2 + drivers/gpu/drm/xe/xe_module.c | 2 + drivers/gpu/drm/xe/xe_netlink.c | 89 ++++++++++++++++++++++++++++ drivers/gpu/drm/xe/xe_netlink.h | 14 +++++ 6 files changed, 111 insertions(+) create mode 100644 drivers/gpu/drm/xe/xe_netlink.c create mode 100644 drivers/gpu/drm/xe/xe_netlink.h diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile index b84e191ba14f..2b42165bc824 100644 --- a/drivers/gpu/drm/xe/Makefile +++ b/drivers/gpu/drm/xe/Makefile @@ -67,6 +67,7 @@ xe-y += xe_bb.o \ xe_mmio.o \ xe_mocs.o \ xe_module.o \ + xe_netlink.o \ xe_pat.o \ xe_pci.o \ xe_pcode.o \ diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c index 323356a44e7f..aa12ef12d9dc 100644 --- a/drivers/gpu/drm/xe/xe_device.c +++ b/drivers/gpu/drm/xe/xe_device.c @@ -24,6 +24,7 @@ #include "xe_irq.h" #include "xe_mmio.h" #include "xe_module.h" +#include "xe_netlink.h" #include 
"xe_pcode.h" #include "xe_pm.h" #include "xe_query.h" @@ -317,6 +318,8 @@ int xe_device_probe(struct xe_device *xe) xe_display_register(xe); + xe_genl_register(xe); + xe_debugfs_register(xe); err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe); diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h index 682ebdd1c09e..c9612a54c48f 100644 --- a/drivers/gpu/drm/xe/xe_device_types.h +++ b/drivers/gpu/drm/xe/xe_device_types.h @@ -10,6 +10,7 @@ #include <drm/drm_device.h> #include <drm/drm_file.h> +#include <drm/drm_netlink.h> #include <drm/ttm/ttm_device.h> #include "xe_gt_types.h" @@ -347,6 +348,7 @@ struct xe_device { u32 lvds_channel_mode; } params; #endif + struct genl_family xe_genl_family; }; /** diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c index 6860586ce7f8..1eb73eb9a9a5 100644 --- a/drivers/gpu/drm/xe/xe_module.c +++ b/drivers/gpu/drm/xe/xe_module.c @@ -11,6 +11,7 @@ #include "xe_drv.h" #include "xe_hw_fence.h" #include "xe_module.h" +#include "xe_netlink.h" #include "xe_pci.h" #include "xe_sched_job.h" @@ -67,6 +68,7 @@ static void __exit xe_exit(void) { int i; + xe_genl_cleanup(); xe_unregister_pci_driver(); for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--) diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c new file mode 100644 index 000000000000..63ef238ebc27 --- /dev/null +++ b/drivers/gpu/drm/xe/xe_netlink.c @@ -0,0 +1,89 @@ +// SPDX-License-Identifier: MIT +/* + * Copyright © 2023 Intel Corporation + */ + +#include <drm/drm_managed.h> + +#include "xe_device.h" + +DEFINE_XARRAY(xe_xarray); + +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info) +{ + return 0; +} + +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info) +{ + return 0; +} + +/* operations definition */ +static const struct genl_ops xe_genl_ops[] = { + { + .cmd = DRM_CMD_QUERY, + .doit = xe_genl_list_errors, + .policy = drm_attr_policy_query, + 
}, + { + .cmd = DRM_CMD_READ_ONE, + .doit = xe_genl_read_error, + .policy = drm_attr_policy_read_one, + }, + { + .cmd = DRM_CMD_READ_ALL, + .doit = xe_genl_list_errors, + .policy = drm_attr_policy_query, + }, +}; + +static void xe_genl_deregister(struct drm_device *dev, void *arg) +{ + struct xe_device *xe = arg; + + xa_erase(&xe_xarray, xe->xe_genl_family.id); + + drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name); + + genl_unregister_family(&xe->xe_genl_family); +} + +static void xe_genl_family_init(struct xe_device *xe) +{ + /* Use drm primary node name eg: card0 to name the genl family */ + snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name); + xe->xe_genl_family.version = DRM_GENL_VERSION; + xe->xe_genl_family.parallel_ops = true; + xe->xe_genl_family.ops = xe_genl_ops; + xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops); + xe->xe_genl_family.maxattr = DRM_ATTR_MAX; + xe->xe_genl_family.module = THIS_MODULE; +} + +int xe_genl_register(struct xe_device *xe) +{ + int ret; + + xe_genl_family_init(xe); + + ret = genl_register_family(&xe->xe_genl_family); + if (ret < 0) { + drm_warn(&xe->drm, "xe genl family registration failed\n"); + return ret; + } + + drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name); + + xa_store(&xe_xarray, xe->xe_genl_family.id, xe, GFP_KERNEL); + + ret = drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe); + + return ret; +} + +void xe_genl_cleanup(void) +{ + /* destroy xarray */ + xa_destroy(&xe_xarray); +} diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h new file mode 100644 index 000000000000..3bbddb620539 --- /dev/null +++ b/drivers/gpu/drm/xe/xe_netlink.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright © 2021 Intel Corporation + */ + +#ifndef _XE_GENL_H_ +#define _XE_GENL_H_ + +#include "xe_device.h" + +int xe_genl_register(struct 
xe_device *xe); +void xe_genl_cleanup(void); + +#endif -- 2.25.1
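The patch above names the genl family after the DRM primary node (e.g. card0) using snprintf() into the family's fixed-size name buffer. The following userspace sketch shows that bounded copy and how a truncated name could be detected from snprintf()'s return value. `demo_family` and `demo_family_set_name()` are hypothetical names; the 16-byte size mirrors GENL_NAMSIZ from the kernel's genetlink UAPI header.

```c
#include <stdbool.h>
#include <stdio.h>

/* Same size as GENL_NAMSIZ: 16 bytes including the terminating NUL. */
#define DEMO_GENL_NAMSIZ 16

/* Hypothetical stand-in for the name field of struct genl_family. */
struct demo_family {
	char name[DEMO_GENL_NAMSIZ];
};

/*
 * Copy the device node name into the family name.
 * snprintf() never writes past the buffer and always NUL-terminates;
 * it returns the length the full string would have needed, so a
 * return value >= the buffer size means the name was truncated.
 */
static bool demo_family_set_name(struct demo_family *fam, const char *node_name)
{
	int n = snprintf(fam->name, sizeof(fam->name), "%s", node_name);

	return n >= 0 && (size_t)n < sizeof(fam->name);
}
```

A name like "card0" fits comfortably. The check matters mostly because two long node names that share their first 15 characters would otherwise collide silently after truncation.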
* Re: [Intel-xe] [RFC 2/5] drm/xe/RAS: Register a genl netlink family 2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty @ 2023-06-04 17:09 ` Tomer Tayar 2023-06-05 17:21 ` Iddamsetty, Aravind 0 siblings, 1 reply; 20+ messages in thread From: Tomer Tayar @ 2023-06-04 17:09 UTC (permalink / raw) To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay On 26/05/2023 19:20, Aravind Iddamsetty wrote: > Use the generic netlink(genl) subsystem to expose the RAS counters to > userspace. We define a family structure and operations and register to > genl subsystem and these callbacks will be invoked by genl subsystem when > userspace sends a registered command with a family identifier to genl > subsystem. > > Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> > --- > drivers/gpu/drm/xe/Makefile | 1 + > drivers/gpu/drm/xe/xe_device.c | 3 + > drivers/gpu/drm/xe/xe_device_types.h | 2 + > drivers/gpu/drm/xe/xe_module.c | 2 + > drivers/gpu/drm/xe/xe_netlink.c | 89 ++++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_netlink.h | 14 +++++ > 6 files changed, 111 insertions(+) > create mode 100644 drivers/gpu/drm/xe/xe_netlink.c > create mode 100644 drivers/gpu/drm/xe/xe_netlink.h > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile > index b84e191ba14f..2b42165bc824 100644 > --- a/drivers/gpu/drm/xe/Makefile > +++ b/drivers/gpu/drm/xe/Makefile > @@ -67,6 +67,7 @@ xe-y += xe_bb.o \ > xe_mmio.o \ > xe_mocs.o \ > xe_module.o \ > + xe_netlink.o \ > xe_pat.o \ > xe_pci.o \ > xe_pcode.o \ > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c > index 323356a44e7f..aa12ef12d9dc 100644 > --- a/drivers/gpu/drm/xe/xe_device.c > +++ b/drivers/gpu/drm/xe/xe_device.c > @@ -24,6 +24,7 @@ > #include "xe_irq.h" > #include "xe_mmio.h" > #include "xe_module.h" > +#include "xe_netlink.h" > #include "xe_pcode.h" > #include "xe_pm.h" > #include "xe_query.h" > @@ -317,6 +318,8 @@ 
int xe_device_probe(struct xe_device *xe) > > xe_display_register(xe); > > + xe_genl_register(xe); xe_genl_register() can fail, and its return value is not checked here. > + > xe_debugfs_register(xe); > > err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe); > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h > index 682ebdd1c09e..c9612a54c48f 100644 > --- a/drivers/gpu/drm/xe/xe_device_types.h > +++ b/drivers/gpu/drm/xe/xe_device_types.h > @@ -10,6 +10,7 @@ > > #include <drm/drm_device.h> > #include <drm/drm_file.h> > +#include <drm/drm_netlink.h> > #include <drm/ttm/ttm_device.h> > > #include "xe_gt_types.h" > @@ -347,6 +348,7 @@ struct xe_device { > u32 lvds_channel_mode; > } params; > #endif > + struct genl_family xe_genl_family; Should it be added above, before the "private" section? Maybe add a kernel-doc comment for it? > }; > > /** > diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c > index 6860586ce7f8..1eb73eb9a9a5 100644 > --- a/drivers/gpu/drm/xe/xe_module.c > +++ b/drivers/gpu/drm/xe/xe_module.c > @@ -11,6 +11,7 @@ > #include "xe_drv.h" > #include "xe_hw_fence.h" > #include "xe_module.h" > +#include "xe_netlink.h" > #include "xe_pci.h" > #include "xe_sched_job.h" > > @@ -67,6 +68,7 @@ static void __exit xe_exit(void) > { > int i; > > + xe_genl_cleanup(); > xe_unregister_pci_driver(); > > for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--) > diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c > new file mode 100644 > index 000000000000..63ef238ebc27 > --- /dev/null > +++ b/drivers/gpu/drm/xe/xe_netlink.c > @@ -0,0 +1,89 @@ > +// SPDX-License-Identifier: MIT > +/* > + * Copyright © 2023 Intel Corporation > + */ > + > +#include <drm/drm_managed.h> > + > +#include "xe_device.h" > + > +DEFINE_XARRAY(xe_xarray); xe_xarray sounds too generic. Maybe it should be more specific like xe_genl_xarray? In addition, it should probably be static.
Thanks, Tomer > + > +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info) > +{ > + return 0; > +} > + > +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info) > +{ > + return 0; > +} > + > +/* operations definition */ > +static const struct genl_ops xe_genl_ops[] = { > + { > + .cmd = DRM_CMD_QUERY, > + .doit = xe_genl_list_errors, > + .policy = drm_attr_policy_query, > + }, > + { > + .cmd = DRM_CMD_READ_ONE, > + .doit = xe_genl_read_error, > + .policy = drm_attr_policy_read_one, > + }, > + { > + .cmd = DRM_CMD_READ_ALL, > + .doit = xe_genl_list_errors, > + .policy = drm_attr_policy_query, > + }, > +}; > + > +static void xe_genl_deregister(struct drm_device *dev, void *arg) > +{ > + struct xe_device *xe = arg; > + > + xa_erase(&xe_xarray, xe->xe_genl_family.id); > + > + drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name); > + > + genl_unregister_family(&xe->xe_genl_family); > +} > + > +static void xe_genl_family_init(struct xe_device *xe) > +{ > + /* Use drm primary node name eg: card0 to name the genl family */ > + snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name); > + xe->xe_genl_family.version = DRM_GENL_VERSION; > + xe->xe_genl_family.parallel_ops = true; > + xe->xe_genl_family.ops = xe_genl_ops; > + xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops); > + xe->xe_genl_family.maxattr = DRM_ATTR_MAX; > + xe->xe_genl_family.module = THIS_MODULE; > +} > + > +int xe_genl_register(struct xe_device *xe) > +{ > + int ret; > + > + xe_genl_family_init(xe); > + > + ret = genl_register_family(&xe->xe_genl_family); > + if (ret < 0) { > + drm_warn(&xe->drm, "xe genl family registration failed\n"); > + return ret; > + } > + > + drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name); > + > + xa_store(&xe_xarray, xe->xe_genl_family.id, xe, GFP_KERNEL); > + > + ret = 
drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe); > + > + return ret; > +} > + > +void xe_genl_cleanup(void) > +{ > + /* destroy xarray */ > + xa_destroy(&xe_xarray); > +} > diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h > new file mode 100644 > index 000000000000..3bbddb620539 > --- /dev/null > +++ b/drivers/gpu/drm/xe/xe_netlink.h > @@ -0,0 +1,14 @@ > +/* SPDX-License-Identifier: MIT */ > +/* > + * Copyright © 2021 Intel Corporation > + */ > + > +#ifndef _XE_GENL_H_ > +#define _XE_GENL_H_ > + > +#include "xe_device.h" > + > +int xe_genl_register(struct xe_device *xe); > +void xe_genl_cleanup(void); > + > +#endif
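As a rough illustration of the review comment on the xarray, the sketch below models a file-local (static) registry that maps a genl family id to its device, with store, lookup and erase steps matching the xa_store()/xa_erase() calls in the patch. This is a plain fixed-size table in userspace, not the kernel XArray API; all `demo_*` names are hypothetical, and the assumption that the map exists so that command handlers can find the device for a given family id is mine, since this patch only stores and erases entries.

```c
#include <stddef.h>

#define DEMO_MAX_FAMILIES 8

/* Hypothetical stand-in for struct xe_device. */
struct demo_device {
	const char *name;
};

struct demo_slot {
	int id;				/* genl family id used as the key */
	struct demo_device *dev;	/* NULL marks a free slot */
};

/* 'static' keeps the registry private to this file, as the review suggests. */
static struct demo_slot demo_genl_registry[DEMO_MAX_FAMILIES];

/* Record the device under its family id; returns -1 when the table is full. */
static int demo_registry_store(int id, struct demo_device *dev)
{
	for (size_t i = 0; i < DEMO_MAX_FAMILIES; i++) {
		if (!demo_genl_registry[i].dev) {
			demo_genl_registry[i].id = id;
			demo_genl_registry[i].dev = dev;
			return 0;
		}
	}
	return -1;
}

/* Find the device registered under a family id, or NULL if there is none. */
static struct demo_device *demo_registry_load(int id)
{
	for (size_t i = 0; i < DEMO_MAX_FAMILIES; i++) {
		if (demo_genl_registry[i].dev && demo_genl_registry[i].id == id)
			return demo_genl_registry[i].dev;
	}
	return NULL;
}

/* Drop the entry for a family id, mirroring the erase done on deregistration. */
static void demo_registry_erase(int id)
{
	for (size_t i = 0; i < DEMO_MAX_FAMILIES; i++) {
		if (demo_genl_registry[i].dev && demo_genl_registry[i].id == id)
			demo_genl_registry[i].dev = NULL;
	}
}
```

A more specific name such as the suggested xe_genl_xarray serves the same goal as the `demo_genl_` prefix here: the symbol says what it indexes, and static linkage keeps it out of the global namespace.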
* Re: [Intel-xe] [RFC 2/5] drm/xe/RAS: Register a genl netlink family 2023-06-04 17:09 ` [Intel-xe] " Tomer Tayar @ 2023-06-05 17:21 ` Iddamsetty, Aravind 0 siblings, 0 replies; 20+ messages in thread From: Iddamsetty, Aravind @ 2023-06-05 17:21 UTC (permalink / raw) To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay On 04-06-2023 22:39, Tomer Tayar wrote: > On 26/05/2023 19:20, Aravind Iddamsetty wrote: >> Use the generic netlink(genl) subsystem to expose the RAS counters to >> userspace. We define a family structure and operations and register to >> genl subsystem and these callbacks will be invoked by genl subsystem when >> userspace sends a registered command with a family identifier to genl >> subsystem. >> >> Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> >> --- >> drivers/gpu/drm/xe/Makefile | 1 + >> drivers/gpu/drm/xe/xe_device.c | 3 + >> drivers/gpu/drm/xe/xe_device_types.h | 2 + >> drivers/gpu/drm/xe/xe_module.c | 2 + >> drivers/gpu/drm/xe/xe_netlink.c | 89 ++++++++++++++++++++++++++++ >> drivers/gpu/drm/xe/xe_netlink.h | 14 +++++ >> 6 files changed, 111 insertions(+) >> create mode 100644 drivers/gpu/drm/xe/xe_netlink.c >> create mode 100644 drivers/gpu/drm/xe/xe_netlink.h >> >> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile >> index b84e191ba14f..2b42165bc824 100644 >> --- a/drivers/gpu/drm/xe/Makefile >> +++ b/drivers/gpu/drm/xe/Makefile >> @@ -67,6 +67,7 @@ xe-y += xe_bb.o \ >> xe_mmio.o \ >> xe_mocs.o \ >> xe_module.o \ >> + xe_netlink.o \ >> xe_pat.o \ >> xe_pci.o \ >> xe_pcode.o \ >> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c >> index 323356a44e7f..aa12ef12d9dc 100644 >> --- a/drivers/gpu/drm/xe/xe_device.c >> +++ b/drivers/gpu/drm/xe/xe_device.c >> @@ -24,6 +24,7 @@ >> #include "xe_irq.h" >> #include "xe_mmio.h" >> #include "xe_module.h" >> +#include "xe_netlink.h" >> #include "xe_pcode.h" >> #include "xe_pm.h" >> #include "xe_query.h" >> @@ -317,6 
+318,8 @@ int xe_device_probe(struct xe_device *xe) >> >> xe_display_register(xe); >> >> + xe_genl_register(xe); > > xe_genl_register() can fail That is right, but I didn't want to fail the driver load: a registration failure does not affect any device functionality, it only means the observability interface is unavailable. Hence only the warning "xe genl family registration failed" is printed. > >> + >> xe_debugfs_register(xe); >> >> err = drmm_add_action_or_reset(&xe->drm, xe_device_sanitize, xe); >> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h >> index 682ebdd1c09e..c9612a54c48f 100644 >> --- a/drivers/gpu/drm/xe/xe_device_types.h >> +++ b/drivers/gpu/drm/xe/xe_device_types.h >> @@ -10,6 +10,7 @@ >> >> #include <drm/drm_device.h> >> #include <drm/drm_file.h> >> +#include <drm/drm_netlink.h> >> #include <drm/ttm/ttm_device.h> >> >> #include "xe_gt_types.h" >> @@ -347,6 +348,7 @@ struct xe_device { >> u32 lvds_channel_mode; >> } params; >> #endif >> + struct genl_family xe_genl_family; > > Should it be added above, before the "private" section? > Maybe add a kernel-doc comment for it? Thanks for pointing that out, I will move it there.
> >> }; >> >> /** >> diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c >> index 6860586ce7f8..1eb73eb9a9a5 100644 >> --- a/drivers/gpu/drm/xe/xe_module.c >> +++ b/drivers/gpu/drm/xe/xe_module.c >> @@ -11,6 +11,7 @@ >> #include "xe_drv.h" >> #include "xe_hw_fence.h" >> #include "xe_module.h" >> +#include "xe_netlink.h" >> #include "xe_pci.h" >> #include "xe_sched_job.h" >> >> @@ -67,6 +68,7 @@ static void __exit xe_exit(void) >> { >> int i; >> >> + xe_genl_cleanup(); >> xe_unregister_pci_driver(); >> >> for (i = ARRAY_SIZE(init_funcs) - 1; i >= 0; i--) >> diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c >> new file mode 100644 >> index 000000000000..63ef238ebc27 >> --- /dev/null >> +++ b/drivers/gpu/drm/xe/xe_netlink.c >> @@ -0,0 +1,89 @@ >> +// SPDX-License-Identifier: MIT >> +/* >> + * Copyright © 2023 Intel Corporation >> + */ >> + >> +#include <drm/drm_managed.h> >> + >> +#include "xe_device.h" >> + >> +DEFINE_XARRAY(xe_xarray); > > xe_array sounds too generic. Maybe it should be more specific like > xe_genl_xarray? > In addition, it should be probably static. Ok. Thanks, Aravind. 
> > Thanks, > Tomer > >> + >> +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info) >> +{ >> + return 0; >> +} >> + >> +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info) >> +{ >> + return 0; >> +} >> + >> +/* operations definition */ >> +static const struct genl_ops xe_genl_ops[] = { >> + { >> + .cmd = DRM_CMD_QUERY, >> + .doit = xe_genl_list_errors, >> + .policy = drm_attr_policy_query, >> + }, >> + { >> + .cmd = DRM_CMD_READ_ONE, >> + .doit = xe_genl_read_error, >> + .policy = drm_attr_policy_read_one, >> + }, >> + { >> + .cmd = DRM_CMD_READ_ALL, >> + .doit = xe_genl_list_errors, >> + .policy = drm_attr_policy_query, >> + }, >> +}; >> + >> +static void xe_genl_deregister(struct drm_device *dev, void *arg) >> +{ >> + struct xe_device *xe = arg; >> + >> + xa_erase(&xe_xarray, xe->xe_genl_family.id); >> + >> + drm_dbg_driver(&xe->drm, "unregistering genl family %s\n", xe->xe_genl_family.name); >> + >> + genl_unregister_family(&xe->xe_genl_family); >> +} >> + >> +static void xe_genl_family_init(struct xe_device *xe) >> +{ >> + /* Use drm primary node name eg: card0 to name the genl family */ >> + snprintf(xe->xe_genl_family.name, sizeof(xe->xe_genl_family.name), "%s", xe->drm.primary->kdev->kobj.name); >> + xe->xe_genl_family.version = DRM_GENL_VERSION; >> + xe->xe_genl_family.parallel_ops = true; >> + xe->xe_genl_family.ops = xe_genl_ops; >> + xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops); >> + xe->xe_genl_family.maxattr = DRM_ATTR_MAX; >> + xe->xe_genl_family.module = THIS_MODULE; >> +} >> + >> +int xe_genl_register(struct xe_device *xe) >> +{ >> + int ret; >> + >> + xe_genl_family_init(xe); >> + >> + ret = genl_register_family(&xe->xe_genl_family); >> + if (ret < 0) { >> + drm_warn(&xe->drm, "xe genl family registration failed\n"); >> + return ret; >> + } >> + >> + drm_dbg_driver(&xe->drm, "genl family id %d and name %s\n", xe->xe_genl_family.id, xe->xe_genl_family.name); >> + >> + xa_store(&xe_xarray, 
xe->xe_genl_family.id, xe, GFP_KERNEL); >> + >> + ret = drmm_add_action_or_reset(&xe->drm, xe_genl_deregister, xe); >> + >> + return ret; >> +} >> + >> +void xe_genl_cleanup(void) >> +{ >> + /* destroy xarray */ >> + xa_destroy(&xe_xarray); >> +} >> diff --git a/drivers/gpu/drm/xe/xe_netlink.h b/drivers/gpu/drm/xe/xe_netlink.h >> new file mode 100644 >> index 000000000000..3bbddb620539 >> --- /dev/null >> +++ b/drivers/gpu/drm/xe/xe_netlink.h >> @@ -0,0 +1,14 @@ >> +/* SPDX-License-Identifier: MIT */ >> +/* >> + * Copyright © 2021 Intel Corporation >> + */ >> + >> +#ifndef _XE_GENL_H_ >> +#define _XE_GENL_H_ >> + >> +#include "xe_device.h" >> + >> +int xe_genl_register(struct xe_device *xe); >> +void xe_genl_cleanup(void); >> + >> +#endif > >
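The point discussed in this reply, that a failed genl registration should not fail the driver load, can be sketched as a probe path that consumes the return value explicitly instead of dropping it. This is a userspace model under stated assumptions: `demo_device`, `demo_genl_register()` and `demo_probe()` are hypothetical, and the `simulated_err` parameter exists only so the failure path can be exercised.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct xe_device. */
struct demo_device {
	bool genl_registered;
};

/* Stand-in for xe_genl_register(); a non-zero simulated_err forces a failure. */
static int demo_genl_register(struct demo_device *dev, int simulated_err)
{
	if (simulated_err)
		return simulated_err;

	dev->genl_registered = true;
	return 0;
}

/*
 * Probe continues even if the observability interface could not be
 * registered: the error is logged and deliberately not propagated.
 */
static int demo_probe(struct demo_device *dev, int simulated_genl_err)
{
	int err = demo_genl_register(dev, simulated_genl_err);

	if (err)
		fprintf(stderr,
			"genl registration failed (%d), continuing without it\n",
			err);

	/* The remaining probe steps would run here regardless. */
	return 0;
}
```

Checking the value at the call site, even when the outcome is only a log line, makes the "non-fatal by design" decision visible to later readers, which is the concern behind the review comment.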
* [RFC 3/5] drm/xe/RAS: Expose the error counters 2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty 2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty 2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty @ 2023-05-26 16:20 ` Aravind Iddamsetty 2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty ` (4 subsequent siblings) 7 siblings, 0 replies; 20+ messages in thread From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw) To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay We expose the various error counters supported by the hardware to userspace via the genl subsystem, through the registered commands. DRM_CMD_QUERY lists the error names with their config ids, DRM_CMD_READ_ONE returns the counter value for the requested config id, and DRM_CMD_READ_ALL lists the counters for all errors along with their names and config ids.
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> --- drivers/gpu/drm/xe/xe_netlink.c | 439 +++++++++++++++++++++++++++++++- include/uapi/drm/xe_drm.h | 64 +++++ 2 files changed, 501 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c index 63ef238ebc27..2a6965f5cde9 100644 --- a/drivers/gpu/drm/xe/xe_netlink.c +++ b/drivers/gpu/drm/xe/xe_netlink.c @@ -4,19 +4,451 @@ */ #include <drm/drm_managed.h> +#include <drm/xe_drm.h> #include "xe_device.h" +#define MAX_ERROR_NAME 50 + +#define HAS_GT_ERROR_VECTORS(xe) ((xe)->info.has_gt_error_vectors) +#define HAS_MEM_SPARING_SUPPORT(xe) ((xe)->info.has_mem_sparing) + DEFINE_XARRAY(xe_xarray); -static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info) +static const char * const xe_hw_error_events[] = { + [XE_GT_ERROR_CORRECTABLE_L3_SNG] = "correctable-l3-sng", + [XE_GT_ERROR_CORRECTABLE_GUC] = "correctable-guc", + [XE_GT_ERROR_CORRECTABLE_SAMPLER] = "correctable-sampler", + [XE_GT_ERROR_CORRECTABLE_SLM] = "correctable-slm", + [XE_GT_ERROR_CORRECTABLE_EU_IC] = "correctable-eu-ic", + [XE_GT_ERROR_CORRECTABLE_EU_GRF] = "correctable-eu-grf", + [XE_GT_ERROR_FATAL_ARR_BIST] = "fatal-array-bist", + [XE_GT_ERROR_FATAL_L3_DOUB] = "fatal-l3-double", + [XE_GT_ERROR_FATAL_L3_ECC_CHK] = "fatal-l3-ecc-checker", + [XE_GT_ERROR_FATAL_GUC] = "fatal-guc", + [XE_GT_ERROR_FATAL_IDI_PAR] = "fatal-idi-parity", + [XE_GT_ERROR_FATAL_SQIDI] = "fatal-sqidi", + [XE_GT_ERROR_FATAL_SAMPLER] = "fatal-sampler", + [XE_GT_ERROR_FATAL_SLM] = "fatal-slm", + [XE_GT_ERROR_FATAL_EU_IC] = "fatal-eu-ic", + [XE_GT_ERROR_FATAL_EU_GRF] = "fatal-eu-grf", + [XE_GT_ERROR_FATAL_FPU] = "fatal-fpu", + [XE_GT_ERROR_FATAL_TLB] = "fatal-tlb", + [XE_GT_ERROR_FATAL_L3_FABRIC] = "fatal-l3-fabric", + [XE_GT_ERROR_CORRECTABLE_SUBSLICE] = "correctable-subslice", + [XE_GT_ERROR_CORRECTABLE_L3BANK] = "correctable-l3bank", + [XE_GT_ERROR_FATAL_SUBSLICE] = "fatal-subslice", + 
[XE_GT_ERROR_FATAL_L3BANK] = "fatal-l3bank", + [XE_SGUNIT_ERROR_CORRECTABLE] = "sgunit-correctable", + [XE_SGUNIT_ERROR_NONFATAL] = "sgunit-nonfatal", + [XE_SGUNIT_ERROR_FATAL] = "sgunit-fatal", + [XE_SOC_ERROR_FATAL_PSF_CSC_0] = "soc-fatal-psf-csc-0", + [XE_SOC_ERROR_FATAL_PSF_CSC_1] = "soc-fatal-psf-csc-1", + [XE_SOC_ERROR_FATAL_PSF_CSC_2] = "soc-fatal-psf-csc-2", + [XE_SOC_ERROR_FATAL_PUNIT] = "soc-fatal-punit", + [XE_PVC_SOC_ERROR_FATAL_PSF_0] = "soc-fatal-psf-0", + [XE_PVC_SOC_ERROR_FATAL_PSF_1] = "soc-fatal-psf-1", + [XE_PVC_SOC_ERROR_FATAL_PSF_2] = "soc-fatal-psf-2", + [XE_PVC_SOC_ERROR_FATAL_CD0] = "soc-fatal-cd0", + [XE_PVC_SOC_ERROR_FATAL_CD0_MDFI] = "soc-fatal-cd0-mdfi", + [XE_PVC_SOC_ERROR_FATAL_MDFI_EAST] = "soc-fatal-mdfi-east", + [XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH] = "soc-fatal-mdfi-south", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 0)] = "soc-fatal-hbm-ss0-0", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 1)] = "soc-fatal-hbm-ss0-1", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 2)] = "soc-fatal-hbm-ss0-2", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 3)] = "soc-fatal-hbm-ss0-3", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 4)] = "soc-fatal-hbm-ss0-4", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 5)] = "soc-fatal-hbm-ss0-5", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 6)] = "soc-fatal-hbm-ss0-6", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 7)] = "soc-fatal-hbm-ss0-7", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 8)] = "soc-fatal-hbm-ss1-0", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 9)] = "soc-fatal-hbm-ss1-1", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 10)] = "soc-fatal-hbm-ss1-2", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 11)] = "soc-fatal-hbm-ss1-3", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 12)] = "soc-fatal-hbm-ss1-4", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 13)] = "soc-fatal-hbm-ss1-5", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 14)] = "soc-fatal-hbm-ss1-6", + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 15)] = "soc-fatal-hbm-ss1-7", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 0)] = "soc-fatal-hbm-ss2-0", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 1)] = "soc-fatal-hbm-ss2-1", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 2)] = 
"soc-fatal-hbm-ss2-2", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 3)] = "soc-fatal-hbm-ss2-3", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 4)] = "soc-fatal-hbm-ss2-4", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 5)] = "soc-fatal-hbm-ss2-5", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 6)] = "soc-fatal-hbm-ss2-6", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 7)] = "soc-fatal-hbm-ss2-7", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 8)] = "soc-fatal-hbm-ss3-0", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 9)] = "soc-fatal-hbm-ss3-1", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 10)] = "soc-fatal-hbm-ss3-2", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 11)] = "soc-fatal-hbm-ss3-3", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 12)] = "soc-fatal-hbm-ss3-4", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 13)] = "soc-fatal-hbm-ss3-5", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 14)] = "soc-fatal-hbm-ss3-6", + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 15)] = "soc-fatal-hbm-ss3-7", + [XE_GSC_ERROR_CORRECTABLE_SRAM_ECC] = "gsc-correctable-sram-ecc", + [XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = "gsc-nonfatal-mia-shutdown", + [XE_GSC_ERROR_NONFATAL_MIA_INT] = "gsc-nonfatal-mia-int", + [XE_GSC_ERROR_NONFATAL_SRAM_ECC] = "gsc-nonfatal-sram-ecc", + [XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = "gsc-nonfatal-wdg-timeout", + [XE_GSC_ERROR_NONFATAL_ROM_PARITY] = "gsc-nonfatal-rom-parity", + [XE_GSC_ERROR_NONFATAL_UCODE_PARITY] = "gsc-nonfatal-ucode-parity", + [XE_GSC_ERROR_NONFATAL_GLITCH_DET] = "gsc-nonfatal-glitch-det", + [XE_GSC_ERROR_NONFATAL_FUSE_PULL] = "gsc-nonfatal-fuse-pull", + [XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = "gsc-nonfatal-fuse-crc-check", + [XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST] = "gsc-nonfatal-selfmbist", + [XE_GSC_ERROR_NONFATAL_AON_PARITY] = "gsc-nonfatal-aon-parity", +}; + +static const unsigned long xe_hw_error_map[] = { + [XE_GT_ERROR_CORRECTABLE_L3_SNG] = INTEL_GT_HW_ERROR_COR_L3_SNG, + [XE_GT_ERROR_CORRECTABLE_GUC] = INTEL_GT_HW_ERROR_COR_GUC, + [XE_GT_ERROR_CORRECTABLE_SAMPLER] = INTEL_GT_HW_ERROR_COR_SAMPLER, + [XE_GT_ERROR_CORRECTABLE_SLM] = INTEL_GT_HW_ERROR_COR_SLM, + [XE_GT_ERROR_CORRECTABLE_EU_IC] = 
INTEL_GT_HW_ERROR_COR_EU_IC, + [XE_GT_ERROR_CORRECTABLE_EU_GRF] = INTEL_GT_HW_ERROR_COR_EU_GRF, + [XE_GT_ERROR_FATAL_ARR_BIST] = INTEL_GT_HW_ERROR_FAT_ARR_BIST, + [XE_GT_ERROR_FATAL_L3_DOUB] = INTEL_GT_HW_ERROR_FAT_L3_DOUB, + [XE_GT_ERROR_FATAL_L3_ECC_CHK] = INTEL_GT_HW_ERROR_FAT_L3_ECC_CHK, + [XE_GT_ERROR_FATAL_GUC] = INTEL_GT_HW_ERROR_FAT_GUC, + [XE_GT_ERROR_FATAL_IDI_PAR] = INTEL_GT_HW_ERROR_FAT_IDI_PAR, + [XE_GT_ERROR_FATAL_SQIDI] = INTEL_GT_HW_ERROR_FAT_SQIDI, + [XE_GT_ERROR_FATAL_SAMPLER] = INTEL_GT_HW_ERROR_FAT_SAMPLER, + [XE_GT_ERROR_FATAL_SLM] = INTEL_GT_HW_ERROR_FAT_SLM, + [XE_GT_ERROR_FATAL_EU_IC] = INTEL_GT_HW_ERROR_FAT_EU_IC, + [XE_GT_ERROR_FATAL_EU_GRF] = INTEL_GT_HW_ERROR_FAT_EU_GRF, + [XE_GT_ERROR_FATAL_FPU] = INTEL_GT_HW_ERROR_FAT_FPU, + [XE_GT_ERROR_FATAL_TLB] = INTEL_GT_HW_ERROR_FAT_TLB, + [XE_GT_ERROR_FATAL_L3_FABRIC] = INTEL_GT_HW_ERROR_FAT_L3_FABRIC, + [XE_GT_ERROR_CORRECTABLE_SUBSLICE] = INTEL_GT_HW_ERROR_COR_SUBSLICE, + [XE_GT_ERROR_CORRECTABLE_L3BANK] = INTEL_GT_HW_ERROR_COR_L3BANK, + [XE_GT_ERROR_FATAL_SUBSLICE] = INTEL_GT_HW_ERROR_FAT_SUBSLICE, + [XE_GT_ERROR_FATAL_L3BANK] = INTEL_GT_HW_ERROR_FAT_L3BANK, + [XE_SGUNIT_ERROR_CORRECTABLE] = HARDWARE_ERROR_CORRECTABLE, + [XE_SGUNIT_ERROR_NONFATAL] = HARDWARE_ERROR_NONFATAL, + [XE_SGUNIT_ERROR_FATAL] = HARDWARE_ERROR_FATAL, + [XE_SOC_ERROR_FATAL_PSF_CSC_0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_0), + [XE_SOC_ERROR_FATAL_PSF_CSC_1] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_1), + [XE_SOC_ERROR_FATAL_PSF_CSC_2] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, SOC_PSF_CSC_2), + [XE_SOC_ERROR_FATAL_PUNIT] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_PUNIT), + [XE_PVC_SOC_ERROR_FATAL_PSF_0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_0), + [XE_PVC_SOC_ERROR_FATAL_PSF_1] = 
SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_1), + [XE_PVC_SOC_ERROR_FATAL_PSF_2] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_PSF_2), + [XE_PVC_SOC_ERROR_FATAL_CD0] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0), + [XE_PVC_SOC_ERROR_FATAL_CD0_MDFI] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_CD0_MDFI), + [XE_PVC_SOC_ERROR_FATAL_MDFI_EAST] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_EAST), + [XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_LOCAL, HARDWARE_ERROR_FATAL, PVC_SOC_MDFI_SOUTH), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 0)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_0), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 1)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_1), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 2)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_2), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 3)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_3), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 4)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_4), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 5)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_5), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 6)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_6), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 7)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, SOC_HBM_SS0_7), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 8)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_0), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 9)] = 
SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_1), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 10)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_2), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 11)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_3), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 12)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_4), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 13)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_5), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 14)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_6), + [XE_PVC_SOC_ERROR_FATAL_HBM(0, 15)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH0, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS1_7), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 0)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_0), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 1)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_1), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 2)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_2), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 3)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_3), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 4)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_4), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 5)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_5), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 6)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS2_6), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 7)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, 
PVC_SOC_HBM_SS2_7), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 8)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_0), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 9)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_1), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 10)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_2), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 11)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_3), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 12)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_4), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 13)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_5), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 14)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_6), + [XE_PVC_SOC_ERROR_FATAL_HBM(1, 15)] = SOC_ERR_INDEX(INTEL_GT_SOC_IEH1, INTEL_SOC_REG_GLOBAL, HARDWARE_ERROR_FATAL, PVC_SOC_HBM_SS3_7), + [XE_GSC_ERROR_CORRECTABLE_SRAM_ECC] = INTEL_GSC_HW_ERROR_COR_SRAM_ECC, + [XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN] = INTEL_GSC_HW_ERROR_UNCOR_MIA_SHUTDOWN, + [XE_GSC_ERROR_NONFATAL_MIA_INT] = INTEL_GSC_HW_ERROR_UNCOR_MIA_INT, + [XE_GSC_ERROR_NONFATAL_SRAM_ECC] = INTEL_GSC_HW_ERROR_UNCOR_SRAM_ECC, + [XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT] = INTEL_GSC_HW_ERROR_UNCOR_WDG_TIMEOUT, + [XE_GSC_ERROR_NONFATAL_ROM_PARITY] = INTEL_GSC_HW_ERROR_UNCOR_ROM_PARITY, + [XE_GSC_ERROR_NONFATAL_UCODE_PARITY] = INTEL_GSC_HW_ERROR_UNCOR_UCODE_PARITY, + [XE_GSC_ERROR_NONFATAL_GLITCH_DET] = INTEL_GSC_HW_ERROR_UNCOR_GLITCH_DET, + [XE_GSC_ERROR_NONFATAL_FUSE_PULL] = INTEL_GSC_HW_ERROR_UNCOR_FUSE_PULL, + [XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK] = INTEL_GSC_HW_ERROR_UNCOR_FUSE_CRC_CHECK, + [XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST] = INTEL_GSC_HW_ERROR_UNCOR_SELFMBIST, + [XE_GSC_ERROR_NONFATAL_AON_PARITY] = 
INTEL_GSC_HW_ERROR_UNCOR_AON_PARITY, +}; + +static unsigned int config_gt_id(const u64 config) +{ + return config >> __XE_GT_SHIFT; +} + +static u64 config_counter(const u64 config) +{ + return config & ~(~0ULL << __XE_GT_SHIFT); +} + +static bool is_gt_vector_error(const u64 config) { + unsigned int error; + + error = config_counter(config); + if (error >= XE_GT_ERROR_FATAL_TLB && + error <= XE_GT_ERROR_FATAL_L3BANK) + return true; + + return false; +} + +static bool is_pvc_invalid_gt_errors(const u64 config) +{ + switch (config_counter(config)) { + case XE_GT_ERROR_CORRECTABLE_L3_SNG: + case XE_GT_ERROR_CORRECTABLE_SAMPLER: + case XE_GT_ERROR_FATAL_ARR_BIST: + case XE_GT_ERROR_FATAL_L3_DOUB: + case XE_GT_ERROR_FATAL_L3_ECC_CHK: + case XE_GT_ERROR_FATAL_IDI_PAR: + case XE_GT_ERROR_FATAL_SQIDI: + case XE_GT_ERROR_FATAL_SAMPLER: + case XE_GT_ERROR_FATAL_EU_IC: + return true; + default: + return false; + } +} + +static bool is_gsc_hw_error(const u64 config) +{ + if (config_counter(config) >= XE_GSC_ERROR_CORRECTABLE_SRAM_ECC && + config_counter(config) <= XE_GSC_ERROR_NONFATAL_AON_PARITY) + return true; + + return false; +} + +static bool is_soc_error(const u64 config) +{ + if (config_counter(config) >= XE_SOC_ERROR_FATAL_PSF_CSC_0 && + config_counter(config) <= XE_PVC_SOC_ERROR_FATAL_HBM(1, 15)) + return true; + + return false; +} + +static int +config_status(struct xe_device *xe, u64 config) +{ + unsigned int gt_id = config_gt_id(config); + + if (!IS_DGFX(xe)) + return -ENODEV; + + if (xe->gt[gt_id].info.type == XE_GT_TYPE_UNINITIALIZED) + return -ENOENT; + + /* GSC HW ERRORS are present on root tile of + * platform supporting MEMORY SPARING only + */ + if (is_gsc_hw_error(config) && !(HAS_MEM_SPARING_SUPPORT(xe) && gt_id == 0)) + return -ENODEV; + + /* GT vectors error are valid on Platforms supporting error vectors only */ + if (is_gt_vector_error(config) && !HAS_GT_ERROR_VECTORS(xe)) + return -ENODEV; + + /* Skip gt errors not supported on pvc */ + if 
(is_pvc_invalid_gt_errors(config) && (xe->info.platform == XE_PVC)) + return -ENODEV; + + /* FATAL FPU error is valid on PVC only */ + if (config_counter(config) == XE_GT_ERROR_FATAL_FPU && + !(xe->info.platform == XE_PVC)) + return -ENODEV; + + if (is_soc_error(config) && !(xe->info.platform == XE_PVC)) + return -ENODEV; + + return (config_counter(config) >= + ARRAY_SIZE(xe_hw_error_map)) ? -ENOENT : 0; +} + +static u64 get_counter_value(struct xe_device *xe, u64 config) +{ + const unsigned int gt_id = config_gt_id(config); + unsigned int id = config_counter(config); + + if (is_soc_error(config)) + return xa_to_value(xa_load(&xe->gt[gt_id].errors.soc, xe_hw_error_map[id])); + else if (is_gsc_hw_error(config)) + return xe->gt[gt_id].errors.gsc_hw[xe_hw_error_map[id]]; + else if (id >= XE_SGUNIT_ERROR_CORRECTABLE && + id <= XE_SGUNIT_ERROR_FATAL) + return xe->gt[gt_id].errors.sgunit[xe_hw_error_map[id]]; + else + return xe->gt[gt_id].errors.hw[xe_hw_error_map[id]]; + return 0; } -static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info) +static struct xe_device *genl_to_xe(struct genl_info *info) +{ + return xa_load(&xe_xarray, info->nlhdr->nlmsg_type); +} + +static int xe_genl_send(struct sk_buff *msg, struct genl_info *info, void *usrhdr) { + int ret; + + genlmsg_end(msg, usrhdr); + + ret = genlmsg_reply(msg, info); + if (ret) + nlmsg_free(msg); + + return ret; +} + +static struct sk_buff * +xe_genl_alloc_msg(struct xe_device *xe, + struct genl_info *info, + size_t msg_size, void **usrhdr) +{ + struct sk_buff *new_msg; + + new_msg = genlmsg_new(msg_size, GFP_KERNEL); + if (!new_msg) + return new_msg; + + *usrhdr = genlmsg_put_reply(new_msg, info, &xe->xe_genl_family, 0, info->genlhdr->cmd); + if (!*usrhdr) { + nlmsg_free(new_msg); + new_msg = NULL; + } + + return new_msg; +} + +int fill_error_details(struct genl_info *info, struct sk_buff *new_msg) +{ + struct xe_device *xe = genl_to_xe(info); + struct nlattr *entry_attr; + struct xe_gt *gt; + int 
i, j; + bool counter = false; + + if (info->genlhdr->cmd == DRM_CMD_READ_ALL) + counter = true; + + entry_attr = nla_nest_start(new_msg, DRM_ATTR_QUERY_REPLY); + if (!entry_attr) + return -EMSGSIZE; + + for_each_gt(gt, xe, j) { + char str[MAX_ERROR_NAME]; + u64 val; + + for (i = 0; i < ARRAY_SIZE(xe_hw_error_events); i++) { + u64 config = XE_HW_ERROR(j, i); + + if (config_status(xe, config)) + continue; + + /* should this be cleared everytime */ + snprintf(str, sizeof(str), "error-gt%d-%s", j, xe_hw_error_events[i]); + + if (nla_put_string(new_msg, DRM_ATTR_ERROR_NAME, str)) + goto err; + if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_ID, config, DRM_ATTR_PAD)) + goto err; + if (counter) { + val = get_counter_value(xe, config); + if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) + goto err; + } + } + } + + nla_nest_end(new_msg, entry_attr); + return 0; +err: + drm_dbg_driver(&xe->drm, "msg buff is small\n"); + nla_nest_cancel(new_msg, entry_attr); + nlmsg_free(new_msg); + + return -EMSGSIZE; +} + +static int xe_genl_list_errors(struct sk_buff *msg, struct genl_info *info) +{ + struct xe_device *xe = genl_to_xe(info); + size_t msg_size = NLMSG_DEFAULT_SIZE; + struct sk_buff *new_msg; + void *usrhdr; + int ret = 0; + int retries = 2; + + if (GENL_REQ_ATTR_CHECK(info, DRM_ATTR_REQUEST)) + return -EINVAL; + + do { + new_msg = xe_genl_alloc_msg(xe, info, msg_size, &usrhdr); + if (!new_msg) + return -ENOMEM; + + ret = fill_error_details(info, new_msg); + if (!ret) + break; + + msg_size += NLMSG_DEFAULT_SIZE; + } while (retries--); + + if (!ret) + ret = xe_genl_send(new_msg, info, usrhdr); + + return ret; +} + +static int xe_genl_read_error(struct sk_buff *msg, struct genl_info *info) +{ + struct xe_device *xe = genl_to_xe(info); + size_t msg_size = NLMSG_DEFAULT_SIZE; + struct sk_buff *new_msg; + void *usrhdr; + int ret = 0; + int retries = 2; + u64 config, val; + + if (GENL_REQ_ATTR_CHECK(info, DRM_ATTR_ERROR_ID)) + return -EINVAL; + + config = 
nla_get_u64(info->attrs[DRM_ATTR_ERROR_ID]); + ret = config_status(xe, config); + if (ret) + return ret; + do { + new_msg = xe_genl_alloc_msg(xe, info, msg_size, &usrhdr); + if (!new_msg) + return -ENOMEM; + + val = get_counter_value(xe, config); + if (nla_put_u64_64bit(new_msg, DRM_ATTR_ERROR_VALUE, val, DRM_ATTR_PAD)) { + msg_size += NLMSG_DEFAULT_SIZE; + continue; + } + + break; + } while (retries--); + + ret = xe_genl_send(new_msg, info, usrhdr); + + return ret; } /* operations definition */ @@ -65,6 +497,9 @@ int xe_genl_register(struct xe_device *xe) { int ret; + BUILD_BUG_ON(ARRAY_SIZE(xe_hw_error_events) != + ARRAY_SIZE(xe_hw_error_map)); + xe_genl_family_init(xe); ret = genl_register_family(&xe->xe_genl_family); diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h index b0b80aae3ee8..a2ea238096df 100644 --- a/include/uapi/drm/xe_drm.h +++ b/include/uapi/drm/xe_drm.h @@ -801,6 +801,70 @@ struct drm_xe_vm_madvise { __u64 reserved[2]; }; +/* + * HW error IDs + */ + +#define __XE_GT_SHIFT (60) + +#define XE_HW_ERROR(gt, id) \ + ((id) | ((__u64)(gt) << __XE_GT_SHIFT)) + +#define XE_GT_ERROR_CORRECTABLE_L3_SNG (0) +#define XE_GT_ERROR_CORRECTABLE_GUC (1) +#define XE_GT_ERROR_CORRECTABLE_SAMPLER (2) +#define XE_GT_ERROR_CORRECTABLE_SLM (3) +#define XE_GT_ERROR_CORRECTABLE_EU_IC (4) +#define XE_GT_ERROR_CORRECTABLE_EU_GRF (5) +#define XE_GT_ERROR_FATAL_ARR_BIST (6) +#define XE_GT_ERROR_FATAL_L3_DOUB (7) +#define XE_GT_ERROR_FATAL_L3_ECC_CHK (8) +#define XE_GT_ERROR_FATAL_GUC (9) +#define XE_GT_ERROR_FATAL_IDI_PAR (10) +#define XE_GT_ERROR_FATAL_SQIDI (11) +#define XE_GT_ERROR_FATAL_SAMPLER (12) +#define XE_GT_ERROR_FATAL_SLM (13) +#define XE_GT_ERROR_FATAL_EU_IC (14) +#define XE_GT_ERROR_FATAL_EU_GRF (15) +#define XE_GT_ERROR_FATAL_FPU (16) +#define XE_GT_ERROR_FATAL_TLB (17) +#define XE_GT_ERROR_FATAL_L3_FABRIC (18) +#define XE_GT_ERROR_CORRECTABLE_SUBSLICE (19) +#define XE_GT_ERROR_CORRECTABLE_L3BANK (20) +#define XE_GT_ERROR_FATAL_SUBSLICE (21) 
+#define XE_GT_ERROR_FATAL_L3BANK (22) +#define XE_SGUNIT_ERROR_CORRECTABLE (23) +#define XE_SGUNIT_ERROR_NONFATAL (24) +#define XE_SGUNIT_ERROR_FATAL (25) +#define XE_SOC_ERROR_FATAL_PSF_CSC_0 (26) +#define XE_SOC_ERROR_FATAL_PSF_CSC_1 (27) +#define XE_SOC_ERROR_FATAL_PSF_CSC_2 (28) +#define XE_SOC_ERROR_FATAL_PUNIT (29) +#define XE_PVC_SOC_ERROR_FATAL_PSF_0 (30) +#define XE_PVC_SOC_ERROR_FATAL_PSF_1 (31) +#define XE_PVC_SOC_ERROR_FATAL_PSF_2 (32) +#define XE_PVC_SOC_ERROR_FATAL_CD0 (33) +#define XE_PVC_SOC_ERROR_FATAL_CD0_MDFI (34) +#define XE_PVC_SOC_ERROR_FATAL_MDFI_EAST (35) +#define XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH (36) + +#define XE_PVC_SOC_ERROR_FATAL_HBM(ss, n)\ + (XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH + 0x1 + (ss) * 0x10 + (n)) + +/* 68 is the last ID used by SOC errors */ +#define XE_GSC_ERROR_CORRECTABLE_SRAM_ECC (69) +#define XE_GSC_ERROR_NONFATAL_MIA_SHUTDOWN (70) +#define XE_GSC_ERROR_NONFATAL_MIA_INT (71) +#define XE_GSC_ERROR_NONFATAL_SRAM_ECC (72) +#define XE_GSC_ERROR_NONFATAL_WDG_TIMEOUT (73) +#define XE_GSC_ERROR_NONFATAL_ROM_PARITY (74) +#define XE_GSC_ERROR_NONFATAL_UCODE_PARITY (75) +#define XE_GSC_ERROR_NONFATAL_GLITCH_DET (76) +#define XE_GSC_ERROR_NONFATAL_FUSE_PULL (77) +#define XE_GSC_ERROR_NONFATAL_FUSE_CRC_CHECK (78) +#define XE_GSC_ERROR_NONFATAL_FUSE_SELFMBIST (79) +#define XE_GSC_ERROR_NONFATAL_AON_PARITY (80) + #if defined(__cplusplus) } #endif -- 2.25.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
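The error ID layout added to xe_drm.h above can be checked from userspace. What follows is a minimal sketch, not part of the patch: the macros are copied from the hunk (with uint64_t standing in for __u64), and sketch_gt_id() and sketch_counter() are hypothetical userspace copies of the static kernel helpers config_gt_id() and config_counter(). It only illustrates how a config value splits into a GT index (top 4 bits) and a per-GT error ID (low 60 bits).

```c
#include <stdint.h>

/* Copied from the xe_drm.h hunk; uint64_t stands in for __u64. */
#define __XE_GT_SHIFT (60)
#define XE_HW_ERROR(gt, id) \
	((id) | ((uint64_t)(gt) << __XE_GT_SHIFT))

#define XE_GT_ERROR_CORRECTABLE_EU_GRF (5)
#define XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH (36)
#define XE_PVC_SOC_ERROR_FATAL_HBM(ss, n) \
	(XE_PVC_SOC_ERROR_FATAL_MDFI_SOUTH + 0x1 + (ss) * 0x10 + (n))

/* Hypothetical userspace copy of config_gt_id(): GT index from bits 63:60. */
static unsigned int sketch_gt_id(uint64_t config)
{
	return (unsigned int)(config >> __XE_GT_SHIFT);
}

/* Hypothetical userspace copy of config_counter(): error ID from bits 59:0. */
static uint64_t sketch_counter(uint64_t config)
{
	return config & ~(~0ULL << __XE_GT_SHIFT);
}
```

With these definitions, XE_HW_ERROR(1, XE_GT_ERROR_CORRECTABLE_EU_GRF) evaluates to 0x1000000000000005, the config-id the cover letter lists for error-gt1-correctable-eu-grf, and XE_PVC_SOC_ERROR_FATAL_HBM(1, 15) evaluates to 68 (0x44), which agrees with the "68 is the last ID used by SOC errors" comment in the header.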
* [RFC 4/5] drm/netlink: define multicast groups 2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty ` (2 preceding siblings ...) 2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty @ 2023-05-26 16:20 ` Aravind Iddamsetty 2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty ` (3 subsequent siblings) 7 siblings, 0 replies; 20+ messages in thread From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw) To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay The netlink subsystem supports event notifications to userspace. We define two multicast groups, one for correctable and one for uncorrectable errors, to which userspace can subscribe in order to be notified when any of those errors happen. The group names are local to the driver's genl netlink family. Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> --- include/uapi/drm/drm_netlink.h | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h index 28e7a334d0c7..bd3a8b293979 100644 --- a/include/uapi/drm/drm_netlink.h +++ b/include/uapi/drm/drm_netlink.h @@ -29,6 +29,8 @@ #include <net/sock.h> #define DRM_GENL_VERSION 1 +#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR "drm_corr_err" +#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err" enum error_cmds { DRM_CMD_UNSPEC, @@ -38,6 +40,7 @@ enum error_cmds { DRM_CMD_READ_ONE, /* command to get counters of all errors */ DRM_CMD_READ_ALL, + DRM_CMD_ERROR_EVENT, __DRM_CMD_MAX, DRM_CMD_MAX = __DRM_CMD_MAX - 1, @@ -65,4 +68,14 @@ static const struct nla_policy drm_attr_policy_read_one[DRM_ATTR_MAX + 1] = { [DRM_ATTR_ERROR_ID] = { .type = NLA_U64 }, }; +enum mcgrps_events { + DRM_GENL_MCAST_CORR_ERR, + DRM_GENL_MCAST_UNCORR_ERR, +}; + +static const struct genl_multicast_group drm_event_mcgrps[] = { + [DRM_GENL_MCAST_CORR_ERR] = { .name = 
DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, }, + [DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, }, +}; + #endif -- 2.25.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
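The group table above ties each name to an array offset: the kernel side addresses a group by its index in drm_event_mcgrps, while userspace addresses it by name within the family. The sketch below is a plain C model of that relationship, written under stated assumptions and not taken from the patch: struct sketch_mcast_group is a hypothetical stand-in for struct genl_multicast_group, SKETCH_GENL_NAMSIZ assumes the generic netlink name limit of 16 bytes (GENL_NAMSIZ), and sketch_group_index() is an illustrative helper, not a kernel or libnl function.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: mirrors GENL_NAMSIZ (16 bytes, including the NUL terminator). */
#define SKETCH_GENL_NAMSIZ 16

/* Copied from the drm_netlink.h hunk. */
#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR "drm_corr_err"
#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err"

enum mcgrps_events {
	DRM_GENL_MCAST_CORR_ERR,
	DRM_GENL_MCAST_UNCORR_ERR,
};

/* Hypothetical stand-in for struct genl_multicast_group (name field only). */
struct sketch_mcast_group {
	char name[SKETCH_GENL_NAMSIZ];
};

static const struct sketch_mcast_group sketch_mcgrps[] = {
	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR },
	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR },
};

/* Both names, including their terminators, must fit in the name field. */
_Static_assert(sizeof(DRM_GENL_MCAST_GROUP_NAME_CORR_ERR) <= SKETCH_GENL_NAMSIZ,
	       "correctable group name too long");
_Static_assert(sizeof(DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR) <= SKETCH_GENL_NAMSIZ,
	       "uncorrectable group name too long");

/* Illustrative helper: map a group name to its offset in the table, or -1. */
static int sketch_group_index(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_mcgrps) / sizeof(sketch_mcgrps[0]); i++)
		if (strcmp(sketch_mcgrps[i].name, name) == 0)
			return (int)i;
	return -1;
}
```

A real client would not walk a table like this; it would resolve the numeric group ID for a name through the generic netlink controller and then join that group on its socket.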
* [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error 2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty ` (3 preceding siblings ...) 2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty @ 2023-05-26 16:20 ` Aravind Iddamsetty 2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar ` (2 subsequent siblings) 7 siblings, 0 replies; 20+ messages in thread From: Aravind Iddamsetty @ 2023-05-26 16:20 UTC (permalink / raw) To: intel-xe, dri-devel; +Cc: alexander.deucher, ogabbay Whenever a correctable or an uncorrectable error occurs, an event is sent to the listeners of the corresponding multicast group. Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com> --- drivers/gpu/drm/xe/xe_irq.c | 32 ++++++++++++++++++++++++++++++++ drivers/gpu/drm/xe/xe_netlink.c | 2 ++ 2 files changed, 34 insertions(+) diff --git a/drivers/gpu/drm/xe/xe_irq.c b/drivers/gpu/drm/xe/xe_irq.c index 226be96e341a..1b415c8585a4 100644 --- a/drivers/gpu/drm/xe/xe_irq.c +++ b/drivers/gpu/drm/xe/xe_irq.c @@ -1073,6 +1073,37 @@ xe_gsc_hw_error_handler(struct xe_gt *gt, const enum hardware_error hw_err) xe_mmio_write32(gt, GSC_HEC_CORR_UNCORR_ERR_STATUS(base, hw_err).reg, err_status); } +static void generate_netlink_event(struct xe_gt *gt, const enum hardware_error hw_err) +{ + struct xe_device *xe = gt->xe; + struct sk_buff *msg; + void *hdr; + + if (!xe->xe_genl_family.module) + return; + + msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC); + if (!msg) { + drm_dbg_driver(&xe->drm, "couldn't allocate memory for error multicast event\n"); + return; + } + + hdr = genlmsg_put(msg, 0, 0, &xe->xe_genl_family, 0, DRM_CMD_ERROR_EVENT); + if (!hdr) { + drm_dbg_driver(&xe->drm, "multicast msg buffer is small\n"); + nlmsg_free(msg); + return; + } + + genlmsg_end(msg, hdr); + + genlmsg_multicast(&xe->xe_genl_family, msg, 0, + hw_err ? 
+ DRM_GENL_MCAST_UNCORR_ERR + : DRM_GENL_MCAST_CORR_ERR, + GFP_ATOMIC); +} + static void xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err) { @@ -1103,6 +1134,7 @@ xe_hw_error_source_handler(struct xe_gt *gt, const enum hardware_error hw_err) xe_mmio_write32(gt, DEV_ERR_STAT_REG(hw_err).reg, errsrc); + generate_netlink_event(gt, hw_err); out_unlock: spin_unlock_irqrestore(&gt_to_xe(gt)->irq.lock, flags); } diff --git a/drivers/gpu/drm/xe/xe_netlink.c b/drivers/gpu/drm/xe/xe_netlink.c index 2a6965f5cde9..0c1d51e1a9a5 100644 --- a/drivers/gpu/drm/xe/xe_netlink.c +++ b/drivers/gpu/drm/xe/xe_netlink.c @@ -490,6 +490,8 @@ static void xe_genl_family_init(struct xe_device *xe) xe->xe_genl_family.ops = xe_genl_ops; xe->xe_genl_family.n_ops = ARRAY_SIZE(xe_genl_ops); xe->xe_genl_family.maxattr = DRM_ATTR_MAX; + xe->xe_genl_family.mcgrps = drm_event_mcgrps; + xe->xe_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps); xe->xe_genl_family.module = THIS_MODULE; } -- 2.25.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
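The group selection in generate_netlink_event() is a bare truth test on hw_err, so it depends on the correctable severity being the zero value of enum hardware_error. That enum is not shown in this patch. The sketch below assumes the ordering CORRECTABLE = 0, NONFATAL = 1, FATAL = 2, and sketch_event_group() is a hypothetical helper that isolates the same ternary so the mapping can be checked outside the kernel.

```c
/*
 * Assumption: enum hardware_error is defined elsewhere in the driver with
 * the correctable severity as 0; the values below are illustrative.
 */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL = 1,
	HARDWARE_ERROR_FATAL = 2,
};

/* Copied from the drm_netlink.h hunk in patch 4/5. */
enum mcgrps_events {
	DRM_GENL_MCAST_CORR_ERR,
	DRM_GENL_MCAST_UNCORR_ERR,
};

/*
 * Hypothetical helper: the same expression generate_netlink_event() passes
 * to genlmsg_multicast(). Any non-zero severity selects the uncorrectable
 * group; only the zero value selects the correctable group.
 */
static enum mcgrps_events sketch_event_group(enum hardware_error hw_err)
{
	return hw_err ? DRM_GENL_MCAST_UNCORR_ERR : DRM_GENL_MCAST_CORR_ERR;
}
```

Under that assumption, both non-fatal and fatal errors are delivered to the "drm_uncorr_err" group, and the event itself carries no attribute that tells them apart; a listener would have to read the counters to find out which error occurred.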
* Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem 2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty ` (4 preceding siblings ...) 2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty @ 2023-06-04 17:07 ` Tomer Tayar 2023-06-05 17:17 ` Iddamsetty, Aravind 2023-06-05 16:47 ` Alex Deucher 2023-06-21 17:24 ` Sebastian Wick 7 siblings, 1 reply; 20+ messages in thread From: Tomer Tayar @ 2023-06-04 17:07 UTC (permalink / raw) To: Aravind Iddamsetty, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay On 26/05/2023 19:20, Aravind Iddamsetty wrote: > Our hardware supports RAS(Reliability, Availability, Serviceability) by > exposing a set of error counters which can be used by observability > tools to take corrective actions or repairs. Traditionally there were > being exposed via PMU (for relative counters) and sysfs interface (for > absolute value) in our internal branch. But, due to the limitations in > this approach to use two interfaces and also not able to have an event > based reporting or configurability, an alternative approach to try > netlink was suggested by community for drm subsystem wide UAPI for RAS > and telemetry as discussed in [1]. > > This [1] is the inspiration to this series. It uses the generic > netlink(genl) family subsystem and exposes a set of commands that can > be used by every drm driver, the framework provides a means to have > custom commands too. Each drm driver instance in this example xe driver > instance registers a family and operations to the genl subsystem through > which it enumerates and reports the error counters. An event based > notification is also supported to which userpace can subscribe to and > be notified when any error occurs and read the error counter this avoids > continuous polling on error counter. 
This can also be extended to > threshold based notification. Hi Aravind, The habanalabs driver is another candidate to use this netlink-based drm framework. As a single-user device, we have an additional "control" device that allows multiple applications to query for information and to monitor the "compute" device. And while we are about to move the compute device to the accel nodes, we don't have a real replacement there for the control device. Another possible usage of this framework for habanalabs is event notification. Currently we have an eventfd-based mechanism, and after being notified about an event, the user starts querying about the event and the relevant info, usually in several requests. With this framework it should supposedly be possible to gather all relevant info together with the event itself. The current implementation seems intended more for errors (and quite "tailored" to Xe needs ...), while in habanalabs we would need it also for non-error static/dynamic info. Maybe we should revise the existing commands/attributes to be more generic? Moreover, the drm part is very small, while most of the netlink "mess" is still done by the specific driver. So what is the added value in making it a "drm framework"? Do we enforce something here for drm drivers that use it? Do we help them with simpler APIs and by hiding the internals of netlink? Maybe it would be worth moving some functionality from the Xe driver into drm helpers? Thanks, Tomer > [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html > > this series is on top of https://patchwork.freedesktop.org/series/116181/ > > Below is an example tool drm_ras which demonstrates the use of the > supported commands. 
The tool will be sent to ML with the subject > "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters" > > read single error counter: > > $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005 > counter value 0 > > read all error counters: > > $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1 > name config-id counter > > error-gt0-correctable-guc 0x0000000000000001 0 > error-gt0-correctable-slm 0x0000000000000003 0 > error-gt0-correctable-eu-ic 0x0000000000000004 0 > error-gt0-correctable-eu-grf 0x0000000000000005 0 > error-gt0-fatal-guc 0x0000000000000009 0 > error-gt0-fatal-slm 0x000000000000000d 0 > error-gt0-fatal-eu-grf 0x000000000000000f 0 > error-gt0-fatal-fpu 0x0000000000000010 0 > error-gt0-fatal-tlb 0x0000000000000011 0 > error-gt0-fatal-l3-fabric 0x0000000000000012 0 > error-gt0-correctable-subslice 0x0000000000000013 0 > error-gt0-correctable-l3bank 0x0000000000000014 0 > error-gt0-fatal-subslice 0x0000000000000015 0 > error-gt0-fatal-l3bank 0x0000000000000016 0 > error-gt0-sgunit-correctable 0x0000000000000017 0 > error-gt0-sgunit-nonfatal 0x0000000000000018 0 > error-gt0-sgunit-fatal 0x0000000000000019 0 > error-gt0-soc-fatal-psf-csc-0 0x000000000000001a 0 > error-gt0-soc-fatal-psf-csc-1 0x000000000000001b 0 > error-gt0-soc-fatal-psf-csc-2 0x000000000000001c 0 > error-gt0-soc-fatal-punit 0x000000000000001d 0 > error-gt0-soc-fatal-psf-0 0x000000000000001e 0 > error-gt0-soc-fatal-psf-1 0x000000000000001f 0 > error-gt0-soc-fatal-psf-2 0x0000000000000020 0 > error-gt0-soc-fatal-cd0 0x0000000000000021 0 > error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 0 > error-gt0-soc-fatal-mdfi-east 0x0000000000000023 0 > error-gt0-soc-fatal-mdfi-south 0x0000000000000024 0 > error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 0 > error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 0 > error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 0 > error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 0 > error-gt0-soc-fatal-hbm-ss0-4 
0x0000000000000029 0 > error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a 0 > error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b 0 > error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c 0 > error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d 0 > error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e 0 > error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f 0 > error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 0 > error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 0 > error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 0 > error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 0 > error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 0 > error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 0 > error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 0 > error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 0 > error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 0 > error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 0 > error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a 0 > error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b 0 > error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c 0 > error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d 0 > error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e 0 > error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f 0 > error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 0 > error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 0 > error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 0 > error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 0 > error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 0 > error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 0 > error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 0 > error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 0 > error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 0 > error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 0 > error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a 0 > error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b 0 > error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c 0 > error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d 0 > 
error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e 0 > error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f 0 > error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 0 > error-gt1-correctable-guc 0x1000000000000001 0 > error-gt1-correctable-slm 0x1000000000000003 0 > error-gt1-correctable-eu-ic 0x1000000000000004 0 > error-gt1-correctable-eu-grf 0x1000000000000005 0 > error-gt1-fatal-guc 0x1000000000000009 0 > error-gt1-fatal-slm 0x100000000000000d 0 > error-gt1-fatal-eu-grf 0x100000000000000f 0 > error-gt1-fatal-fpu 0x1000000000000010 0 > error-gt1-fatal-tlb 0x1000000000000011 0 > error-gt1-fatal-l3-fabric 0x1000000000000012 0 > error-gt1-correctable-subslice 0x1000000000000013 0 > error-gt1-correctable-l3bank 0x1000000000000014 0 > error-gt1-fatal-subslice 0x1000000000000015 0 > error-gt1-fatal-l3bank 0x1000000000000016 0 > error-gt1-sgunit-correctable 0x1000000000000017 0 > error-gt1-sgunit-nonfatal 0x1000000000000018 0 > error-gt1-sgunit-fatal 0x1000000000000019 0 > error-gt1-soc-fatal-psf-csc-0 0x100000000000001a 0 > error-gt1-soc-fatal-psf-csc-1 0x100000000000001b 0 > error-gt1-soc-fatal-psf-csc-2 0x100000000000001c 0 > error-gt1-soc-fatal-punit 0x100000000000001d 0 > error-gt1-soc-fatal-psf-0 0x100000000000001e 0 > error-gt1-soc-fatal-psf-1 0x100000000000001f 0 > error-gt1-soc-fatal-psf-2 0x1000000000000020 0 > error-gt1-soc-fatal-cd0 0x1000000000000021 0 > error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 0 > error-gt1-soc-fatal-mdfi-east 0x1000000000000023 0 > error-gt1-soc-fatal-mdfi-south 0x1000000000000024 0 > error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 0 > error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 0 > error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 0 > error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 0 > error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 0 > error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a 0 > error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b 0 > error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c 0 > 
error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d 0 > error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e 0 > error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f 0 > error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 0 > error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 0 > error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 0 > error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 0 > error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 0 > error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 0 > error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 0 > error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 0 > error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 0 > error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 0 > error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a 0 > error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b 0 > error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c 0 > error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d 0 > error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e 0 > error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f 0 > error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 0 > error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 0 > error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 0 > error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 0 > error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 0 > > wait on a error event: > > $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1 > waiting for error event > error event received > counter value 0 > > list all errors: > > $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1 > name config-id > > error-gt0-correctable-guc 0x0000000000000001 > error-gt0-correctable-slm 0x0000000000000003 > error-gt0-correctable-eu-ic 0x0000000000000004 > error-gt0-correctable-eu-grf 0x0000000000000005 > error-gt0-fatal-guc 0x0000000000000009 > error-gt0-fatal-slm 0x000000000000000d > error-gt0-fatal-eu-grf 0x000000000000000f > error-gt0-fatal-fpu 0x0000000000000010 > error-gt0-fatal-tlb 0x0000000000000011 > error-gt0-fatal-l3-fabric 0x0000000000000012 > error-gt0-correctable-subslice 
0x0000000000000013 > error-gt0-correctable-l3bank 0x0000000000000014 > error-gt0-fatal-subslice 0x0000000000000015 > error-gt0-fatal-l3bank 0x0000000000000016 > error-gt0-sgunit-correctable 0x0000000000000017 > error-gt0-sgunit-nonfatal 0x0000000000000018 > error-gt0-sgunit-fatal 0x0000000000000019 > error-gt0-soc-fatal-psf-csc-0 0x000000000000001a > error-gt0-soc-fatal-psf-csc-1 0x000000000000001b > error-gt0-soc-fatal-psf-csc-2 0x000000000000001c > error-gt0-soc-fatal-punit 0x000000000000001d > error-gt0-soc-fatal-psf-0 0x000000000000001e > error-gt0-soc-fatal-psf-1 0x000000000000001f > error-gt0-soc-fatal-psf-2 0x0000000000000020 > error-gt0-soc-fatal-cd0 0x0000000000000021 > error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 > error-gt0-soc-fatal-mdfi-east 0x0000000000000023 > error-gt0-soc-fatal-mdfi-south 0x0000000000000024 > error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 > error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 > error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 > error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 > error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 > error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a > error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b > error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c > error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d > error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e > error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f > error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 > error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 > error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 > error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 > error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 > error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 > error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 > error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 > error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 > error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 > error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a > error-gt0-soc-fatal-hbm-ss2-6 
0x000000000000003b > error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c > error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d > error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e > error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f > error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 > error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 > error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 > error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 > error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 > error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 > error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 > error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 > error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 > error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 > error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a > error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b > error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c > error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d > error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e > error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f > error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 > error-gt1-correctable-guc 0x1000000000000001 > error-gt1-correctable-slm 0x1000000000000003 > error-gt1-correctable-eu-ic 0x1000000000000004 > error-gt1-correctable-eu-grf 0x1000000000000005 > error-gt1-fatal-guc 0x1000000000000009 > error-gt1-fatal-slm 0x100000000000000d > error-gt1-fatal-eu-grf 0x100000000000000f > error-gt1-fatal-fpu 0x1000000000000010 > error-gt1-fatal-tlb 0x1000000000000011 > error-gt1-fatal-l3-fabric 0x1000000000000012 > error-gt1-correctable-subslice 0x1000000000000013 > error-gt1-correctable-l3bank 0x1000000000000014 > error-gt1-fatal-subslice 0x1000000000000015 > error-gt1-fatal-l3bank 0x1000000000000016 > error-gt1-sgunit-correctable 0x1000000000000017 > error-gt1-sgunit-nonfatal 0x1000000000000018 > error-gt1-sgunit-fatal 0x1000000000000019 > error-gt1-soc-fatal-psf-csc-0 0x100000000000001a > 
error-gt1-soc-fatal-psf-csc-1 0x100000000000001b > error-gt1-soc-fatal-psf-csc-2 0x100000000000001c > error-gt1-soc-fatal-punit 0x100000000000001d > error-gt1-soc-fatal-psf-0 0x100000000000001e > error-gt1-soc-fatal-psf-1 0x100000000000001f > error-gt1-soc-fatal-psf-2 0x1000000000000020 > error-gt1-soc-fatal-cd0 0x1000000000000021 > error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 > error-gt1-soc-fatal-mdfi-east 0x1000000000000023 > error-gt1-soc-fatal-mdfi-south 0x1000000000000024 > error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 > error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 > error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 > error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 > error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 > error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a > error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b > error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c > error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d > error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e > error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f > error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 > error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 > error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 > error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 > error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 > error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 > error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 > error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 > error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 > error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 > error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a > error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b > error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c > error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d > error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e > error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f > error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 > error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 > error-gt1-soc-fatal-hbm-ss3-5 
0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
>
>
> Aravind Iddamsetty (5):
>   drm/netlink: Add netlink infrastructure
>   drm/xe/RAS: Register a genl netlink family
>   drm/xe/RAS: Expose the error counters
>   drm/netlink: define multicast groups
>   drm/xe/RAS: send multicast event on occurrence of an error
>
>  drivers/gpu/drm/xe/Makefile          |   1 +
>  drivers/gpu/drm/xe/xe_device.c       |   3 +
>  drivers/gpu/drm/xe/xe_device_types.h |   2 +
>  drivers/gpu/drm/xe/xe_irq.c          |  32 ++
>  drivers/gpu/drm/xe/xe_module.c       |   2 +
>  drivers/gpu/drm/xe/xe_netlink.c      | 526 +++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_netlink.h      |  14 +
>  include/uapi/drm/drm_netlink.h       |  81 +++++
>  include/uapi/drm/xe_drm.h            |  64 ++++
>  9 files changed, 725 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
>  create mode 100644 drivers/gpu/drm/xe/xe_netlink.h
>  create mode 100644 include/uapi/drm/drm_netlink.h
>
^ permalink raw reply [flat|nested] 20+ messages in thread
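A note on the config-id values in the listings above: the gt0 entries all start with 0x0 and the gt1 entries with 0x1, so the GT index appears to sit in the top bits and the per-GT error number in the low bits. The Python sketch below decodes an id under that assumption. The 4-bit/60-bit split is inferred from the sample output only, not from the patches, and the helper names are hypothetical.

```python
# Assumption (inferred from the sample output only, not from the series):
# the top 4 bits of the 64-bit config-id hold the GT index and the
# remaining 60 bits hold the error number.
GT_SHIFT = 60
ERROR_MASK = (1 << GT_SHIFT) - 1


def split_config_id(config_id):
    """Return (gt_index, error_number) for a 64-bit config-id."""
    return config_id >> GT_SHIFT, config_id & ERROR_MASK


def make_config_id(gt, error):
    """Inverse of split_config_id under the same assumed layout."""
    return (gt << GT_SHIFT) | (error & ERROR_MASK)


if __name__ == "__main__":
    # error-gt1-correctable-eu-grf from the listing
    gt, err = split_config_id(0x1000000000000005)
    print("gt%d error 0x%x" % (gt, err))
```

If the layout holds, the same error number can be looked up on either GT by changing only the top bits, which is how the gt0 and gt1 halves of the listing line up.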
* Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
@ 2023-06-05 17:17   ` Iddamsetty, Aravind
  0 siblings, 0 replies; 20+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-05 17:17 UTC (permalink / raw)
  To: Tomer Tayar, intel-xe, dri-devel; +Cc: alexander.deucher, Oded Gabbay

On 04-06-2023 22:37, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally there were
>> being exposed via PMU (for relative counters) and sysfs interface (for
>> absolute value) in our internal branch. But, due to the limitations in
>> this approach to use two interfaces and also not able to have an event
>> based reporting or configurability, an alternative approach to try
>> netlink was suggested by community for drm subsystem wide UAPI for RAS
>> and telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters. An event based
>> notification is also supported to which userpace can subscribe to and
>> be notified when any error occurs and read the error counter this avoids
>> continuous polling on error counter. This can also be extended to
>> threshold based notification.
>
> Hi Aravind,

Hi Tomer,

Thanks a lot for your review.

>
> The habanalabs driver is another candidate to use this netlink-based drm
> framework.
> As a single-user device, we have an additional "control" device that
> allows multiple applications to query for information and to monitor the
> "compute" device.
> And while we are about to move the compute device to the accel nodes, we
> don't have a real replacement there for the control device.
>
> Another possible usage of this framework for habanalabs is the events
> notification.
> Currently we have an eventfd-based mechanism, and after being notified
> about an event, user starts querying about the event and the relevant
> info, usually in several requests.
> With this framework we should be allegedly possible to gather all
> relevant info together with the event itself.

That is right; with the multicast event we can pack data too.

>
> The current implementation seems intended more to errors (and quite
> "tailored" to Xe needs ...), while in habanalabs we would need it also
> for non-error static/dynamic info.
> Maybe we should revise the existing commands/attributes to be more generic?

Correct, at present that is the use case the xe driver has, and at least
the error part is, I believe, generic; if it is not, we can make it so,
since the framework is extensible. The idea I had was that generic
commands which every driver can use will be part of the drm framework,
and any driver-specific commands or attributes will be part of the
driver. But some thought is needed here, since userspace needs to know
MAX attributes, and we have to work out how to define the attribute
policy, etc.

>
> Moreover, the drm part is very small, while most of the netlink "mess"
> is still done by the specific driver.
> So what is the added value in making it a "drm framework"? Do we enforce
> something here for drm drivers that use it? Do we help them with simpler
> APIs and hiding the internals of netlink?
> Maybe it would be worth moving some functionality from the Xe driver
> into drm helpers?

Your suggestion sounds good and interesting, but it needs some analysis:
for example, if we move the registration parts to the drm framework, how
would we register driver-private commands and attributes, if there are
any? But yes, having most of it at the drm level helps all the drivers.
I'll do some analysis and come back on this.

Thanks,
Aravind.
>
> Thanks,
> Tomer
>
^ permalink raw reply [flat|nested] 20+ messages in thread
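On packing data into the multicast event, as discussed in the reply above: netlink attributes are TLVs (a 16-bit length that includes the 4-byte header, a 16-bit type, then the payload padded to a 4-byte boundary), so an event message can carry the error id and counter value along with the notification. The Python sketch below only illustrates that byte layout. The attribute numbers ATTR_ERROR_ID and ATTR_ERROR_VALUE are hypothetical and are not taken from the series, and the genl message header that would precede the attributes is left out.

```python
import struct

NLA_HDRLEN = 4    # struct nlattr: u16 nla_len, u16 nla_type
NLA_ALIGNTO = 4   # attributes are padded to 4-byte boundaries

# Hypothetical attribute numbers, for illustration only.
ATTR_ERROR_ID = 1
ATTR_ERROR_VALUE = 2


def nla_put(attr_type, payload):
    """Encode one attribute; nla_len covers header + payload, not padding."""
    length = NLA_HDRLEN + len(payload)
    pad = (-length) % NLA_ALIGNTO
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * pad


def nla_put_u64(attr_type, value):
    """Encode a 64-bit unsigned value in host byte order."""
    return nla_put(attr_type, struct.pack("=Q", value))


def nla_parse(buf):
    """Walk a buffer of attributes and return {type: payload bytes}."""
    attrs = {}
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            break  # malformed or truncated attribute
        attrs[attr_type] = buf[offset + NLA_HDRLEN:offset + length]
        # advance to the next 4-byte aligned attribute
        offset += (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)
    return attrs


if __name__ == "__main__":
    event = (nla_put_u64(ATTR_ERROR_ID, 0x1000000000000005)
             + nla_put_u64(ATTR_ERROR_VALUE, 3))
    for attr_type, payload in nla_parse(event).items():
        print(attr_type, hex(struct.unpack("=Q", payload)[0]))
```

With a payload like this a listener would not need a follow-up read after each notification, which is the several-requests pattern Tomer describes for the eventfd mechanism.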
* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (5 preceding siblings ...)
  2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
@ 2023-06-05 16:47 ` Alex Deucher
  2023-06-06 11:56   ` Iddamsetty, Aravind
  2023-06-21 17:24 ` Sebastian Wick
  7 siblings, 1 reply; 20+ messages in thread
From: Alex Deucher @ 2023-06-05 16:47 UTC (permalink / raw)
  To: Aravind Iddamsetty, Hawking Zhang, Harish Kasiviswanathan,
      Kuehling, Felix, Tuikov, Luben
  Cc: alexander.deucher, ogabbay, intel-xe, dri-devel

Adding the relevant AMD folks for RAS. We currently expose RAS via
sysfs, but also have an event interface in KFD which may be somewhat
similar to this. If we were to converge on a common RAS interface,
would we want to look at any commonality in bad page storage/reporting
for device memory?

Alex

On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty
<aravind.iddamsetty@intel.com> wrote:
>
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally there were
> being exposed via PMU (for relative counters) and sysfs interface (for
> absolute value) in our internal branch. But, due to the limitations in
> this approach to use two interfaces and also not able to have an event
> based reporting or configurability, an alternative approach to try
> netlink was suggested by community for drm subsystem wide UAPI for RAS
> and telemetry as discussed in [1].
>
> This [1] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too.
Each drm driver instance in this example xe driver > instance registers a family and operations to the genl subsystem through > which it enumerates and reports the error counters. An event based > notification is also supported to which userpace can subscribe to and > be notified when any error occurs and read the error counter this avoids > continuous polling on error counter. This can also be extended to > threshold based notification. > > [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html > > this series is on top of https://patchwork.freedesktop.org/series/116181/ > > Below is an example tool drm_ras which demonstrates the use of the > supported commands. The tool will be sent to ML with the subject > "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters" > > read single error counter: > > $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005 > counter value 0 > > read all error counters: > > $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1 > name config-id counter > > error-gt0-correctable-guc 0x0000000000000001 0 > error-gt0-correctable-slm 0x0000000000000003 0 > error-gt0-correctable-eu-ic 0x0000000000000004 0 > error-gt0-correctable-eu-grf 0x0000000000000005 0 > error-gt0-fatal-guc 0x0000000000000009 0 > error-gt0-fatal-slm 0x000000000000000d 0 > error-gt0-fatal-eu-grf 0x000000000000000f 0 > error-gt0-fatal-fpu 0x0000000000000010 0 > error-gt0-fatal-tlb 0x0000000000000011 0 > error-gt0-fatal-l3-fabric 0x0000000000000012 0 > error-gt0-correctable-subslice 0x0000000000000013 0 > error-gt0-correctable-l3bank 0x0000000000000014 0 > error-gt0-fatal-subslice 0x0000000000000015 0 > error-gt0-fatal-l3bank 0x0000000000000016 0 > error-gt0-sgunit-correctable 0x0000000000000017 0 > error-gt0-sgunit-nonfatal 0x0000000000000018 0 > error-gt0-sgunit-fatal 0x0000000000000019 0 > error-gt0-soc-fatal-psf-csc-0 0x000000000000001a 0 > error-gt0-soc-fatal-psf-csc-1 0x000000000000001b 0 > 
error-gt0-soc-fatal-psf-csc-2 0x000000000000001c 0 > error-gt0-soc-fatal-punit 0x000000000000001d 0 > error-gt0-soc-fatal-psf-0 0x000000000000001e 0 > error-gt0-soc-fatal-psf-1 0x000000000000001f 0 > error-gt0-soc-fatal-psf-2 0x0000000000000020 0 > error-gt0-soc-fatal-cd0 0x0000000000000021 0 > error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 0 > error-gt0-soc-fatal-mdfi-east 0x0000000000000023 0 > error-gt0-soc-fatal-mdfi-south 0x0000000000000024 0 > error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 0 > error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 0 > error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 0 > error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 0 > error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 0 > error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a 0 > error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b 0 > error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c 0 > error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d 0 > error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e 0 > error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f 0 > error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 0 > error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 0 > error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 0 > error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 0 > error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 0 > error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 0 > error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 0 > error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 0 > error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 0 > error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 0 > error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a 0 > error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b 0 > error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c 0 > error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d 0 > error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e 0 > error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f 0 > error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 0 > error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 0 > 
error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 0 > error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 0 > error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 0 > error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 0 > error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 0 > error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 0 > error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 0 > error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 0 > error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a 0 > error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b 0 > error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c 0 > error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d 0 > error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e 0 > error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f 0 > error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 0 > error-gt1-correctable-guc 0x1000000000000001 0 > error-gt1-correctable-slm 0x1000000000000003 0 > error-gt1-correctable-eu-ic 0x1000000000000004 0 > error-gt1-correctable-eu-grf 0x1000000000000005 0 > error-gt1-fatal-guc 0x1000000000000009 0 > error-gt1-fatal-slm 0x100000000000000d 0 > error-gt1-fatal-eu-grf 0x100000000000000f 0 > error-gt1-fatal-fpu 0x1000000000000010 0 > error-gt1-fatal-tlb 0x1000000000000011 0 > error-gt1-fatal-l3-fabric 0x1000000000000012 0 > error-gt1-correctable-subslice 0x1000000000000013 0 > error-gt1-correctable-l3bank 0x1000000000000014 0 > error-gt1-fatal-subslice 0x1000000000000015 0 > error-gt1-fatal-l3bank 0x1000000000000016 0 > error-gt1-sgunit-correctable 0x1000000000000017 0 > error-gt1-sgunit-nonfatal 0x1000000000000018 0 > error-gt1-sgunit-fatal 0x1000000000000019 0 > error-gt1-soc-fatal-psf-csc-0 0x100000000000001a 0 > error-gt1-soc-fatal-psf-csc-1 0x100000000000001b 0 > error-gt1-soc-fatal-psf-csc-2 0x100000000000001c 0 > error-gt1-soc-fatal-punit 0x100000000000001d 0 > error-gt1-soc-fatal-psf-0 0x100000000000001e 0 > error-gt1-soc-fatal-psf-1 0x100000000000001f 0 > error-gt1-soc-fatal-psf-2 
0x1000000000000020 0 > error-gt1-soc-fatal-cd0 0x1000000000000021 0 > error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 0 > error-gt1-soc-fatal-mdfi-east 0x1000000000000023 0 > error-gt1-soc-fatal-mdfi-south 0x1000000000000024 0 > error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 0 > error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 0 > error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 0 > error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 0 > error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 0 > error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a 0 > error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b 0 > error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c 0 > error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d 0 > error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e 0 > error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f 0 > error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 0 > error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 0 > error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 0 > error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 0 > error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 0 > error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 0 > error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 0 > error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 0 > error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 0 > error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 0 > error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a 0 > error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b 0 > error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c 0 > error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d 0 > error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e 0 > error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f 0 > error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 0 > error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 0 > error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 0 > error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 0 > error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 0 > > wait on a error event: > > $ ./drm_ras WAIT_ON_EVENT 
--device=drm:/dev/dri/card1 > waiting for error event > error event received > counter value 0 > > list all errors: > > $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1 > name config-id > > error-gt0-correctable-guc 0x0000000000000001 > error-gt0-correctable-slm 0x0000000000000003 > error-gt0-correctable-eu-ic 0x0000000000000004 > error-gt0-correctable-eu-grf 0x0000000000000005 > error-gt0-fatal-guc 0x0000000000000009 > error-gt0-fatal-slm 0x000000000000000d > error-gt0-fatal-eu-grf 0x000000000000000f > error-gt0-fatal-fpu 0x0000000000000010 > error-gt0-fatal-tlb 0x0000000000000011 > error-gt0-fatal-l3-fabric 0x0000000000000012 > error-gt0-correctable-subslice 0x0000000000000013 > error-gt0-correctable-l3bank 0x0000000000000014 > error-gt0-fatal-subslice 0x0000000000000015 > error-gt0-fatal-l3bank 0x0000000000000016 > error-gt0-sgunit-correctable 0x0000000000000017 > error-gt0-sgunit-nonfatal 0x0000000000000018 > error-gt0-sgunit-fatal 0x0000000000000019 > error-gt0-soc-fatal-psf-csc-0 0x000000000000001a > error-gt0-soc-fatal-psf-csc-1 0x000000000000001b > error-gt0-soc-fatal-psf-csc-2 0x000000000000001c > error-gt0-soc-fatal-punit 0x000000000000001d > error-gt0-soc-fatal-psf-0 0x000000000000001e > error-gt0-soc-fatal-psf-1 0x000000000000001f > error-gt0-soc-fatal-psf-2 0x0000000000000020 > error-gt0-soc-fatal-cd0 0x0000000000000021 > error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 > error-gt0-soc-fatal-mdfi-east 0x0000000000000023 > error-gt0-soc-fatal-mdfi-south 0x0000000000000024 > error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 > error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 > error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 > error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 > error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 > error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a > error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b > error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c > error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d > 
error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e > error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f > error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 > error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 > error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 > error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 > error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 > error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 > error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 > error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 > error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 > error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 > error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a > error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b > error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c > error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d > error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e > error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f > error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 > error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 > error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 > error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 > error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 > error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 > error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 > error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 > error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 > error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 > error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a > error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b > error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c > error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d > error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e > error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f > error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 > error-gt1-correctable-guc 0x1000000000000001 > error-gt1-correctable-slm 0x1000000000000003 > error-gt1-correctable-eu-ic 0x1000000000000004 > 
error-gt1-correctable-eu-grf 0x1000000000000005 > error-gt1-fatal-guc 0x1000000000000009 > error-gt1-fatal-slm 0x100000000000000d > error-gt1-fatal-eu-grf 0x100000000000000f > error-gt1-fatal-fpu 0x1000000000000010 > error-gt1-fatal-tlb 0x1000000000000011 > error-gt1-fatal-l3-fabric 0x1000000000000012 > error-gt1-correctable-subslice 0x1000000000000013 > error-gt1-correctable-l3bank 0x1000000000000014 > error-gt1-fatal-subslice 0x1000000000000015 > error-gt1-fatal-l3bank 0x1000000000000016 > error-gt1-sgunit-correctable 0x1000000000000017 > error-gt1-sgunit-nonfatal 0x1000000000000018 > error-gt1-sgunit-fatal 0x1000000000000019 > error-gt1-soc-fatal-psf-csc-0 0x100000000000001a > error-gt1-soc-fatal-psf-csc-1 0x100000000000001b > error-gt1-soc-fatal-psf-csc-2 0x100000000000001c > error-gt1-soc-fatal-punit 0x100000000000001d > error-gt1-soc-fatal-psf-0 0x100000000000001e > error-gt1-soc-fatal-psf-1 0x100000000000001f > error-gt1-soc-fatal-psf-2 0x1000000000000020 > error-gt1-soc-fatal-cd0 0x1000000000000021 > error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 > error-gt1-soc-fatal-mdfi-east 0x1000000000000023 > error-gt1-soc-fatal-mdfi-south 0x1000000000000024 > error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 > error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 > error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 > error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 > error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 > error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a > error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b > error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c > error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d > error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e > error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f > error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 > error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 > error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 > error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 > error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 > 
error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 > error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 > error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 > error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 > error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 > error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a > error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b > error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c > error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d > error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e > error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f > error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 > error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 > error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 > error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 > error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 > > Cc: Alex Deucher <alexander.deucher@amd.com> > Cc: David Airlie <airlied@gmail.com> > Cc: Daniel Vetter <daniel@ffwll.ch> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Cc: Oded Gabbay <ogabbay@kernel.org> > > > Aravind Iddamsetty (5): > drm/netlink: Add netlink infrastructure > drm/xe/RAS: Register a genl netlink family > drm/xe/RAS: Expose the error counters > drm/netlink: define multicast groups > drm/xe/RAS: send multicast event on occurrence of an error > > drivers/gpu/drm/xe/Makefile | 1 + > drivers/gpu/drm/xe/xe_device.c | 3 + > drivers/gpu/drm/xe/xe_device_types.h | 2 + > drivers/gpu/drm/xe/xe_irq.c | 32 ++ > drivers/gpu/drm/xe/xe_module.c | 2 + > drivers/gpu/drm/xe/xe_netlink.c | 526 +++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_netlink.h | 14 + > include/uapi/drm/drm_netlink.h | 81 +++++ > include/uapi/drm/xe_drm.h | 64 ++++ > 9 files changed, 725 insertions(+) > create mode 100644 drivers/gpu/drm/xe/xe_netlink.c > create mode 100644 drivers/gpu/drm/xe/xe_netlink.h > create mode 100644 include/uapi/drm/drm_netlink.h > > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 20+ messages in thread
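A note on the config-id values in the drm_ras listings quoted above: every gt1 config-id equals the matching gt0 config-id plus 0x1000000000000000, which suggests the GT index sits in the top bits of the 64-bit id. The cover letter does not spell out the layout, so the 4-bit GT field starting at bit 60 used below is an assumption read off the sample output, and RAS_GT_SHIFT, ras_config_gt(), ras_config_error() and ras_config_make() are hypothetical names for illustration, not part of the proposed UAPI. A minimal sketch under those assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption read off the sample output: GT index in bits 63:60. */
#define RAS_GT_SHIFT   60
#define RAS_ERROR_MASK ((UINT64_C(1) << RAS_GT_SHIFT) - 1)

/* Hypothetical helper: GT index encoded in a config-id. */
static inline unsigned int ras_config_gt(uint64_t config)
{
	return (unsigned int)(config >> RAS_GT_SHIFT);
}

/* Hypothetical helper: per-GT error index (config-id with GT bits cleared). */
static inline uint64_t ras_config_error(uint64_t config)
{
	return config & RAS_ERROR_MASK;
}

/* Hypothetical helper: rebuild a config-id from GT index and error index. */
static inline uint64_t ras_config_make(unsigned int gt, uint64_t error)
{
	assert(gt < 16 && error <= RAS_ERROR_MASK);
	return ((uint64_t)gt << RAS_GT_SHIFT) | error;
}
```

With these helpers, 0x1000000000000005 (error-gt1-correctable-eu-grf) splits into GT 1 and error index 0x5, the same index that error-gt0-correctable-eu-grf carries on GT 0. If the real layout differs, only RAS_GT_SHIFT and the bound in ras_config_make() would need to change.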
* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-06-05 16:47 ` Alex Deucher
@ 2023-06-06 11:56   ` Iddamsetty, Aravind
  0 siblings, 0 replies; 20+ messages in thread
From: Iddamsetty, Aravind @ 2023-06-06 11:56 UTC
To: Alex Deucher, Hawking Zhang, Harish Kasiviswanathan, Kuehling, Felix, Tuikov, Luben
Cc: alexander.deucher, ogabbay, intel-xe, dri-devel

On 05-06-2023 22:17, Alex Deucher wrote:
> Adding the relevant AMD folks for RAS. We currently expose RAS via
> sysfs, but also have an event interface in KFD which may be somewhat
> similar to this.
>
> If we were to converge on a common RAS interface, would we want to
> look at any commonality in bad page storage/reporting for device
> memory?

Could you please elaborate a bit on this?

Thanks,
Aravind.
>
> Alex
>
> On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty
> <aravind.iddamsetty@intel.com> wrote:
>>
>> Our hardware supports RAS(Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally there were
>> being exposed via PMU (for relative counters) and sysfs interface (for
>> absolute value) in our internal branch. But, due to the limitations in
>> this approach to use two interfaces and also not able to have an event
>> based reporting or configurability, an alternative approach to try
>> netlink was suggested by community for drm subsystem wide UAPI for RAS
>> and telemetry as discussed in [1].
>>
>> This [1] is the inspiration to this series. It uses the generic
>> netlink(genl) family subsystem and exposes a set of commands that can
>> be used by every drm driver, the framework provides a means to have
>> custom commands too. Each drm driver instance in this example xe driver
>> instance registers a family and operations to the genl subsystem through
>> which it enumerates and reports the error counters.
An event based >> notification is also supported to which userpace can subscribe to and >> be notified when any error occurs and read the error counter this avoids >> continuous polling on error counter. This can also be extended to >> threshold based notification. >> >> [1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html >> >> this series is on top of https://patchwork.freedesktop.org/series/116181/ >> >> Below is an example tool drm_ras which demonstrates the use of the >> supported commands. The tool will be sent to ML with the subject >> "[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters" >> >> read single error counter: >> >> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005 >> counter value 0 >> >> read all error counters: >> >> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1 >> name config-id counter >> >> error-gt0-correctable-guc 0x0000000000000001 0 >> error-gt0-correctable-slm 0x0000000000000003 0 >> error-gt0-correctable-eu-ic 0x0000000000000004 0 >> error-gt0-correctable-eu-grf 0x0000000000000005 0 >> error-gt0-fatal-guc 0x0000000000000009 0 >> error-gt0-fatal-slm 0x000000000000000d 0 >> error-gt0-fatal-eu-grf 0x000000000000000f 0 >> error-gt0-fatal-fpu 0x0000000000000010 0 >> error-gt0-fatal-tlb 0x0000000000000011 0 >> error-gt0-fatal-l3-fabric 0x0000000000000012 0 >> error-gt0-correctable-subslice 0x0000000000000013 0 >> error-gt0-correctable-l3bank 0x0000000000000014 0 >> error-gt0-fatal-subslice 0x0000000000000015 0 >> error-gt0-fatal-l3bank 0x0000000000000016 0 >> error-gt0-sgunit-correctable 0x0000000000000017 0 >> error-gt0-sgunit-nonfatal 0x0000000000000018 0 >> error-gt0-sgunit-fatal 0x0000000000000019 0 >> error-gt0-soc-fatal-psf-csc-0 0x000000000000001a 0 >> error-gt0-soc-fatal-psf-csc-1 0x000000000000001b 0 >> error-gt0-soc-fatal-psf-csc-2 0x000000000000001c 0 >> error-gt0-soc-fatal-punit 0x000000000000001d 0 >> error-gt0-soc-fatal-psf-0 
0x000000000000001e 0 >> error-gt0-soc-fatal-psf-1 0x000000000000001f 0 >> error-gt0-soc-fatal-psf-2 0x0000000000000020 0 >> error-gt0-soc-fatal-cd0 0x0000000000000021 0 >> error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 0 >> error-gt0-soc-fatal-mdfi-east 0x0000000000000023 0 >> error-gt0-soc-fatal-mdfi-south 0x0000000000000024 0 >> error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 0 >> error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 0 >> error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 0 >> error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 0 >> error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 0 >> error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a 0 >> error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b 0 >> error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c 0 >> error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d 0 >> error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e 0 >> error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f 0 >> error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 0 >> error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 0 >> error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 0 >> error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 0 >> error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 0 >> error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 0 >> error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 0 >> error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 0 >> error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 0 >> error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 0 >> error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a 0 >> error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b 0 >> error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c 0 >> error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d 0 >> error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e 0 >> error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f 0 >> error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 0 >> error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 0 >> error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 0 >> error-gt0-soc-fatal-hbm-ss3-6 
0x0000000000000043 0 >> error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 0 >> error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 0 >> error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 0 >> error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 0 >> error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 0 >> error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 0 >> error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a 0 >> error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b 0 >> error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c 0 >> error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d 0 >> error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e 0 >> error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f 0 >> error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 0 >> error-gt1-correctable-guc 0x1000000000000001 0 >> error-gt1-correctable-slm 0x1000000000000003 0 >> error-gt1-correctable-eu-ic 0x1000000000000004 0 >> error-gt1-correctable-eu-grf 0x1000000000000005 0 >> error-gt1-fatal-guc 0x1000000000000009 0 >> error-gt1-fatal-slm 0x100000000000000d 0 >> error-gt1-fatal-eu-grf 0x100000000000000f 0 >> error-gt1-fatal-fpu 0x1000000000000010 0 >> error-gt1-fatal-tlb 0x1000000000000011 0 >> error-gt1-fatal-l3-fabric 0x1000000000000012 0 >> error-gt1-correctable-subslice 0x1000000000000013 0 >> error-gt1-correctable-l3bank 0x1000000000000014 0 >> error-gt1-fatal-subslice 0x1000000000000015 0 >> error-gt1-fatal-l3bank 0x1000000000000016 0 >> error-gt1-sgunit-correctable 0x1000000000000017 0 >> error-gt1-sgunit-nonfatal 0x1000000000000018 0 >> error-gt1-sgunit-fatal 0x1000000000000019 0 >> error-gt1-soc-fatal-psf-csc-0 0x100000000000001a 0 >> error-gt1-soc-fatal-psf-csc-1 0x100000000000001b 0 >> error-gt1-soc-fatal-psf-csc-2 0x100000000000001c 0 >> error-gt1-soc-fatal-punit 0x100000000000001d 0 >> error-gt1-soc-fatal-psf-0 0x100000000000001e 0 >> error-gt1-soc-fatal-psf-1 0x100000000000001f 0 >> error-gt1-soc-fatal-psf-2 0x1000000000000020 0 >> error-gt1-soc-fatal-cd0 
0x1000000000000021 0 >> error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 0 >> error-gt1-soc-fatal-mdfi-east 0x1000000000000023 0 >> error-gt1-soc-fatal-mdfi-south 0x1000000000000024 0 >> error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 0 >> error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 0 >> error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 0 >> error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 0 >> error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 0 >> error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a 0 >> error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b 0 >> error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c 0 >> error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d 0 >> error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e 0 >> error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f 0 >> error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 0 >> error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 0 >> error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 0 >> error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 0 >> error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 0 >> error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 0 >> error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 0 >> error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 0 >> error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 0 >> error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 0 >> error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a 0 >> error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b 0 >> error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c 0 >> error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d 0 >> error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e 0 >> error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f 0 >> error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 0 >> error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 0 >> error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 0 >> error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 0 >> error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 0 >> >> wait on a error event: >> >> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1 
>> waiting for error event >> error event received >> counter value 0 >> >> list all errors: >> >> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1 >> name config-id >> >> error-gt0-correctable-guc 0x0000000000000001 >> error-gt0-correctable-slm 0x0000000000000003 >> error-gt0-correctable-eu-ic 0x0000000000000004 >> error-gt0-correctable-eu-grf 0x0000000000000005 >> error-gt0-fatal-guc 0x0000000000000009 >> error-gt0-fatal-slm 0x000000000000000d >> error-gt0-fatal-eu-grf 0x000000000000000f >> error-gt0-fatal-fpu 0x0000000000000010 >> error-gt0-fatal-tlb 0x0000000000000011 >> error-gt0-fatal-l3-fabric 0x0000000000000012 >> error-gt0-correctable-subslice 0x0000000000000013 >> error-gt0-correctable-l3bank 0x0000000000000014 >> error-gt0-fatal-subslice 0x0000000000000015 >> error-gt0-fatal-l3bank 0x0000000000000016 >> error-gt0-sgunit-correctable 0x0000000000000017 >> error-gt0-sgunit-nonfatal 0x0000000000000018 >> error-gt0-sgunit-fatal 0x0000000000000019 >> error-gt0-soc-fatal-psf-csc-0 0x000000000000001a >> error-gt0-soc-fatal-psf-csc-1 0x000000000000001b >> error-gt0-soc-fatal-psf-csc-2 0x000000000000001c >> error-gt0-soc-fatal-punit 0x000000000000001d >> error-gt0-soc-fatal-psf-0 0x000000000000001e >> error-gt0-soc-fatal-psf-1 0x000000000000001f >> error-gt0-soc-fatal-psf-2 0x0000000000000020 >> error-gt0-soc-fatal-cd0 0x0000000000000021 >> error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 >> error-gt0-soc-fatal-mdfi-east 0x0000000000000023 >> error-gt0-soc-fatal-mdfi-south 0x0000000000000024 >> error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 >> error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 >> error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 >> error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 >> error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 >> error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a >> error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b >> error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c >> error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d >> 
error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e >> error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f >> error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 >> error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 >> error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 >> error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 >> error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 >> error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 >> error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 >> error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 >> error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 >> error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 >> error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a >> error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b >> error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c >> error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d >> error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e >> error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f >> error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 >> error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 >> error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 >> error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 >> error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 >> error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 >> error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 >> error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 >> error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 >> error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 >> error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a >> error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b >> error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c >> error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d >> error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e >> error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f >> error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 >> error-gt1-correctable-guc 0x1000000000000001 >> error-gt1-correctable-slm 0x1000000000000003 >> error-gt1-correctable-eu-ic 
0x1000000000000004 >> error-gt1-correctable-eu-grf 0x1000000000000005 >> error-gt1-fatal-guc 0x1000000000000009 >> error-gt1-fatal-slm 0x100000000000000d >> error-gt1-fatal-eu-grf 0x100000000000000f >> error-gt1-fatal-fpu 0x1000000000000010 >> error-gt1-fatal-tlb 0x1000000000000011 >> error-gt1-fatal-l3-fabric 0x1000000000000012 >> error-gt1-correctable-subslice 0x1000000000000013 >> error-gt1-correctable-l3bank 0x1000000000000014 >> error-gt1-fatal-subslice 0x1000000000000015 >> error-gt1-fatal-l3bank 0x1000000000000016 >> error-gt1-sgunit-correctable 0x1000000000000017 >> error-gt1-sgunit-nonfatal 0x1000000000000018 >> error-gt1-sgunit-fatal 0x1000000000000019 >> error-gt1-soc-fatal-psf-csc-0 0x100000000000001a >> error-gt1-soc-fatal-psf-csc-1 0x100000000000001b >> error-gt1-soc-fatal-psf-csc-2 0x100000000000001c >> error-gt1-soc-fatal-punit 0x100000000000001d >> error-gt1-soc-fatal-psf-0 0x100000000000001e >> error-gt1-soc-fatal-psf-1 0x100000000000001f >> error-gt1-soc-fatal-psf-2 0x1000000000000020 >> error-gt1-soc-fatal-cd0 0x1000000000000021 >> error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 >> error-gt1-soc-fatal-mdfi-east 0x1000000000000023 >> error-gt1-soc-fatal-mdfi-south 0x1000000000000024 >> error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 >> error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 >> error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 >> error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 >> error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 >> error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a >> error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b >> error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c >> error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d >> error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e >> error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f >> error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 >> error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 >> error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 >> error-gt1-soc-fatal-hbm-ss1-6 
0x1000000000000033 >> error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 >> error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 >> error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 >> error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 >> error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 >> error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 >> error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a >> error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b >> error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c >> error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d >> error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e >> error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f >> error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 >> error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 >> error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 >> error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 >> error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 >> >> Cc: Alex Deucher <alexander.deucher@amd.com> >> Cc: David Airlie <airlied@gmail.com> >> Cc: Daniel Vetter <daniel@ffwll.ch> >> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> >> Cc: Oded Gabbay <ogabbay@kernel.org> >> >> >> Aravind Iddamsetty (5): >> drm/netlink: Add netlink infrastructure >> drm/xe/RAS: Register a genl netlink family >> drm/xe/RAS: Expose the error counters >> drm/netlink: define multicast groups >> drm/xe/RAS: send multicast event on occurrence of an error >> >> drivers/gpu/drm/xe/Makefile | 1 + >> drivers/gpu/drm/xe/xe_device.c | 3 + >> drivers/gpu/drm/xe/xe_device_types.h | 2 + >> drivers/gpu/drm/xe/xe_irq.c | 32 ++ >> drivers/gpu/drm/xe/xe_module.c | 2 + >> drivers/gpu/drm/xe/xe_netlink.c | 526 +++++++++++++++++++++++++++ >> drivers/gpu/drm/xe/xe_netlink.h | 14 + >> include/uapi/drm/drm_netlink.h | 81 +++++ >> include/uapi/drm/xe_drm.h | 64 ++++ >> 9 files changed, 725 insertions(+) >> create mode 100644 drivers/gpu/drm/xe/xe_netlink.c >> create mode 100644 drivers/gpu/drm/xe/xe_netlink.h >> create mode 100644 include/uapi/drm/drm_netlink.h >> >> 
--
>> 2.25.1
* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
  2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
                   ` (6 preceding siblings ...)
  2023-06-05 16:47 ` Alex Deucher
@ 2023-06-21 17:24 ` Sebastian Wick
  2023-07-17 12:02   ` Oded Gabbay
  7 siblings, 1 reply; 20+ messages in thread
From: Sebastian Wick @ 2023-06-21 17:24 UTC
To: Aravind Iddamsetty; +Cc: alexander.deucher, ogabbay, intel-xe, dri-devel

On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty
<aravind.iddamsetty@intel.com> wrote:
>
> Our hardware supports RAS(Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally there were
> being exposed via PMU (for relative counters) and sysfs interface (for
> absolute value) in our internal branch. But, due to the limitations in
> this approach to use two interfaces and also not able to have an event
> based reporting or configurability, an alternative approach to try
> netlink was suggested by community for drm subsystem wide UAPI for RAS
> and telemetry as discussed in [1].
>
> This [1] is the inspiration to this series. It uses the generic
> netlink(genl) family subsystem and exposes a set of commands that can
> be used by every drm driver, the framework provides a means to have
> custom commands too. Each drm driver instance in this example xe driver
> instance registers a family and operations to the genl subsystem through
> which it enumerates and reports the error counters. An event based
> notification is also supported to which userpace can subscribe to and
> be notified when any error occurs and read the error counter this avoids
> continuous polling on error counter. This can also be extended to
> threshold based notification.
Be aware that netlink can be quite awkward in user space because it's attached to the netns while the device is in the mount ns and there are special rules for netlink regarding namespacing.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem 2023-06-21 17:24 ` Sebastian Wick @ 2023-07-17 12:02 ` Oded Gabbay 0 siblings, 0 replies; 20+ messages in thread From: Oded Gabbay @ 2023-07-17 12:02 UTC (permalink / raw) To: Sebastian Wick; +Cc: alexander.deucher, dri-devel, intel-xe, Aravind Iddamsetty On Wed, Jun 21, 2023 at 8:24 PM Sebastian Wick <sebastian.wick@redhat.com> wrote: > > On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty > <aravind.iddamsetty@intel.com> wrote: > > > > Our hardware supports RAS(Reliability, Availability, Serviceability) by > > exposing a set of error counters which can be used by observability > > tools to take corrective actions or repairs. Traditionally there were > > being exposed via PMU (for relative counters) and sysfs interface (for > > absolute value) in our internal branch. But, due to the limitations in > > this approach to use two interfaces and also not able to have an event > > based reporting or configurability, an alternative approach to try > > netlink was suggested by community for drm subsystem wide UAPI for RAS > > and telemetry as discussed in [1]. > > > > This [1] is the inspiration to this series. It uses the generic > > netlink(genl) family subsystem and exposes a set of commands that can > > be used by every drm driver, the framework provides a means to have > > custom commands too. Each drm driver instance in this example xe driver > > instance registers a family and operations to the genl subsystem through > > which it enumerates and reports the error counters. An event based > > notification is also supported to which userpace can subscribe to and > > be notified when any error occurs and read the error counter this avoids > > continuous polling on error counter. This can also be extended to > > threshold based notification. 
>
> Be aware that netlink can be quite awkward in user space because it's
> attached to the netns while the device is in the mount ns and there
> are special rules for netlink regarding namespacing.

I agree, we need to be sure this works in all common deployments, mainly
Docker and Kubernetes, before deciding to go down this path.

Oded

^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC v4 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem
@ 2023-10-20 15:58 Aravind Iddamsetty
  2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
  0 siblings, 1 reply; 20+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

Our hardware supports RAS (Reliability, Availability, Serviceability) by
reporting errors to the host, which the KMD processes and exposes as a set
of error counters that observability tools can use to take corrective
actions or trigger repairs. Traditionally, these were exposed via PMU (for
relative counters) and a sysfs interface (for absolute values) in our
internal branch. However, that approach needs two interfaces and offers
neither event-based reporting nor configurability, so the community
suggested trying netlink as a drm-subsystem-wide UAPI for RAS and
telemetry, as discussed in [1].

The discussion in [1] is the inspiration for this series. It uses the
generic netlink (genl) family subsystem and exposes a set of commands that
every drm driver can use; the framework also provides a means to add
custom commands. Each drm driver instance (in this example, an xe driver
instance) registers a family and operations with the genl subsystem,
through which it enumerates and reports the error counters. Event-based
notification is also supported: userspace can subscribe and be notified
when any error occurs, then read the error counter, which avoids
continuous polling of the error counters. This can also be extended to
threshold-based notification.

[1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

This series is on top of https://patchwork.freedesktop.org/series/125373/

v4:
1. Rebase
2. Rename drm_genl_send to drm_genl_reply
3. Catch error from xa_store and handle appropriately
4. Presently xe_list_errors fills blank data for IGFX; prevent it by
   having an early check of IS_DGFX (Michael J. Ruhl)

v3:
1. Rebase on latest RAS series for XE
2. Drop DRIVER_NETLINK flag and use the driver_genl_ops structure to
   register to netlink subsystem

v2: Define common interfaces to genl netlink subsystem that all drm
drivers can leverage.

Below is an example tool, drm_ras, which demonstrates the use of the
supported commands. The tool will be sent to the ML with the subject
"[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to read
RAS error counters"
https://patchwork.freedesktop.org/series/118437/#rev2

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
counter value 0

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name                                    config-id           counter
error-gt0-correctable-guc               0x0000000000000001  0
error-gt0-correctable-slm               0x0000000000000003  0
error-gt0-correctable-eu-ic             0x0000000000000004  0
error-gt0-correctable-eu-grf            0x0000000000000005  0
error-gt0-fatal-guc                     0x0000000000000009  0
error-gt0-fatal-slm                     0x000000000000000d  0
error-gt0-fatal-eu-grf                  0x000000000000000f  0
error-gt0-fatal-fpu                     0x0000000000000010  0
error-gt0-fatal-tlb                     0x0000000000000011  0
error-gt0-fatal-l3-fabric               0x0000000000000012  0
error-gt0-correctable-subslice          0x0000000000000013  0
error-gt0-correctable-l3bank            0x0000000000000014  0
error-gt0-fatal-subslice                0x0000000000000015  0
error-gt0-fatal-l3bank                  0x0000000000000016  0
error-gt0-sgunit-correctable            0x0000000000000017  0
error-gt0-sgunit-nonfatal               0x0000000000000018  0
error-gt0-sgunit-fatal                  0x0000000000000019  0
error-gt0-soc-fatal-psf-csc-0           0x000000000000001a  0
error-gt0-soc-fatal-psf-csc-1           0x000000000000001b  0
error-gt0-soc-fatal-psf-csc-2           0x000000000000001c  0
error-gt0-soc-fatal-punit               0x000000000000001d  0
error-gt0-soc-fatal-psf-0               0x000000000000001e  0
error-gt0-soc-fatal-psf-1               0x000000000000001f  0
error-gt0-soc-fatal-psf-2               0x0000000000000020  0
error-gt0-soc-fatal-cd0                 0x0000000000000021  0
error-gt0-soc-fatal-cd0-mdfi            0x0000000000000022  0
error-gt0-soc-fatal-mdfi-east           0x0000000000000023  0
error-gt0-soc-fatal-mdfi-south          0x0000000000000024  0
error-gt0-soc-fatal-hbm-ss0-0           0x0000000000000025  0
error-gt0-soc-fatal-hbm-ss0-1           0x0000000000000026  0
error-gt0-soc-fatal-hbm-ss0-2           0x0000000000000027  0
error-gt0-soc-fatal-hbm-ss0-3           0x0000000000000028  0
error-gt0-soc-fatal-hbm-ss0-4           0x0000000000000029  0
error-gt0-soc-fatal-hbm-ss0-5           0x000000000000002a  0
error-gt0-soc-fatal-hbm-ss0-6           0x000000000000002b  0
error-gt0-soc-fatal-hbm-ss0-7           0x000000000000002c  0
error-gt0-soc-fatal-hbm-ss1-0           0x000000000000002d  0
error-gt0-soc-fatal-hbm-ss1-1           0x000000000000002e  0
error-gt0-soc-fatal-hbm-ss1-2           0x000000000000002f  0
error-gt0-soc-fatal-hbm-ss1-3           0x0000000000000030  0
error-gt0-soc-fatal-hbm-ss1-4           0x0000000000000031  0
error-gt0-soc-fatal-hbm-ss1-5           0x0000000000000032  0
error-gt0-soc-fatal-hbm-ss1-6           0x0000000000000033  0
error-gt0-soc-fatal-hbm-ss1-7           0x0000000000000034  0
error-gt0-soc-fatal-hbm-ss2-0           0x0000000000000035  0
error-gt0-soc-fatal-hbm-ss2-1           0x0000000000000036  0
error-gt0-soc-fatal-hbm-ss2-2           0x0000000000000037  0
error-gt0-soc-fatal-hbm-ss2-3           0x0000000000000038  0
error-gt0-soc-fatal-hbm-ss2-4           0x0000000000000039  0
error-gt0-soc-fatal-hbm-ss2-5           0x000000000000003a  0
error-gt0-soc-fatal-hbm-ss2-6           0x000000000000003b  0
error-gt0-soc-fatal-hbm-ss2-7           0x000000000000003c  0
error-gt0-soc-fatal-hbm-ss3-0           0x000000000000003d  0
error-gt0-soc-fatal-hbm-ss3-1           0x000000000000003e  0
error-gt0-soc-fatal-hbm-ss3-2           0x000000000000003f  0
error-gt0-soc-fatal-hbm-ss3-3           0x0000000000000040  0
error-gt0-soc-fatal-hbm-ss3-4           0x0000000000000041  0
error-gt0-soc-fatal-hbm-ss3-5           0x0000000000000042  0
error-gt0-soc-fatal-hbm-ss3-6           0x0000000000000043  0
error-gt0-soc-fatal-hbm-ss3-7           0x0000000000000044  0
error-gt0-gsc-correctable-sram-ecc      0x0000000000000045  0
error-gt0-gsc-nonfatal-mia-shutdown     0x0000000000000046  0
error-gt0-gsc-nonfatal-mia-int          0x0000000000000047  0
error-gt0-gsc-nonfatal-sram-ecc         0x0000000000000048  0
error-gt0-gsc-nonfatal-wdg-timeout      0x0000000000000049  0
error-gt0-gsc-nonfatal-rom-parity       0x000000000000004a  0
error-gt0-gsc-nonfatal-ucode-parity     0x000000000000004b  0
error-gt0-gsc-nonfatal-glitch-det       0x000000000000004c  0
error-gt0-gsc-nonfatal-fuse-pull        0x000000000000004d  0
error-gt0-gsc-nonfatal-fuse-crc-check   0x000000000000004e  0
error-gt0-gsc-nonfatal-selfmbist        0x000000000000004f  0
error-gt0-gsc-nonfatal-aon-parity       0x0000000000000050  0
error-gt1-correctable-guc               0x1000000000000001  0
error-gt1-correctable-slm               0x1000000000000003  0
error-gt1-correctable-eu-ic             0x1000000000000004  0
error-gt1-correctable-eu-grf            0x1000000000000005  0
error-gt1-fatal-guc                     0x1000000000000009  0
error-gt1-fatal-slm                     0x100000000000000d  0
error-gt1-fatal-eu-grf                  0x100000000000000f  0
error-gt1-fatal-fpu                     0x1000000000000010  0
error-gt1-fatal-tlb                     0x1000000000000011  0
error-gt1-fatal-l3-fabric               0x1000000000000012  0
error-gt1-correctable-subslice          0x1000000000000013  0
error-gt1-correctable-l3bank            0x1000000000000014  0
error-gt1-fatal-subslice                0x1000000000000015  0
error-gt1-fatal-l3bank                  0x1000000000000016  0
error-gt1-sgunit-correctable            0x1000000000000017  0
error-gt1-sgunit-nonfatal               0x1000000000000018  0
error-gt1-sgunit-fatal                  0x1000000000000019  0
error-gt1-soc-fatal-psf-csc-0           0x100000000000001a  0
error-gt1-soc-fatal-psf-csc-1           0x100000000000001b  0
error-gt1-soc-fatal-psf-csc-2           0x100000000000001c  0
error-gt1-soc-fatal-punit               0x100000000000001d  0
error-gt1-soc-fatal-psf-0               0x100000000000001e  0
error-gt1-soc-fatal-psf-1               0x100000000000001f  0
error-gt1-soc-fatal-psf-2               0x1000000000000020  0
error-gt1-soc-fatal-cd0                 0x1000000000000021  0
error-gt1-soc-fatal-cd0-mdfi            0x1000000000000022  0
error-gt1-soc-fatal-mdfi-east           0x1000000000000023  0
error-gt1-soc-fatal-mdfi-south          0x1000000000000024  0
error-gt1-soc-fatal-hbm-ss0-0           0x1000000000000025  0
error-gt1-soc-fatal-hbm-ss0-1           0x1000000000000026  0
error-gt1-soc-fatal-hbm-ss0-2           0x1000000000000027  0
error-gt1-soc-fatal-hbm-ss0-3           0x1000000000000028  0
error-gt1-soc-fatal-hbm-ss0-4           0x1000000000000029  0
error-gt1-soc-fatal-hbm-ss0-5           0x100000000000002a  0
error-gt1-soc-fatal-hbm-ss0-6           0x100000000000002b  0
error-gt1-soc-fatal-hbm-ss0-7           0x100000000000002c  0
error-gt1-soc-fatal-hbm-ss1-0           0x100000000000002d  0
error-gt1-soc-fatal-hbm-ss1-1           0x100000000000002e  0
error-gt1-soc-fatal-hbm-ss1-2           0x100000000000002f  0
error-gt1-soc-fatal-hbm-ss1-3           0x1000000000000030  0
error-gt1-soc-fatal-hbm-ss1-4           0x1000000000000031  0
error-gt1-soc-fatal-hbm-ss1-5           0x1000000000000032  0
error-gt1-soc-fatal-hbm-ss1-6           0x1000000000000033  0
error-gt1-soc-fatal-hbm-ss1-7           0x1000000000000034  0
error-gt1-soc-fatal-hbm-ss2-0           0x1000000000000035  0
error-gt1-soc-fatal-hbm-ss2-1           0x1000000000000036  0
error-gt1-soc-fatal-hbm-ss2-2           0x1000000000000037  0
error-gt1-soc-fatal-hbm-ss2-3           0x1000000000000038  0
error-gt1-soc-fatal-hbm-ss2-4           0x1000000000000039  0
error-gt1-soc-fatal-hbm-ss2-5           0x100000000000003a  0
error-gt1-soc-fatal-hbm-ss2-6           0x100000000000003b  0
error-gt1-soc-fatal-hbm-ss2-7           0x100000000000003c  0
error-gt1-soc-fatal-hbm-ss3-0           0x100000000000003d  0
error-gt1-soc-fatal-hbm-ss3-1           0x100000000000003e  0
error-gt1-soc-fatal-hbm-ss3-2           0x100000000000003f  0
error-gt1-soc-fatal-hbm-ss3-3           0x1000000000000040  0
error-gt1-soc-fatal-hbm-ss3-4           0x1000000000000041  0
error-gt1-soc-fatal-hbm-ss3-5           0x1000000000000042  0
error-gt1-soc-fatal-hbm-ss3-6           0x1000000000000043  0
error-gt1-soc-fatal-hbm-ss3-7           0x1000000000000044  0

wait on an error event:

$ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
waiting for error event
error event received
counter value 0

list all errors:

$ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
name                                    config-id
error-gt0-correctable-guc               0x0000000000000001
error-gt0-correctable-slm               0x0000000000000003
error-gt0-correctable-eu-ic             0x0000000000000004
error-gt0-correctable-eu-grf            0x0000000000000005
error-gt0-fatal-guc                     0x0000000000000009
error-gt0-fatal-slm                     0x000000000000000d
error-gt0-fatal-eu-grf                  0x000000000000000f
error-gt0-fatal-fpu                     0x0000000000000010
error-gt0-fatal-tlb                     0x0000000000000011
error-gt0-fatal-l3-fabric               0x0000000000000012
error-gt0-correctable-subslice          0x0000000000000013
error-gt0-correctable-l3bank            0x0000000000000014
error-gt0-fatal-subslice                0x0000000000000015
error-gt0-fatal-l3bank                  0x0000000000000016
error-gt0-sgunit-correctable            0x0000000000000017
error-gt0-sgunit-nonfatal               0x0000000000000018
error-gt0-sgunit-fatal                  0x0000000000000019
error-gt0-soc-fatal-psf-csc-0           0x000000000000001a
error-gt0-soc-fatal-psf-csc-1           0x000000000000001b
error-gt0-soc-fatal-psf-csc-2           0x000000000000001c
error-gt0-soc-fatal-punit               0x000000000000001d
error-gt0-soc-fatal-psf-0               0x000000000000001e
error-gt0-soc-fatal-psf-1               0x000000000000001f
error-gt0-soc-fatal-psf-2               0x0000000000000020
error-gt0-soc-fatal-cd0                 0x0000000000000021
error-gt0-soc-fatal-cd0-mdfi            0x0000000000000022
error-gt0-soc-fatal-mdfi-east           0x0000000000000023
error-gt0-soc-fatal-mdfi-south          0x0000000000000024
error-gt0-soc-fatal-hbm-ss0-0           0x0000000000000025
error-gt0-soc-fatal-hbm-ss0-1           0x0000000000000026
error-gt0-soc-fatal-hbm-ss0-2           0x0000000000000027
error-gt0-soc-fatal-hbm-ss0-3           0x0000000000000028
error-gt0-soc-fatal-hbm-ss0-4           0x0000000000000029
error-gt0-soc-fatal-hbm-ss0-5           0x000000000000002a
error-gt0-soc-fatal-hbm-ss0-6           0x000000000000002b
error-gt0-soc-fatal-hbm-ss0-7           0x000000000000002c
error-gt0-soc-fatal-hbm-ss1-0           0x000000000000002d
error-gt0-soc-fatal-hbm-ss1-1           0x000000000000002e
error-gt0-soc-fatal-hbm-ss1-2           0x000000000000002f
error-gt0-soc-fatal-hbm-ss1-3           0x0000000000000030
error-gt0-soc-fatal-hbm-ss1-4           0x0000000000000031
error-gt0-soc-fatal-hbm-ss1-5           0x0000000000000032
error-gt0-soc-fatal-hbm-ss1-6           0x0000000000000033
error-gt0-soc-fatal-hbm-ss1-7           0x0000000000000034
error-gt0-soc-fatal-hbm-ss2-0           0x0000000000000035
error-gt0-soc-fatal-hbm-ss2-1           0x0000000000000036
error-gt0-soc-fatal-hbm-ss2-2           0x0000000000000037
error-gt0-soc-fatal-hbm-ss2-3           0x0000000000000038
error-gt0-soc-fatal-hbm-ss2-4           0x0000000000000039
error-gt0-soc-fatal-hbm-ss2-5           0x000000000000003a
error-gt0-soc-fatal-hbm-ss2-6           0x000000000000003b
error-gt0-soc-fatal-hbm-ss2-7           0x000000000000003c
error-gt0-soc-fatal-hbm-ss3-0           0x000000000000003d
error-gt0-soc-fatal-hbm-ss3-1           0x000000000000003e
error-gt0-soc-fatal-hbm-ss3-2           0x000000000000003f
error-gt0-soc-fatal-hbm-ss3-3           0x0000000000000040
error-gt0-soc-fatal-hbm-ss3-4           0x0000000000000041
error-gt0-soc-fatal-hbm-ss3-5           0x0000000000000042
error-gt0-soc-fatal-hbm-ss3-6           0x0000000000000043
error-gt0-soc-fatal-hbm-ss3-7           0x0000000000000044
error-gt0-gsc-correctable-sram-ecc      0x0000000000000045
error-gt0-gsc-nonfatal-mia-shutdown     0x0000000000000046
error-gt0-gsc-nonfatal-mia-int          0x0000000000000047
error-gt0-gsc-nonfatal-sram-ecc         0x0000000000000048
error-gt0-gsc-nonfatal-wdg-timeout      0x0000000000000049
error-gt0-gsc-nonfatal-rom-parity       0x000000000000004a
error-gt0-gsc-nonfatal-ucode-parity     0x000000000000004b
error-gt0-gsc-nonfatal-glitch-det       0x000000000000004c
error-gt0-gsc-nonfatal-fuse-pull        0x000000000000004d
error-gt0-gsc-nonfatal-fuse-crc-check   0x000000000000004e
error-gt0-gsc-nonfatal-selfmbist        0x000000000000004f
error-gt0-gsc-nonfatal-aon-parity       0x0000000000000050
error-gt1-correctable-guc               0x1000000000000001
error-gt1-correctable-slm               0x1000000000000003
error-gt1-correctable-eu-ic             0x1000000000000004
error-gt1-correctable-eu-grf            0x1000000000000005
error-gt1-fatal-guc                     0x1000000000000009
error-gt1-fatal-slm                     0x100000000000000d
error-gt1-fatal-eu-grf                  0x100000000000000f
error-gt1-fatal-fpu                     0x1000000000000010
error-gt1-fatal-tlb                     0x1000000000000011
error-gt1-fatal-l3-fabric               0x1000000000000012
error-gt1-correctable-subslice          0x1000000000000013
error-gt1-correctable-l3bank            0x1000000000000014
error-gt1-fatal-subslice                0x1000000000000015
error-gt1-fatal-l3bank                  0x1000000000000016
error-gt1-sgunit-correctable            0x1000000000000017
error-gt1-sgunit-nonfatal               0x1000000000000018
error-gt1-sgunit-fatal                  0x1000000000000019
error-gt1-soc-fatal-psf-csc-0           0x100000000000001a
error-gt1-soc-fatal-psf-csc-1           0x100000000000001b
error-gt1-soc-fatal-psf-csc-2           0x100000000000001c
error-gt1-soc-fatal-punit               0x100000000000001d
error-gt1-soc-fatal-psf-0               0x100000000000001e
error-gt1-soc-fatal-psf-1               0x100000000000001f
error-gt1-soc-fatal-psf-2               0x1000000000000020
error-gt1-soc-fatal-cd0                 0x1000000000000021
error-gt1-soc-fatal-cd0-mdfi            0x1000000000000022
error-gt1-soc-fatal-mdfi-east           0x1000000000000023
error-gt1-soc-fatal-mdfi-south          0x1000000000000024
error-gt1-soc-fatal-hbm-ss0-0           0x1000000000000025
error-gt1-soc-fatal-hbm-ss0-1           0x1000000000000026
error-gt1-soc-fatal-hbm-ss0-2           0x1000000000000027
error-gt1-soc-fatal-hbm-ss0-3           0x1000000000000028
error-gt1-soc-fatal-hbm-ss0-4           0x1000000000000029
error-gt1-soc-fatal-hbm-ss0-5           0x100000000000002a
error-gt1-soc-fatal-hbm-ss0-6           0x100000000000002b
error-gt1-soc-fatal-hbm-ss0-7           0x100000000000002c
error-gt1-soc-fatal-hbm-ss1-0           0x100000000000002d
error-gt1-soc-fatal-hbm-ss1-1           0x100000000000002e
error-gt1-soc-fatal-hbm-ss1-2           0x100000000000002f
error-gt1-soc-fatal-hbm-ss1-3           0x1000000000000030
error-gt1-soc-fatal-hbm-ss1-4           0x1000000000000031
error-gt1-soc-fatal-hbm-ss1-5           0x1000000000000032
error-gt1-soc-fatal-hbm-ss1-6           0x1000000000000033
error-gt1-soc-fatal-hbm-ss1-7           0x1000000000000034
error-gt1-soc-fatal-hbm-ss2-0           0x1000000000000035
error-gt1-soc-fatal-hbm-ss2-1           0x1000000000000036
error-gt1-soc-fatal-hbm-ss2-2           0x1000000000000037
error-gt1-soc-fatal-hbm-ss2-3           0x1000000000000038
error-gt1-soc-fatal-hbm-ss2-4           0x1000000000000039
error-gt1-soc-fatal-hbm-ss2-5           0x100000000000003a
error-gt1-soc-fatal-hbm-ss2-6           0x100000000000003b
error-gt1-soc-fatal-hbm-ss2-7           0x100000000000003c
error-gt1-soc-fatal-hbm-ss3-0           0x100000000000003d
error-gt1-soc-fatal-hbm-ss3-1           0x100000000000003e
error-gt1-soc-fatal-hbm-ss3-2           0x100000000000003f
error-gt1-soc-fatal-hbm-ss3-3           0x1000000000000040
error-gt1-soc-fatal-hbm-ss3-4           0x1000000000000041
error-gt1-soc-fatal-hbm-ss3-5           0x1000000000000042
error-gt1-soc-fatal-hbm-ss3-6           0x1000000000000043
error-gt1-soc-fatal-hbm-ss3-7           0x1000000000000044

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Cc: Kuehling Felix <Felix.Kuehling@amd.com>
Cc: Tuikov Luben <Luben.Tuikov@amd.com>
Cc: Ruhl, Michael J <michael.j.ruhl@intel.com>

Aravind Iddamsetty (5):
  drm/netlink: Add netlink infrastructure
  drm/xe/RAS: Register netlink capability
  drm/xe/RAS: Expose the error counters
  drm/netlink: Define multicast groups
  drm/xe/RAS: send multicast event on occurrence of an error

 drivers/gpu/drm/Makefile             |   1 +
 drivers/gpu/drm/drm_drv.c            |   7 +
 drivers/gpu/drm/drm_netlink.c        | 195 ++++++++++
 drivers/gpu/drm/xe/Makefile          |   1 +
 drivers/gpu/drm/xe/xe_device.c       |   4 +
 drivers/gpu/drm/xe/xe_device_types.h |   1 +
 drivers/gpu/drm/xe/xe_hw_error.c     |  33 ++
 drivers/gpu/drm/xe/xe_netlink.c      | 517 +++++++++++++++++++++++++++
 include/drm/drm_device.h             |   8 +
 include/drm/drm_drv.h                |   7 +
 include/drm/drm_netlink.h            |  35 ++
 include/uapi/drm/drm_netlink.h       |  87 +++++
 include/uapi/drm/xe_drm.h            |  81 +++++
 13 files changed, 977 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_netlink.c
 create mode 100644 drivers/gpu/drm/xe/xe_netlink.c
 create mode 100644 include/drm/drm_netlink.h
 create mode 100644 include/uapi/drm/drm_netlink.h

--
2.25.1
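In the sample listings in this cover letter, every gt1 config-id equals the matching gt0 config-id with 0x1000000000000000 added, which suggests that the GT index sits in the top nibble of the 64-bit id. The sketch below shows how a userspace tool might split an id on that basis. The bit layout is an assumption inferred only from the sample output, not taken from the uapi header, and all names in it (CONFIG_ID_GT_SHIFT, config_id_gt(), config_id_error(), config_id_make()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ASSUMPTION: bits 63:60 of a config-id carry the GT index and the
 * remaining bits carry the per-GT error id. This is inferred from the
 * sample drm_ras output only; the real encoding is whatever the uapi
 * header in the series defines.
 */
#define CONFIG_ID_GT_SHIFT 60
#define CONFIG_ID_ERR_MASK ((UINT64_C(1) << CONFIG_ID_GT_SHIFT) - 1)

/* Extract the (assumed) GT index from a config-id. */
static unsigned int config_id_gt(uint64_t id)
{
	return (unsigned int)(id >> CONFIG_ID_GT_SHIFT);
}

/* Extract the (assumed) per-GT error id from a config-id. */
static uint64_t config_id_error(uint64_t id)
{
	return id & CONFIG_ID_ERR_MASK;
}

/* Build a config-id from a GT index and a per-GT error id. */
static uint64_t config_id_make(unsigned int gt, uint64_t err)
{
	assert(gt < 16);                   /* must fit in four bits */
	assert(err <= CONFIG_ID_ERR_MASK); /* must not spill into the GT field */
	return ((uint64_t)gt << CONFIG_ID_GT_SHIFT) | err;
}
```

Under that assumption, config_id_make(1, 0x2c) gives 0x100000000000002c, the id listed for error-gt1-soc-fatal-hbm-ss0-7.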
* [RFC 4/5] drm/netlink: Define multicast groups
  2023-10-20 15:58 [RFC v4 " Aravind Iddamsetty
@ 2023-10-20 15:58 ` Aravind Iddamsetty
  2023-10-20 20:39   ` Ruhl, Michael J
  0 siblings, 1 reply; 20+ messages in thread
From: Aravind Iddamsetty @ 2023-10-20 15:58 UTC (permalink / raw)
  To: intel-xe, dri-devel, alexander.deucher, airlied, daniel,
	joonas.lahtinen, ogabbay, ttayar, Hawking.Zhang,
	Harish.Kasiviswanathan, Felix.Kuehling, Luben.Tuikov,
	michael.j.ruhl

The netlink subsystem supports event notifications to userspace. We define
two multicast groups, for correctable and uncorrectable errors, to which
userspace can subscribe and be notified when any of those errors happen.
The group names are local to the driver's genl netlink family.

Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
---
 drivers/gpu/drm/drm_netlink.c  | 7 +++++++
 include/drm/drm_netlink.h      | 5 +++++
 include/uapi/drm/drm_netlink.h | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
index 8add249c1da3..425a7355a573 100644
--- a/drivers/gpu/drm/drm_netlink.c
+++ b/drivers/gpu/drm/drm_netlink.c
@@ -12,6 +12,11 @@
 
 DEFINE_XARRAY(drm_dev_xarray);
 
+static const struct genl_multicast_group drm_event_mcgrps[] = {
+	[DRM_GENL_MCAST_CORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
+};
+
 /**
  * drm_genl_reply - response to a request
  * @msg: socket buffer
@@ -133,6 +138,8 @@ static void drm_genl_family_init(struct drm_device *dev)
 	dev->drm_genl_family.ops = drm_genl_ops;
 	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
 	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
+	dev->drm_genl_family.mcgrps = drm_event_mcgrps;
+	dev->drm_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
 	dev->drm_genl_family.module = dev->dev->driver->owner;
 }
 
diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
index 54527dae7847..758239643c17 100644
--- a/include/drm/drm_netlink.h
+++ b/include/drm/drm_netlink.h
@@ -13,6 +13,11 @@
 
 struct drm_device;
 
+enum mcgrps_events {
+	DRM_GENL_MCAST_CORR_ERR,
+	DRM_GENL_MCAST_UNCORR_ERR,
+};
+
 struct driver_genl_ops {
 	int	(*doit)(struct drm_device *dev,
 			struct sk_buff *skb,
diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
index aab42147a20e..c7a0ce5b4624 100644
--- a/include/uapi/drm/drm_netlink.h
+++ b/include/uapi/drm/drm_netlink.h
@@ -26,6 +26,8 @@
 #define _DRM_NETLINK_H_
 
 #define DRM_GENL_VERSION 1
+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR	"drm_uncorr_err"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -43,6 +45,8 @@ enum drm_genl_error_cmds {
 	DRM_RAS_CMD_READ_ONE,
 	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all errors */
 	DRM_RAS_CMD_READ_ALL,
+	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of multicast event */
+	DRM_RAS_CMD_ERROR_EVENT,
 
 	__DRM_CMD_MAX,
 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
--
2.25.1
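The patch ties each multicast group index to its name with designated initializers and derives the group count from the array size. The userspace-compilable sketch below mirrors that pattern so the index-to-name mapping can be checked in isolation. The enum values and the two name macros are copied from the patch; struct demo_mcgrp, DEMO_ARRAY_SIZE, demo_mcgrp_name() and demo_mcgrp_index() are hypothetical stand-ins for illustration and are not part of the series or of the kernel's genl API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Group names, as defined in the patch's include/uapi/drm/drm_netlink.h. */
#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR   "drm_corr_err"
#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR "drm_uncorr_err"

/* Group indices, as defined in the patch's include/drm/drm_netlink.h. */
enum mcgrps_events {
	DRM_GENL_MCAST_CORR_ERR,
	DRM_GENL_MCAST_UNCORR_ERR,
};

/* Hypothetical stand-in for the kernel's struct genl_multicast_group. */
struct demo_mcgrp {
	const char *name;
};

/* Hypothetical stand-in for the kernel's ARRAY_SIZE(). */
#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Designated initializers keep each slot tied to its enum value. */
static const struct demo_mcgrp demo_event_mcgrps[] = {
	[DRM_GENL_MCAST_CORR_ERR]   = { .name = DRM_GENL_MCAST_GROUP_NAME_CORR_ERR },
	[DRM_GENL_MCAST_UNCORR_ERR] = { .name = DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR },
};

/* Compile-time check that the table holds exactly the two groups. */
static_assert(DEMO_ARRAY_SIZE(demo_event_mcgrps) == 2,
	      "one table entry per multicast group");

/* Return the group name for an index, or NULL if the index is out of range. */
static const char *demo_mcgrp_name(unsigned int idx)
{
	if (idx >= DEMO_ARRAY_SIZE(demo_event_mcgrps))
		return NULL;
	return demo_event_mcgrps[idx].name;
}

/* Return the index of the group with the given name, or -1 if unknown. */
static int demo_mcgrp_index(const char *name)
{
	for (size_t i = 0; i < DEMO_ARRAY_SIZE(demo_event_mcgrps); i++) {
		if (strcmp(demo_event_mcgrps[i].name, name) == 0)
			return (int)i;
	}
	return -1;
}
```

Because the slots are keyed by enum value rather than by position, reordering the initializers cannot silently swap the correctable and uncorrectable groups.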
* RE: [RFC 4/5] drm/netlink: Define multicast groups
  2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
@ 2023-10-20 20:39   ` Ruhl, Michael J
  0 siblings, 0 replies; 20+ messages in thread
From: Ruhl, Michael J @ 2023-10-20 20:39 UTC (permalink / raw)
  To: Aravind Iddamsetty, intel-xe, dri-devel, alexander.deucher,
	airlied, daniel, joonas.lahtinen, ogabbay, Tayar,
	Tomer (Habana), Hawking.Zhang, Harish.Kasiviswanathan,
	Felix.Kuehling, Luben.Tuikov

>-----Original Message-----
>From: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>Sent: Friday, October 20, 2023 11:59 AM
>To: intel-xe@lists.freedesktop.org; dri-devel@lists.freedesktop.org;
>alexander.deucher@amd.com; airlied@gmail.com; daniel@ffwll.ch;
>joonas.lahtinen@linux.intel.com; ogabbay@kernel.org; Tayar, Tomer (Habana)
><ttayar@habana.ai>; Hawking.Zhang@amd.com;
>Harish.Kasiviswanathan@amd.com; Felix.Kuehling@amd.com;
>Luben.Tuikov@amd.com; Ruhl, Michael J <michael.j.ruhl@intel.com>
>Subject: [RFC 4/5] drm/netlink: Define multicast groups
>
>Netlink subsystem supports event notifications to userspace. we define
>two multicast groups for correctable and uncorrectable errors to which
>userspace can subscribe and be notified when any of those errors happen.
>The group names are local to the driver's genl netlink family.

Hi Aravind,

This looks reasonable to me.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>

M

>Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
>---
> drivers/gpu/drm/drm_netlink.c  | 7 +++++++
> include/drm/drm_netlink.h      | 5 +++++
> include/uapi/drm/drm_netlink.h | 4 ++++
> 3 files changed, 16 insertions(+)
>
>diff --git a/drivers/gpu/drm/drm_netlink.c b/drivers/gpu/drm/drm_netlink.c
>index 8add249c1da3..425a7355a573 100644
>--- a/drivers/gpu/drm/drm_netlink.c
>+++ b/drivers/gpu/drm/drm_netlink.c
>@@ -12,6 +12,11 @@
>
> DEFINE_XARRAY(drm_dev_xarray);
>
>+static const struct genl_multicast_group drm_event_mcgrps[] = {
>+	[DRM_GENL_MCAST_CORR_ERR] = { .name =
>DRM_GENL_MCAST_GROUP_NAME_CORR_ERR, },
>+	[DRM_GENL_MCAST_UNCORR_ERR] = { .name =
>DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR, },
>+};
>+
> /**
>  * drm_genl_reply - response to a request
>  * @msg: socket buffer
>@@ -133,6 +138,8 @@ static void drm_genl_family_init(struct drm_device
>*dev)
> 	dev->drm_genl_family.ops = drm_genl_ops;
> 	dev->drm_genl_family.n_ops = ARRAY_SIZE(drm_genl_ops);
> 	dev->drm_genl_family.maxattr = DRM_ATTR_MAX;
>+	dev->drm_genl_family.mcgrps = drm_event_mcgrps;
>+	dev->drm_genl_family.n_mcgrps = ARRAY_SIZE(drm_event_mcgrps);
> 	dev->drm_genl_family.module = dev->dev->driver->owner;
> }
>
>diff --git a/include/drm/drm_netlink.h b/include/drm/drm_netlink.h
>index 54527dae7847..758239643c17 100644
>--- a/include/drm/drm_netlink.h
>+++ b/include/drm/drm_netlink.h
>@@ -13,6 +13,11 @@
>
> struct drm_device;
>
>+enum mcgrps_events {
>+	DRM_GENL_MCAST_CORR_ERR,
>+	DRM_GENL_MCAST_UNCORR_ERR,
>+};
>+
> struct driver_genl_ops {
> 	int	(*doit)(struct drm_device *dev,
> 			struct sk_buff *skb,
>diff --git a/include/uapi/drm/drm_netlink.h b/include/uapi/drm/drm_netlink.h
>index aab42147a20e..c7a0ce5b4624 100644
>--- a/include/uapi/drm/drm_netlink.h
>+++ b/include/uapi/drm/drm_netlink.h
>@@ -26,6 +26,8 @@
> #define _DRM_NETLINK_H_
>
> #define DRM_GENL_VERSION 1
>+#define DRM_GENL_MCAST_GROUP_NAME_CORR_ERR	"drm_corr_err"
>+#define DRM_GENL_MCAST_GROUP_NAME_UNCORR_ERR
>	"drm_uncorr_err"
>
> #if defined(__cplusplus)
> extern "C" {
>@@ -43,6 +45,8 @@ enum drm_genl_error_cmds {
> 	DRM_RAS_CMD_READ_ONE,
> 	/** @DRM_RAS_CMD_READ_ALL: Command to get counters of all
>errors */
> 	DRM_RAS_CMD_READ_ALL,
>+	/** @DRM_RAS_CMD_ERROR_EVENT: Command sent as part of
>multicast event */
>+	DRM_RAS_CMD_ERROR_EVENT,
>
> 	__DRM_CMD_MAX,
> 	DRM_CMD_MAX = __DRM_CMD_MAX - 1,
>--
>2.25.1
end of thread, other threads:[~2023-10-20 20:39 UTC | newest]

Thread overview: 20+ messages
2023-05-26 16:20 [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 1/5] drm/netlink: Add netlink infrastructure Aravind Iddamsetty
2023-06-04 17:07 ` [Intel-xe] " Tomer Tayar
2023-06-05 17:18 ` Iddamsetty, Aravind
2023-06-06 14:04 ` Tomer Tayar
2023-06-21  6:40 ` Iddamsetty, Aravind
2023-05-26 16:20 ` [RFC 2/5] drm/xe/RAS: Register a genl netlink family Aravind Iddamsetty
2023-06-04 17:09 ` [Intel-xe] " Tomer Tayar
2023-06-05 17:21 ` Iddamsetty, Aravind
2023-05-26 16:20 ` [RFC 3/5] drm/xe/RAS: Expose the error counters Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 4/5] drm/netlink: define multicast groups Aravind Iddamsetty
2023-05-26 16:20 ` [RFC 5/5] drm/xe/RAS: send multicast event on occurrence of an error Aravind Iddamsetty
2023-06-04 17:07 ` [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem Tomer Tayar
2023-06-05 17:17 ` Iddamsetty, Aravind
2023-06-05 16:47 ` Alex Deucher
2023-06-06 11:56 ` Iddamsetty, Aravind
2023-06-21 17:24 ` Sebastian Wick
2023-07-17 12:02 ` Oded Gabbay
2023-10-20 15:58 [RFC v4 " Aravind Iddamsetty
2023-10-20 15:58 ` [RFC 4/5] drm/netlink: Define multicast groups Aravind Iddamsetty
2023-10-20 20:39 ` Ruhl, Michael J